Artificial intelligence agents are increasingly expected to navigate uncertainty, but their ability to ask the right questions often lags behind their capacity to answer them. A new study from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences (SEAS) explores this gap by transforming the timeless game of Battleship into an experimental platform for evaluating machine inquiry.
From Battleship to AI Research: A Twist on a Classic Game
The researchers reimagined Battleship as a cooperative challenge where one AI acts as a "captain" that must locate hidden ships by asking strategic questions, while another plays the "spotter" and answers those queries in real time. Unlike traditional Battleship, where players guess coordinates directly, this version relies entirely on natural language to probe the spotter’s knowledge. To build a benchmark, the team first had over 40 human participants play the game, recording their questions and yes-no responses to create the BattleshipQA dataset. This human baseline served as a reference point when testing state-of-the-art language models (LMs), including GPT-5 and smaller models like Llama 4 Scout.
The results were mixed: top-tier LMs could complete the game in fewer turns than humans, but smaller models struggled, falling into irrational questioning patterns. The core challenge wasn't interpreting the spotter's replies; it was formulating questions that extract the most useful information. To address this, the researchers introduced a Monte Carlo inference strategy, which re-estimates the likelihood of each possible ship configuration after every answer, allowing the captain to refine its inquiries dynamically.
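The idea behind this kind of Monte Carlo inference can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the 4x4 grid, the two-ship fleet, the candidate questions, and every function name here are assumptions made for illustration. The sketch samples boards consistent with what has been observed, then scores each yes/no question by the entropy of its answer across those samples. For a spotter who answers truthfully, that entropy equals the expected information gain.

```python
import math
import random

SIZE = 4                # toy grid size (assumption, not the study's board)
SHIP_LENGTHS = (2, 3)   # hypothetical fleet


def sample_board(rng):
    """Place each ship at random without overlap; return the occupied cells."""
    occupied = set()
    for length in SHIP_LENGTHS:
        while True:
            horizontal = rng.random() < 0.5
            r = rng.randrange(SIZE if horizontal else SIZE - length + 1)
            c = rng.randrange(SIZE - length + 1 if horizontal else SIZE)
            cells = {(r, c + i) if horizontal else (r + i, c) for i in range(length)}
            if not cells & occupied:
                occupied |= cells
                break
    return frozenset(occupied)


def sample_hypotheses(n, hits, misses, rng, max_tries=100_000):
    """Rejection sampling: keep random boards that agree with the observations."""
    kept = []
    for _ in range(max_tries):
        board = sample_board(rng)
        if hits <= board and not (misses & board):
            kept.append(board)
            if len(kept) == n:
                break
    return kept


def binary_entropy(p):
    """Entropy in bits of a yes/no answer that is 'yes' with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(question, hypotheses):
    """Estimate the information (in bits) a truthful yes/no answer would provide."""
    p_yes = sum(question(b) for b in hypotheses) / len(hypotheses)
    return binary_entropy(p_yes)


# Hypothetical candidate questions, each written as a predicate over a board.
QUESTIONS = {
    "Is any ship in the top row?": lambda b: any(r == 0 for r, _ in b),
    "Is any ship in the left half?": lambda b: any(c < SIZE // 2 for _, c in b),
    "Is cell (1, 1) occupied?": lambda b: (1, 1) in b,
}

rng = random.Random(0)
hypotheses = sample_hypotheses(500, hits={(0, 0)}, misses={(3, 3)}, rng=rng)
scores = {q: expected_information_gain(f, hypotheses) for q, f in QUESTIONS.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Because cell (0, 0) is already a known hit, every sampled board answers "yes" to the top-row question, so that question scores zero bits. This is the kind of redundant query that such a strategy is meant to filter out.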
Scaling Down AI’s Question-Asking Gap
The refinements paid off dramatically for smaller models. Llama 4 Scout, a relatively compact LM, initially outperformed humans in only 8% of games. After integrating the Monte Carlo strategy, its win rate soared to 82%. Even more striking, the optimized model surpassed GPT-5 in efficiency while operating at just 1% of the larger model’s computational cost. This efficiency suggests that cost-effective AI systems can match or exceed the performance of frontier models when equipped with the right reasoning tools.
Beyond Battleship, the team tested the approach in another deduction game: "Guess Who?" Here, models narrowed down 100 hidden characters by asking targeted questions. Llama 4 Scout’s success rate jumped from 30% to over 72%, while GPT-4o improved from 62% to 90%. The consistent gains across both games indicate that the Monte Carlo strategy enhances an AI’s ability to explore information spaces systematically, rather than relying on brute-force guessing.
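To see why targeted questions beat brute-force guessing in a game like this, consider a toy version in Python. The character traits, the way the 100 characters are generated, and the function names are all hypothetical, and the sketch uses exact elimination over a known roster rather than the Monte Carlo sampling described in the study. At each turn it asks about the trait that splits the remaining candidates most evenly.

```python
from itertools import product

# Hypothetical binary traits; 7 traits give 128 unique combinations.
TRAITS = ("hat", "glasses", "beard", "long_hair", "earrings", "smiling", "freckles")

# Toy roster: the first 100 unique trait combinations.
CHARACTERS = [
    dict(zip(TRAITS, bits))
    for bits in product((False, True), repeat=len(TRAITS))
][:100]


def best_trait(candidates):
    """Pick the trait whose yes/no answer splits the candidates most evenly."""
    n = len(candidates)
    return min(TRAITS, key=lambda t: abs(2 * sum(c[t] for c in candidates) - n))


def identify(secret, characters=CHARACTERS):
    """Ask trait questions until one candidate remains; return it and the count."""
    candidates = list(characters)
    asked = 0
    while len(candidates) > 1:
        trait = best_trait(candidates)
        answer = secret[trait]  # a truthful yes/no reply about the secret character
        candidates = [c for c in candidates if c[trait] == answer]
        asked += 1
    return candidates[0], asked


found, questions_used = identify(CHARACTERS[42])
print(found == CHARACTERS[42], questions_used)
```

In this toy roster no character needs more than seven questions, since each trait is asked at most once, whereas guessing characters one at a time could take up to 99 guesses.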
Turning Questions into Code for Sharper Answers
The researchers also discovered that LMs could answer questions more accurately when inquiries were translated into executable code. For example, a question like "Is there a ship in column one that spans two rows?" was converted into Python commands instructing the spotter to search a specific grid area and verify the ship’s dimensions. This auto-formalization process—where natural language is transformed into structured instructions—led to measurable improvements. GPT-4o-mini saw a 30% performance boost, while even the larger Claude 4 Opus model gained eight percentage points in accuracy.
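A translated question of this kind might look something like the following Python sketch. The board encoding (0 for water, positive integers as ship IDs), the function names, and the reading of "column one" as the first column are assumptions for illustration, not the code the study's models produced.

```python
# Toy board: 0 = water, other integers are ship IDs (hypothetical encoding).
BOARD = [
    [1, 0, 0, 0],
    [1, 0, 2, 2],
    [0, 0, 0, 0],
    [0, 3, 3, 3],
]


def ship_cells(board, ship_id):
    """Return every (row, column) cell occupied by the given ship."""
    return [
        (r, c)
        for r, row in enumerate(board)
        for c, value in enumerate(row)
        if value == ship_id
    ]


def ship_in_column_spanning_rows(board, column, n_rows):
    """Is there a ship lying entirely in `column` that covers exactly `n_rows` rows?"""
    ship_ids = {row[column] for row in board} - {0}
    for ship_id in ship_ids:
        cells = ship_cells(board, ship_id)
        rows = {r for r, _ in cells}
        cols = {c for _, c in cells}
        if cols == {column} and len(rows) == n_rows:
            return True
    return False


# "Is there a ship in column one that spans two rows?" (column one read as index 0)
print(ship_in_column_spanning_rows(BOARD, column=0, n_rows=2))
```

Running a check like this against the actual board replaces a judgment call about the grid with a deterministic lookup, which is one plausible reason formalized questions are answered more reliably.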
The technique mirrors broader trends in AI where language models generate code to validate their own solutions. "Auto-formalization has shown promise in verifying answers," says Jacob Andreas, an MIT associate professor and CSAIL principal investigator. "But what excites us is using it to improve how models explore and gather information from the start. This could unlock new capabilities in coding, scientific discovery, and beyond."
What’s Next for AI Asking and Learning
While the results are promising, the work highlights persistent challenges. Even optimized models sometimes struggle to balance efficiency with thoroughness, particularly in games requiring nuanced deductions. The researchers plan to expand their approach beyond board games into domains like medical diagnosis and software engineering, where AI agents must probe uncertain environments to make informed decisions. Their ultimate goal is to develop systems that don’t just answer questions but learn to ask the right ones first.
For now, the Battleship-inspired experiments offer a clear takeaway: better questioning isn't just about bigger models; it's about smarter strategies. And as AI continues to evolve, the ability to ask probing, targeted questions may prove just as valuable as the answers themselves.
AI summary
Researchers at MIT and Harvard redesigned the game Battleship to help language models ask more effective questions in uncertain environments. Read on for details of the study, in which even small models achieved major gains.