Reverse engineering reveals cognitive gaps in current AI systems

As interest intensifies around the use of large language models (LLMs) in automating research and scientific workflows, a new study has posed a critical question: how effective are LLMs at emulating the fundamental process of reverse engineering in science? The paper, titled “Are Large Language Models Reliable AI Scientists? Assessing Reverse-Engineering of Black-Box Systems”, published on arXiv by researchers at Princeton University, investigates the capacity of LLMs to infer hidden mechanisms behind systems through observation and experimentation - core capabilities needed for scientific discovery.
The authors tested state-of-the-art LLMs, including GPT-4o, on three types of black-box systems representing symbolic programming, formal language generation, and mathematical modeling. These controlled environments emulate tasks that scientists routinely face: observing patterns, hypothesizing rules, and designing experiments. The goal was to evaluate whether LLMs can passively learn system logic or must engage actively to deduce it.
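To make the three black-box families concrete, here is a set of hypothetical Python stand-ins. The rules and function names below are invented for illustration and are not the systems used in the study; they only convey the flavor of each domain.

```python
# Hypothetical stand-ins for the three black-box families described above.
# These rules are invented for illustration; the study's actual systems are
# defined in the paper, not reproduced here.

def program_black_box(xs: list[int]) -> list[int]:
    """Programs domain: a hidden list-mapping rule (here: keep even items, doubled)."""
    return [x * 2 for x in xs if x % 2 == 0]

def formal_language_black_box(s: str) -> bool:
    """Formal Languages domain: a hidden grammar (here: accept strings of the form a^n b^n)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n

def math_black_box(x: float) -> float:
    """Math Equations domain: a hidden equation (here: a fixed quadratic)."""
    return 3 * x**2 - 2 * x + 1

# The model never sees the source code, only input-output pairs:
print(program_black_box([1, 2, 3, 4]))    # [4, 8]
print(formal_language_black_box("aabb"))  # True
print(math_black_box(2.0))                # 9.0
```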
Their findings are unequivocal. LLMs falter when merely observing data without intervention. Only when allowed to design and execute queries, mimicking an experimenter’s behavior, do they significantly improve in identifying the system's logic. Yet even then, performance still lags behind ideal Bayesian inference models, suggesting limitations in the scientific reasoning capacity of current AI systems.
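The Bayesian comparison point can be pictured with a small sketch: an idealized learner that keeps an explicit posterior over candidate rules and discards any rule contradicted by the data. This is only an illustration of the idea, assuming a finite, hand-written hypothesis space; the paper's actual benchmark may be constructed differently.

```python
# Minimal sketch of an idealized Bayesian learner over a finite hypothesis space.
# The hypotheses and data below are invented for illustration.

def bayesian_update(hypotheses, prior, observations):
    """Exact posterior over deterministic candidate rules: any rule that
    contradicts an observation gets probability zero, the rest are renormalized."""
    posterior = dict(prior)
    for x, y in observations:
        for name, rule in hypotheses.items():
            if rule(x) != y:
                posterior[name] = 0.0
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()} if total else posterior

hypotheses = {
    "double_all":    lambda xs: [x * 2 for x in xs],
    "keep_evens":    lambda xs: [x for x in xs if x % 2 == 0],
    "evens_doubled": lambda xs: [x * 2 for x in xs if x % 2 == 0],
}
prior = {name: 1 / len(hypotheses) for name in hypotheses}

# A single well-chosen observation is enough to isolate the true rule here.
posterior = bayesian_update(hypotheses, prior, [([1, 2, 3, 4], [4, 8])])
print(posterior)  # {'double_all': 0.0, 'keep_evens': 0.0, 'evens_doubled': 1.0}
```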
Do LLMs learn better by intervening rather than observing?
To assess learning ability, the researchers compared two settings: passive observation (feeding pre-collected input-output examples to the model) versus active intervention (allowing the model to generate new inputs and observe the resulting outputs). In all three task domains (Programs, Formal Languages, and Math Equations), LLMs such as GPT-4o and Claude-3.5 plateaued quickly when given only observational data.
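A minimal sketch of the two settings, assuming a hidden `black_box(x)` callable and a generic `llm(prompt)` helper that returns text; both names are placeholders rather than the paper's actual interface, and parsing of the model's replies is elided.

```python
# Sketch of the two learning settings. `black_box` and `llm` are placeholders
# supplied by the caller; this is not the paper's experimental code.

def passive_observation(black_box, fixed_inputs, llm):
    """Passive setting: the model sees only a pre-collected batch of input-output pairs."""
    observations = [(x, black_box(x)) for x in fixed_inputs]
    return llm(
        f"Observations: {observations}\n"
        "State the rule that maps each input to its output."
    )

def active_intervention(black_box, llm, rounds=20):
    """Active setting: the model designs its own queries, sees each result, and iterates."""
    observations = []
    for _ in range(rounds):
        # The model chooses the next experiment...
        x = llm(
            f"Observations so far: {observations}\n"
            "Propose ONE new input that would best reveal the hidden rule."
        )
        # ...and the black box answers it. (Parsing the reply into a valid
        # input is omitted here for brevity.)
        observations.append((x, black_box(x)))
    return llm(
        f"Observations: {observations}\n"
        "State the rule that maps each input to its output."
    )
```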
On the other hand, when these models were allowed to intervene by submitting their own queries to the black box, performance improved markedly. For instance, GPT-4o's scores in identifying formal language rules jumped substantially after 20 rounds of interventions. However, even with the additional data collected through interventions, the gains often failed to generalize, and Bayesian benchmarks, which reflect theoretically optimal inference, continued to outperform LLMs given similar or fewer data points.
This suggests that the act of generating the intervention plays a vital role in learning. Passive models presented with high-quality data collected by other models did not benefit nearly as much - a phenomenon mirroring human learning studies, where active learners outperform those who receive the same data passively.
What are the core weaknesses holding LLMs back?
Researchers identified two consistent failure modes in LLM reasoning: overcomplication and overlooking. In the overcomplication mode, the model constructs unnecessarily complex rules, misreading simple data patterns; this was particularly common in the list-mapping tasks (Programs). In the overlooking mode, by contrast, observed most often in Math Equations, LLMs failed to integrate crucial details from the data and settled for vague or generic hypotheses.
Allowing models to intervene helped reduce these issues. Intervention prompted models to test edge cases, revise faulty assumptions, and gradually converge on more accurate interpretations. Yet the complexity of the task affected outcomes: more intricate systems saw smaller gains from intervention, especially when cognitive loads increased.
The study also explored different reasoning strategies to enhance interventions. When LLMs verbalized their reasoning before generating queries, akin to chain-of-thought prompting, performance improved. The most effective strategy was “Analyze-then-Query,” where models reflected on previous observations before designing experiments.
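A rough sketch of what an Analyze-then-Query step could look like, again assuming a generic `llm(prompt)` helper; the prompt wording here is illustrative, not the paper's.

```python
# Illustrative Analyze-then-Query step: reflect on the data first, then let the
# analysis drive the next experiment. Prompt text is invented for illustration.

def analyze_then_query(observations, llm):
    # Step 1: analyze what the observations so far do and do not rule out.
    analysis = llm(
        f"Observations so far: {observations}\n"
        "Which candidate rules remain consistent with these observations, "
        "and which have been ruled out?"
    )
    # Step 2: design the next query conditioned on that analysis.
    next_query = llm(
        f"Analysis: {analysis}\n"
        "Propose ONE new input whose output would best discriminate between "
        "the remaining candidate rules."
    )
    return analysis, next_query
```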
Interestingly, structured reasoning strategies common in formal tasks did not consistently help with reverse engineering, highlighting a difference between formal logic tasks and scientific hypothesis testing.
Can intervention knowledge transfer between models?
A key practical question the study addressed is whether one LLM's experimental data can benefit another model - a scenario relevant to collaborative AI. The answer appears to be no. When intervention data from GPT-4o was transferred to Llama-3.3-70B as passive input, performance increased only marginally and still fell short of models that generated their own interventions.
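The transfer setting can be pictured as follows: one model actively gathers the data while a second model only consumes it passively. As before, `llm_a`, `llm_b`, and `black_box` are placeholders, not the paper's code.

```python
# Sketch of the transfer experiment: model A designs the interventions,
# model B receives the resulting data passively. All names are placeholders.

def transfer_interventions(black_box, llm_a, llm_b, rounds=20):
    observations = []
    for _ in range(rounds):
        # Model A (e.g. GPT-4o in the study) chooses each query itself...
        x = llm_a(
            f"Observations so far: {observations}\n"
            "Propose ONE informative new input."
        )
        observations.append((x, black_box(x)))
    # ...while model B (e.g. Llama-3.3-70B) only reads the collected pairs.
    return llm_b(
        f"Observations: {observations}\n"
        "State the rule that maps each input to its output."
    )
```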
This suggests that the learning benefit comes not just from better data but from the act of designing and testing hypotheses - a deeply cognitive process that cannot be offloaded. The implication is significant: for AI scientists to collaborate or share knowledge effectively, more sophisticated mechanisms of communication and contextual transfer must be built.
First published in: Devdiscourse