Don’t Tell God What to Do
"God does not play dice," Albert Einstein famously bellowed at the burgeoning field of quantum mechanics at the historic 1927 Solvay Conference. It was a cry of principled resistance. Einstein was not merely being stubborn; he was defending a worldview where the universe was deterministic, knowable, and local. He spent the latter half of his life searching for "hidden variables" that would restore order to the chaos of probability.
Niels Bohr’s retort was less famous but perhaps more important: "Einstein, stop telling God what to do."
We humans have difficulty dealing with heuristic sciences. We like our world to follow models that we can map in our minds, where we can arrive at theories and equations using pen and paper, the way Newton did with gravity or Einstein with relativity. When a supposedly predictable, deterministic science betrays that faith and turns probabilistic—operating in dimensions our minds cannot intuit—the sharpest of our rational thinkers begin their rebellions. We do not expect determinism in many sciences, in the patterns of DNA, say, or in the details of molecular structures, but a human-intuited logico-rational framework had worked for the study of motion for centuries, and our Einsteins wanted it to work forever. In the twentieth century, Bell's theorem and Aspect's experiments eventually proved Einstein wrong; nature is indeed "spooky."
Quietly but surely, we are witnessing a similar historical rhyme in "AI". Computing was supposed to be deterministic. The code was supposed to be explicit. And then came the transformer, and with it a form of capability that emerges from statistical patterns in ways that no one can fully explain. Undeterred by this inexplicability, the rationalists march on, predicting what could, or rather could not, come next. This has produced, over the past five years, an extraordinary parade of confident predictions about what AI fundamentally cannot do, many from the most distinguished minds in adjacent fields. What we have also witnessed, with remarkable consistency, is these predictions being falsified, often within months of their articulation.
As we enter 2026, with a new season of prediction upon us, it seems worth documenting what has actually happened. Not to mock anyone. The experts who made these predictions are brilliant people working from serious theoretical frameworks. Einstein was wrong about quantum mechanics, and he was still Einstein. But the pattern matters. When falsifiable claims keep falling, the response is often to retreat into unfalsifiable territory, using words like "understanding" or "AGI". This note is not about making more claims, falsifiable or otherwise. We simply want to note that something important is being revealed about the nature of this technology, or what we at GenInnov have always deemed a mathematical revolution, and about the limits of our ability to predict it (for those brave enough to wade through pretentiously philosophical verbiage, we proudly re-present our epistemic view of GenAI here and the Ontic view here).
Looking back, 2025 was a year that nobody predicted. It began with DeepSeek demonstrating that a Chinese lab could match frontier American capabilities at a fraction of the cost. It ended with AI systems achieving gold-medal performance at the International Mathematical Olympiad and making genuine discoveries in mathematics. Very little of what happened, in terms of capability, of which players made the breakthroughs, of the methods used, was foreseeable twelve months earlier. Look at the details and one lesson keeps repeating: whatever our intuitive reason, let alone history, leads us to expect, the "walls" erected by our thinking (and, in many cases, our wishful thinking) refuse to hold firm against the onslaught of developments.
The Impossible Languages: Noam Chomsky and Statistical Learning
Noam Chomsky is among the most influential intellectuals of the twentieth century. His theory of Universal Grammar, the idea that humans possess an innate biological structure for language, has shaped linguistics for sixty years. In March 2023, Chomsky co-authored an essay in The New York Times dismissing large language models, which he has elsewhere derided as "high-tech plagiarism." His argument, in effect: human language acquisition relies on an innate, biological "Universal Grammar." Because LLMs learn from statistical patterns rather than innate structure, they would be "incapable of distinguishing the possible from the impossible" in human language.
Chomsky's framework made a testable prediction. If, as his theory holds, we follow hierarchical syntactic rules that are biologically constrained, then there must be "impossible languages," languages that violate those constraints (for instance, languages whose grammar depends on counting word positions). If LLMs truly lacked innate linguistic structure, they should learn impossible languages just as easily as natural ones, unlike us humans.
In 2024, researchers directly tested this claim. The paper "Mission: Impossible Language Models" trained transformer models on synthetic impossible languages and compared their learning curves with those on natural language. The finding was striking: the models learned the impossible languages more slowly and with significantly more difficulty than the natural ones. They showed inductive biases toward the hierarchical structures Chomsky claimed were exclusive to biological organisms. The "blank slate" critique appeared to be empirically incorrect; these models exhibit preferences for human-like linguistic patterns.
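For readers who want a concrete sense of the experimental design, here is a minimal sketch in Python. It is not the paper's code: the perturbation functions below are simplified analogues of the kinds of "impossible" transformations the authors used (shuffles, reversals, count-based rules), and the training-and-comparison step is only described in comments.

```python
# Illustrative sketch (not the paper's code): construct "impossible" counterparts
# of a natural-language corpus, then train identical models on each and compare
# learning curves. Only the corpus transformations are shown here.
import random

def shuffle_sentence(tokens, seed=0):
    """Deterministically shuffle word order -- destroys hierarchical syntax."""
    rng = random.Random(seed)
    out = tokens[:]
    rng.shuffle(out)
    return out

def partial_reverse(tokens):
    """Reverse the second half of the sentence -- an unnatural global rule."""
    mid = len(tokens) // 2
    return tokens[:mid] + tokens[mid:][::-1]

def count_based_marker(tokens, marker="NEG", k=4):
    """Insert a marker exactly k positions in: grammar by counting,
    a rule attested in no natural language."""
    return tokens[:k] + [marker] + tokens[k:]

sentence = "the cat that the dog chased ran away".split()
for f in (shuffle_sentence, partial_reverse, count_based_marker):
    print(f.__name__, "->", " ".join(f(sentence)))

# In the full experiment, one would train identical small transformers on the
# natural corpus and on each perturbed corpus, then plot per-token perplexity
# against training steps; the reported result is that the "impossible" variants
# are learned more slowly.
```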
The debate continues, with some subsequent research finding more mixed results. But the absolute version of Chomsky's claim, that LLMs "by design" cannot distinguish possible from impossible, has been challenged by experimental evidence. The response from the nativist camp has been, predictably, to shift the goalposts from syntax to semantics, from statistical competence to "referential intent" and "genuine understanding." These may be important philosophical questions. But they are harder to falsify than the original linguistic prediction.
The Ladder of Causation: LLMs’ Counterfactual Abilities
Judea Pearl, the Turing Award laureate who formalised the mathematics of cause and effect, offered perhaps the most rigorous theoretical critique of deep learning. His critique was mathematically elegant: deep learning, he argued, was stuck on the first rung of the "Ladder of Causation." It could see associations (smoke implies fire), but it could not do interventions (what happens if I ban cigarettes?) or counterfactuals (what would have happened if I hadn’t smoked?).
Pearl posited that you cannot derive cause and effect from observational data alone (a digression: see my personal reservations about Professor Pearl's claim that causality detection requires a biological brain in this review of his work from 2018). "Data are dumb," he famously noted. Without an explicit causal graph—a map of how the world works—a neural network is just fitting curves to history.
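A toy structural causal model, with entirely made-up numbers, illustrates what the first two rungs mean in practice: conditioning on an observation (rung one) and intervening on a variable (rung two) give different answers whenever a confounder is present.

```python
# Toy structural causal model (illustrative only): a confounder Z drives both
# "smoking" X and "disease" Y, alongside a genuine causal effect of X on Y.
# Conditioning (rung 1) and intervening (rung 2) give different answers.
import random

random.seed(0)

def sample(do_x=None):
    z = random.random() < 0.5                 # hidden confounder
    x = do_x if do_x is not None else (random.random() < (0.8 if z else 0.2))
    p_y = 0.1 + 0.3 * x + 0.4 * z             # Y depends on both X and Z
    y = random.random() < p_y
    return x, y

def estimate(cond, n=200_000, **kw):
    draws = [sample(**kw) for _ in range(n)]
    kept = [y for x, y in draws if cond(x)]
    return sum(kept) / len(kept)

# Rung 1 -- association: P(Y=1 | X=1), inflated by the confounder.
print("P(Y|X=1)     ~", round(estimate(lambda x: x), 3))
# Rung 2 -- intervention: P(Y=1 | do(X=1)), the true causal effect.
print("P(Y|do(X=1)) ~", round(estimate(lambda x: True, do_x=True), 3))
```

Pearl's point was that nothing in the observational data alone tells you which of the two numbers is the causal one; that, he argued, requires the graph.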
Then came GPT-4, and with it, a rare moment of intellectual concession. In late 2023, Professor Pearl admitted he had to "reconsider" his proof, but with the caveat that the error wasn't in the math. He hadn't accounted for the fact that human text is not merely observational; it is a repository of our causal models.
When a model reads the internet, it isn't just seeing that "smoke" sits next to "fire." It is reading millions of sentences explaining why they sit together. The machine didn't need to be hard-coded with the laws of causality; it simply read the manual. While Pearl still argues for neuro-symbolic approaches for the sake of scientific precision, the "impossibility" wall, that statistical systems are structurally incapable of causal reasoning, has been softened. The "curve" these systems fitted was the curve of human reason itself.
The Parrot That Played Othello
The stickiest metaphor of the last five years was undoubtedly the "Stochastic Parrot." Coined by Emily Bender and colleagues, the term held that Large Language Models were merely stitching together linguistic forms based on probability, with zero understanding of the meaning behind them. It has remained popular in the general media ever since.
The critique was that these models had no "world model"—no internal map of reality to ground their words. They were sophisticated mimics, playing a game of statistical solitaire.
This claim met its match in a 64-square board game. Researchers trained a small transformer model solely on the text transcripts of Othello games—sequences of moves like "E3, D3, C4." The model never saw a board. It was never told the rules. If the parrot hypothesis were true, it would simply memorize likely sequences of characters.
Instead, probes into the model's neural activations revealed something startling: it had spontaneously constructed an accurate, geometry-preserving representation of the Othello board inside its layers. To predict the next text token, the most efficient method was to actually "understand" the state of the game.
When researchers interfered with this internal model—flipping a "virtual" piece in the machine's mind—the machine's output changed to reflect the new board state. This wasn't parroting; it was simulation. The "parrot" had built a world model simply by listening to enough descriptions of it. The stochastic parrot critique has since evolved. Bender, speaking at Harvey Mudd in late 2024, maintained that "when LLM output is correct, that is just by chance. You might as well be asking a Magic 8 ball." But at this point the claims take the form of assertions one expects from spiritual gurus rather than scientists.
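The probing-and-intervention methodology is easy to sketch. The code below is illustrative only: it uses synthetic vectors in place of Othello-GPT's actual activations, but the two steps, fitting a linear probe to read a latent state out of hidden vectors and then editing a hidden vector along the probe direction to flip that state, are the same in spirit as the published experiments.

```python
# Illustrative sketch of the probing methodology (synthetic data, not Othello-GPT):
# 1) fit a linear probe that reads a latent "board" feature out of hidden states,
# 2) "intervene" by editing a hidden state along the probe direction and checking
#    that the readout flips accordingly.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 5000
w_true = rng.normal(size=d)                      # direction encoding one square's state
feature = rng.integers(0, 2, size=n)             # 0 = one colour, 1 = the other
hidden = rng.normal(size=(n, d)) + np.outer(feature * 2 - 1, w_true)

# -- Probe: least-squares linear classifier on the hidden states.
X = np.hstack([hidden, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X, feature * 2 - 1, rcond=None)
probe_dir, bias = coef[:-1], coef[-1]
acc = np.mean((hidden @ probe_dir + bias > 0) == (feature == 1))
print(f"probe accuracy: {acc:.3f}")              # ~1.0: the state is linearly readable

# -- Intervention: push one activation across the probe's decision boundary.
h = hidden[0].copy()
before = hidden[0] @ probe_dir + bias > 0
h -= 2 * (h @ probe_dir + bias) / (probe_dir @ probe_dir) * probe_dir  # reflect across boundary
after = h @ probe_dir + bias > 0
print("readout before/after intervention:", before, after)
```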
The Fortress of Abstraction: Where No Benchmark Is Safe
For years, the ultimate test of "fluid intelligence" was the ARC-AGI benchmark, developed by François Chollet. Unlike the Bar Exam or the SATs, which could arguably be passed by memorization, ARC consisted of novel visual puzzles that required learning a new rule from just two or three examples. It was designed to be the unbeatable fortress. For five years, while LLMs conquered every other standardized test, they struggled to score above 50% on ARC; humans easily scored around 85%. This gap was cited repeatedly as proof that LLMs were merely "approximate retrievers," incapable of genuine on-the-fly abstraction. In late 2024, that wall fell. OpenAI's o3 model achieved a reported 87.5% on the benchmark.
Crucially, the model did not require a new "neuro-symbolic" architecture or a biological brain. It utilized massive test-time compute to explore the search space of possible programs, effectively finding the abstraction through scale and search. Chollet, displaying the intellectual honesty that science demands, acknowledged the breakthrough.
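How o3 actually searches is not public, so the sketch below should be read only as an illustration of the general recipe of test-time search: generate many candidate programs, keep the ones consistent with a task's worked examples, and apply the survivor to the test input. The grid "task" and the five-primitive DSL here are invented for the example; real ARC tasks are far richer, and in the frontier systems the candidates are proposed by the model itself rather than enumerated.

```python
# Toy illustration of test-time search (not OpenAI's method): enumerate candidate
# programs over a tiny DSL of grid operations and keep one consistent with the
# task's example pairs -- the "abstraction" is found by search.
from itertools import product

import numpy as np

PRIMS = {
    "identity":  lambda g: g,
    "rot90":     lambda g: np.rot90(g),
    "flip_lr":   lambda g: np.fliplr(g),
    "flip_ud":   lambda g: np.flipud(g),
    "transpose": lambda g: g.T,
}

def search(train_pairs, depth=2):
    """Return the first composition of primitives that maps every input to its output."""
    for names in product(PRIMS, repeat=depth):
        def program(g, names=names):
            for name in names:
                g = PRIMS[name](g)
            return g
        if all(np.array_equal(program(x), y) for x, y in train_pairs):
            return names, program
    return None, None

# Hypothetical ARC-style task: the hidden rule is "rotate 90 degrees, then mirror".
def rule(g):
    return np.fliplr(np.rot90(g))

train = [(g, rule(g)) for g in (np.arange(4).reshape(2, 2), np.arange(9).reshape(3, 3))]
names, program = search(train)
print("found program:", names)
print(program(np.arange(6).reshape(2, 3)))       # apply the discovered rule to a new grid
```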
The Mathematical Frontier: From "Brute Force" to Olympic Gold
If any domain seemed resistant to statistical pattern matching, it was mathematics. Mathematical proof requires rigorous logical reasoning, the ability to chain together long sequences of inferences where a single error invalidates everything. Critics argued that AI might memorise formulas but could never exhibit the creative leaps required for novel proofs. Yann LeCun, Meta's Chief AI Scientist and a Turing Award laureate, stated that LLMs "cannot reason" because they only predict the next token, which is insufficient for sustained logical tasks.
Almost every claim made in this regard, from doubts about hypothesis formation to the supposed inability to solve conceptual problems that had escaped the best of our minds, has been proven wrong. The list of mathematical achievements has grown so long that, as with the models' capabilities in programming or image creation, one no longer needs to enumerate examples to substantiate the conclusion. Perhaps most significantly, Terence Tao, arguably the world's leading mathematician, has shifted his stance. Initially cautious about AI's near-term utility for research mathematics, Tao has become increasingly impressed. He has used AI tools to find counterexamples to long-standing conjectures and now anticipates that AI will be "a trustworthy co-author in mathematical research." The "brute force" critique has given way to a recognition that what humans call "intuition" may simply be high-dimensional pattern matching, exactly what these systems excel at.
The Broader Collapse: Context Windows, Data Walls, and Other Limits
Beyond individual critics, several consensus predictions about technical limitations have been falsified. Until 2023, it was widely argued that the quadratic scaling of attention made long context windows computationally impractical. Research identified a "Lost in the Middle" phenomenon in which models tended to ignore information in the centre of long prompts. The prediction was that AI would remain fundamentally limited to short documents.
Frontier models can now process millions of tokens, and "Needle in a Haystack" benchmarks show near-perfect retrieval of information regardless of position.
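For the curious, a "Needle in a Haystack" harness is only a few lines. In the sketch below, `ask_model` is a hypothetical stand-in for whatever long-context model is being evaluated; it is stubbed out so the harness itself runs.

```python
# Minimal sketch of a "needle in a haystack" evaluation harness.
FILLER = "The quick brown fox jumps over the lazy dog. "     # stand-in haystack text
NEEDLE = "The secret passphrase is 'blue-canary-42'."
QUESTION = "What is the secret passphrase?"

def build_prompt(n_sentences: int, depth: float) -> str:
    """Bury the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE + " ")
    return "".join(sentences) + "\n\n" + QUESTION

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real model call here. This stub "cheats" by searching
    # the prompt, so the harness can be exercised end to end.
    return "blue-canary-42" if "blue-canary-42" in prompt else "unknown"

def run(depths=(0.0, 0.25, 0.5, 0.75, 1.0), n_sentences=2000):
    for depth in depths:
        answer = ask_model(build_prompt(n_sentences, depth))
        print(f"depth={depth:.2f}  retrieved={'blue-canary-42' in answer}")

run()
```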
Similarly, prominent predictions warned that AI would "run out of data" by 2026, as high-quality public text became exhausted. This was framed as a resource constraint like Peak Oil. But the prediction failed to account for synthetic data generation and model self-play.
The collapse of these physical constraints was followed immediately by the breach of cognitive ones, specifically the "System 2" barrier. Critics like Yann LeCun had long argued that Large Language Models were inherently limited to "System 1" thinking—reactive, knee-jerk token prediction without the ability to plan or reason deliberatively. This "hard" boundary was shattered by "reasoning" models, which utilize inference-time compute to generate hidden chains of thought. By allowing the model to "think" and self-correct before responding, developers proved that reasoning capabilities could scale independently of model size, effectively grafting a deliberate, reflective brain onto a reactive engine.
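Production reasoning models learn to generate and prune their own chains of thought, typically with reinforcement learning, and none of that is reproduced here. The toy below only illustrates the statistical intuition behind spending more compute at inference time: if a noisy "proposer" is right more often than it is wrong in any single direction, sampling more candidates and taking a majority vote (the idea behind self-consistency) drives accuracy up sharply.

```python
# Toy illustration of why test-time compute helps (no real LLM involved): a noisy
# proposer is right only 40% of the time, but majority voting over many samples
# pushes accuracy toward 1.
import random
from collections import Counter

random.seed(1)
TRUE_ANSWER = 42

def noisy_proposer() -> int:
    """Stand-in for one sampled chain of thought ending in a final answer."""
    if random.random() < 0.4:
        return TRUE_ANSWER                    # correct reasoning path
    return random.choice([41, 43, 7, 100])    # scattered wrong answers

def solve(n_samples: int) -> int:
    votes = Counter(noisy_proposer() for _ in range(n_samples))
    return votes.most_common(1)[0][0]

for n in (1, 5, 25, 125):
    acc = sum(solve(n) == TRUE_ANSWER for _ in range(2000)) / 2000
    print(f"{n:>3} samples per question -> accuracy ~ {acc:.2f}")
```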
Even the "Scaling Laws" themselves—the mathematical predictions of diminishing returns—were outmaneuvered by architectural shifts that critics failed to foresee. While skeptics pointed to asymptotic limits on dense models, the industry pivoted to "Mixture of Experts" (MoE) architectures, best exemplified by the efficiency of models like DeepSeek. By splitting the model into specialized sub-networks and only activating a fraction of parameters for any given token, engineers found a way to continue scaling intelligence without exploding compute costs, bypassing the "wall" that static historical charts said was inevitable.
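A minimal numpy sketch of the routing idea, not any production architecture: a learned router scores the experts for each token, only the top-k experts are executed, and the rest of the layer's parameters sit idle for that token.

```python
# Minimal Mixture-of-Experts routing sketch (illustrative only): most expert
# parameters are untouched for any given token.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 32, 8, 2, 5

W_router = rng.normal(size=(d_model, n_experts)) * 0.1
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]

def moe_layer(x):
    """x: (n_tokens, d_model) -> (n_tokens, d_model), activating top_k experts per token."""
    logits = x @ W_router                                   # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(logits[t])[-top_k:]             # indices of the top-k experts
        weights = np.exp(logits[t, chosen])
        weights /= weights.sum()                            # softmax over the chosen experts
        for w, e in zip(weights, chosen):
            out[t] += w * (x[t] @ experts[e])
    return out

x = rng.normal(size=(n_tokens, d_model))
y = moe_layer(x)
print("output shape:", y.shape)
print(f"experts touched per token: {top_k}/{n_experts} "
      f"(~{top_k / n_experts:.0%} of expert parameters active)")
```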
This procession of falsification has created a recognizable sociological cycle. Critics confidently declare an asymptotic limit using terms like "data walls," "scaling plateaus," or "thermodynamic limits," often citing historical S-curves to prove the party is over. When a new architecture like MoE or a method like chain-of-thought obliterates that limit months later, these critics quietly withdraw, only to re-emerge a short time later with a slightly modified chart, claiming that now, finally, the true limit has been reached. It is a retreat played out in real-time, where the "laws" of physics and economics are invoked to explain why the machine cannot do what it has, in fact, just done.
Conclusion: The Vanity of Prediction
Nearly a century separates us from the Solvay debates, but the echo is deafening. When a science turns probabilistic—when it works in ways that cannot be fully traced by pen and paper—the fiercest critics are often the distinguished experts from the adjacent room.
Today’s most vocal skeptics are not Luddites. They are the brilliant pioneers of the previous regime: the masters of Recurrent Neural Networks, Symbolic AI, and classical linguistics. Like Einstein, they bring rigorous theoretical frameworks and genuine, principled concerns. But also like Einstein, their intuitions were trained in a world that no longer quite exists. They are trying to force a probabilistic revolution into a deterministic box, and the math is simply refusing to fit.
We see this in the relentless migration of the goalposts. When experiments vindicated quantum mechanics against the local realism of the EPR argument, critics retreated to philosophical debates about the nature of "reality". Today, as AI systems bulldoze through barriers deemed impossible only months ago, the retreat is toward the definition of "understanding". This is not dishonest; these are profound questions. But we must note the pattern: as falsifiable claims fall, the argument shifts to territory that cannot be empirically adjudicated.
As we enter the new year, the "prediction industrial complex" will spin up its engines. The optimists will inundate us with 100-page slide decks, drawing smooth exponential lines from the steam engine to the singularity. The pessimists will counter by invoking "Laws" named after dead economists and quoting the most famous skeptics of history to warn us of the walls we are about to hit.
Both are exercises in vanity.
The record of the last three years is clear: falsifiable predictions are dying young. The velocity of disproof has accelerated to the point where a confident forecast is often just a future embarrassment waiting to happen. The mathematical revolution continues to surprise us, ignoring the "limits" we try to impose on it.
In this environment, the most intellectually honest position is not to plant a flag, but to keep one's eyes open and admit that we simply cannot know where things are headed. And as long as there are no global rules to stop this unknowable march, we should stop trying to tell the math what it cannot do. It is far better to sit back, watch the dice roll, and learn.



