<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://dbonattoj.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://dbonattoj.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-03-01T02:13:25+01:00</updated><id>https://dbonattoj.github.io/feed.xml</id><subtitle>Personal website of Daniele Bonatto. </subtitle><author><name>Daniele Bonatto</name></author><entry><title type="html">Evaluating Insight, Not Just Computation</title><link href="https://dbonattoj.github.io/blog/2026/mindmap_exam/" rel="alternate" type="text/html" title="Evaluating Insight, Not Just Computation"/><published>2026-02-27T22:32:10+01:00</published><updated>2026-02-27T22:32:10+01:00</updated><id>https://dbonattoj.github.io/blog/2026/mindmap_exam</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2026/mindmap_exam/"><![CDATA[<h2 id="what-i-actually-want-to-measure">What I Actually Want to Measure</h2> <p>When I give an exam, my primary concern is not whether a student can carry out a long computation without mistakes. In machine learning, for example, nobody in industry will be asked to manually differentiate a loss function or expand a matrix derivation on a whiteboard, except perhaps during an interview. What matters, years later, is something more durable: whether they recognize when a problem is about optimization, when it reduces to linear algebra, when regularization is implicitly shaping the hypothesis space, or when a probabilistic interpretation clarifies what a model is actually assuming. The students need to see how optimization, regularization, and probabilistic modeling relate to one another.</p> <p>In computer graphics, students must understand how linear transformations structure space. 
In hardware architecture, they must grasp how memory hierarchies and parallelism constraints shape performance. In each case, I am less interested in whether a student can reproduce a long derivation than in whether they understand the structural role a concept plays within a larger system.</p> <p>In other words, I care less about procedural fluency and more about structural insight. The story would be different if these exams were given at an early bachelor level. But for late bachelor and master students, computations can always be reconstructed. Insight, however, is what allows someone to know which mathematics to reconstruct, and why. It is the difference between remembering formulas and understanding the landscape in which those formulas live.</p> <p>The issue is not discipline-specific. It concerns the nature of understanding itself, and that has direct consequences for what we choose to test, how we test it, and what we consider evidence of mastery.</p> <h2 id="climbing-the-structure-of-knowledge">Climbing the Structure of Knowledge</h2> <p>There is a useful metaphor for the progression of knowledge in the so-called DIKW pyramid. At the bottom lies data; above it, information; above that, knowledge; and at the top, wisdom. Traditional evaluation often stabilizes somewhere between information and procedural knowledge: students demonstrate that they can manipulate symbols correctly and reproduce established arguments. That has value. But it does not necessarily move them upward toward structured understanding or sound judgment.</p> <p>What I am trying to reach, by using <a href="/blog/2024/inquiry-based-learning/">insights-based learning</a> (my own version of inquiry-based learning), corresponds to the upper layers of that pyramid. Moving from information to knowledge means not only applying a method, but understanding why it works, under which assumptions, and how it connects to other ideas. 
Moving toward wisdom, in an academic sense, means being able to reorganize knowledge, detect hidden structures, and evaluate when a framework applies, and when it does not.</p> <p>This perspective aligns closely with what educational theory calls Higher-Order Thinking Skills (HOTS). HOTS go beyond recall and routine application. They involve analysis, synthesis, abstraction, transfer, and critical evaluation. Many educational reforms, such as inquiry-based science and reform mathematics, explicitly aim to cultivate these abilities, sometimes even reducing emphasis on rote procedures in favor of conceptual exploration. Assessments designed around HOTS typically rely on open-ended questions, justification, and structured reasoning rather than recognition-based formats.</p> <p>In the context of the DIKW pyramid, HOTS can be seen as the cognitive operations that allow students to climb upward. They transform isolated information into organized knowledge and, occasionally, into principled judgment. My goal is not to eliminate computation, but to situate it within that ascent. Computation belongs to the structure, but it should not define the summit.</p> <p>The difficulty, of course, is that insight is harder to measure than computation. A derivative is either correct or not. A matrix inversion either works or it fails. But how do we assess whether someone truly sees the connections between ideas?</p> <p>What I am trying to evaluate lies closer to structural knowledge. It is the ability to organize techniques under unifying principles, to recognize when two results are instances of a broader theorem, or to detect when an assumption silently governs an entire chain of reasoning.</p> <p>This applies whether we are discussing generalization in learning systems, homogeneous coordinates in graphics, or warp scheduling on GPUs. 
The domain changes; the epistemic structure does not.</p> <h2 id="exams-with-knowledge-graphs">Exams with Knowledge Graphs</h2> <p>This year, I am experimenting with a different approach. Instead of emphasizing derivations alone or broad concepts and terminology, I ask students to build a comprehensive mind map of the course before the exam and bring it with them.</p> <p>For example, machine learning is not a linear subject; it is a graph. Loss functions connect to optimization theory. Optimization connects to convexity. Convexity connects to generalization guarantees. Generalization connects to bias–variance trade-offs. Capacity control connects to overfitting. Probabilistic modeling connects to uncertainty quantification. Each concept is a node, and the discipline emerges from the arrows between them.</p> <p>The assignment is simple in appearance but demanding in practice. Students must identify the main concepts encountered during the semester and represent them as nodes. More importantly, they must draw arrows that encode precise relationships: general theoretical links (what is the link between a Gaussian process and kernel methods? between kernel methods and PCA?), implication, specialization, reformulation, causal influence, equivalence under assumptions, or methodological dependence.</p> <p>It is at that stage that one can identify missing links, see hidden assumptions, or detect conceptual gaps. Knowledge becomes navigable rather than accumulative.</p> <p>During the exam, instead of asking them to reproduce a lengthy proof, I may ask them to explain the meaning of a specific arrow (for example, why least squares does not work well for classification, what the link is between least squares and regression, or what the link is between regression and the sigmoid), justify why a connection should exist if it is missing, or argue why two nodes should not be directly connected without additional assumptions. 
In doing so, I am not grading memory; I am evaluating the structure of their internal graph.</p> <p>This evaluation echoes my article on how to teach a <a href="/blog/2025/knowledge-graph/">graph of knowledge</a> and my version of <a href="/blog/2024/inquiry-based-learning/">inquiry-based learning</a>.</p> <h2 id="a-built-in-revision-mechanism">A Built-In Revision Mechanism</h2> <p>I hope that this method also transforms how students study.</p> <p>If a student postpones the construction of the mind map until the final days before the exam, they are forced to revisit the entire course in compressed form. They must traverse the conceptual terrain repeatedly, checking whether each connection makes sense. That phase is intellectually intense, but it forces them to extract structure from content.</p> <p>If, instead, they build the map progressively during the year, every new chapter forces them to reintegrate prior knowledge. Each new concept triggers a systematic question: does this modify an existing node? Does it create a new link? Does it contradict a previous assumption? In effect, the method enforces continuous revision without explicitly scheduling it.</p> <p>Either way, the graph grows denser, and insights are extracted.</p> <h2 id="collective-understanding">Collective Understanding</h2> <p>There is also a collective dimension. While each student constructs their own map, they can discuss arrows in small groups. Debating whether early stopping is a form of regularization, or whether cross-entropy is simply maximum likelihood in disguise, forces them to articulate mechanisms rather than recite definitions.</p> <p>The conversation shifts from “how do you solve this exercise?” to “what is the structural role of this concept in the theory?” That shift is precisely what I want. 
The goal is not isolated correctness, but shared structural understanding.</p> <p>Building mind maps encourages students to work together, and many minds are often more powerful than an isolated one. I think it also helps refine concepts collectively, learning from one another how to extract meaning from equations.</p> <h2 id="studying-in-the-age-of-automation">Studying in the Age of Automation</h2> <p>We are entering a period in which increasingly autonomous systems can generate derivations (even if they are often incorrect), produce code, and replicate standard arguments. The technical shift toward highly automated systems is real, and trust in computational agents is expanding.</p> <p>In such a context, the comparative advantage of human learners shifts. If procedural reproduction becomes automated, then structured thinking, the ability to connect, abstract, critique, and reorganize, becomes more central.</p> <p>Across all the courses I teach, my objective is therefore consistent: I want students to begin forming the internal graph characteristic of expertise. If that graph exists, computations follow as structured consequences. If it does not, computations remain disconnected maneuvers.</p> <h2 id="what-i-hope-remains">What I Hope Remains</h2> <p>Ultimately, what I want students to leave with is not a collection of formulas, but a structured internal graph. Five years from now, they may not remember the exact algebra behind a support vector machine. But if they understand that it is fundamentally about margin maximization in a high-dimensional feature space, they can reconstruct the details when needed.</p> <p>Insight functions as compressed knowledge: a small set of organizing ideas from which technical machinery can be regenerated.</p> <h2 id="conclusion">Conclusion</h2> <p>I do not yet know whether this experiment will succeed. Evaluating insight is inherently more subtle than grading computations. 
But I am convinced that, in a world where calculation is cheap and automation is ubiquitous, education must aim higher in the pyramid.</p> <p>If students leave not merely with formulas, but with a coherent and densely connected internal representation of the field, then the exam will have achieved its purpose. Computation will not disappear. It will simply be grounded in insight rather than memorization.</p> <h1 id="additional-notes">Additional notes:</h1> <h3 id="insight-and-computation">Insight and Computation</h3> <p>When I say that computation is not my primary concern, I do not mean that it is unnecessary or that software can replace reasoning. What I mean is more precise: if the insight is present, the computation should follow naturally, even if a book is needed to recall the exact formalism.</p> <p>If a student truly understands that a problem reduces to an optimization question, then gradients, constraints, or dual formulations are direct consequences. If they understand how a transformation acts on a vector space, then its matrix representation is no longer a memorized artifact but an inevitable encoding. If they understand how architectural constraints affect throughput, then performance formulas become explanatory rather than decorative.</p> <p>Computation is not dispensable. It is derivative. It unfolds from structure.</p> <p>Without insight, procedures remain isolated techniques. With insight, they become necessary expressions of a coherent framework.</p> <h3 id="mind-maps-and-their-intellectual-lineage">Mind Maps and Their Intellectual Lineage</h3> <p>The idea of representing knowledge as a mind map is not new. Tony Buzan popularized mind maps decades ago in highly visual and colorful books that left a lasting impression on many readers. 
His central intuition was that understanding radiates outward from core ideas and branches associatively rather than unfolding as a rigid outline.</p> <p>Years ago, I also read Alain Thiry and Fanny Demeulder, who developed structured study methodologies grounded in mind mapping principles. Their work formalizes the process: mind maps are not merely visual summaries but instruments for organizing cognition and reinforcing structural integration.</p> <p>My approach builds on that tradition, but situates it within technical higher education.</p> <h1 id="references">References:</h1> <ol> <li>Anderson, L. W., &amp; Krathwohl, D. R. (2001). <em>A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives.</em></li> <li>Brookhart, S. M. (2010). <em>How to Assess Higher-Order Thinking Skills in Your Classroom.</em></li> <li>Buzan, T. (1993). <em>The Mind Map Book.</em></li> <li>Thiry, A., &amp; Demeulder, F. (2012). <em>Ça y est, j’ai compris!</em></li> <li>Ackoff, R. L. (1989). <em>From Data to Wisdom.</em> Journal of Applied Systems Analysis, 16, 3–9.</li> <li>Zins, C. (2007). <em>Conceptual Approaches for Defining Data, Information, and Knowledge.</em> Journal of the American Society for Information Science and Technology, 58(4), 479–493.</li> <li>Prince, M., &amp; Felder, R. (2006). 
<em>Inductive Teaching and Learning Methods: Definitions, Comparisons, and Research Bases.</em> Journal of Engineering Education, 95(2), 123–138.</li> </ol>]]></content><author><name>Daniele Bonatto</name></author><category term="teaching"/><summary type="html"><![CDATA[Evaluating a Knowledge Graph, instead of evaluating computations in higher education.]]></summary></entry><entry><title type="html">Interpolation, Extrapolation, and What Exams Really Measure</title><link href="https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p3/" rel="alternate" type="text/html" title="Interpolation, Extrapolation, and What Exams Really Measure"/><published>2026-02-07T17:30:00+01:00</published><updated>2026-02-07T17:30:00+01:00</updated><id>https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p3</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p3/"><![CDATA[<p><em>This is Part 3 of a series on AI, education, and expertise. See also: <a href="/blog/2026/extrapolation-interpolation_p1/">Part 1: The Rise of Autonomous Systems</a> and <a href="/blog/2026/extrapolation-interpolation_p2/">Part 2: Formal Verification and the Path to Machine Discovery</a></em></p> <p>There’s a distinction I keep returning to when I think about what exams measure and what research demands. The simplest way I’ve found to express it is this: <em>most exams reward interpolation. Research, in contrast, is extrapolation</em>.</p> <p>I realize this sounds like a neat formula, the kind that fits well in a tweet but collapses under scrutiny. And yet, the more I think about it, especially now, as large language models reshape how we write code, organize knowledge, and approach problems, the more accurate this distinction feels.</p> <h2 id="the-interpolation-extrapolation-distinction">The Interpolation-Extrapolation Distinction</h2> <p>Consider how exam preparation works. 
Students train on a finite set of exercises, examples, and problem types. The exam samples from roughly the same distribution. Questions may be disguised, combined, or slightly perturbed, but they remain, in a statistical sense, within the convex hull of what has already been seen. Success requires recognizing patterns, mapping them to known solutions, and executing reliably.</p> <p>This is not inherently bad. Interpolation is a real skill. It reflects domain mastery, technical fluency, and the ability to apply known ideas consistently. These matter. But we should be honest about what exams measure, and what they do not.</p> <p>Research asks a different question. Not “have you seen something like this before?” but “what happens outside what we currently understand?” The goal is not to recombine known patterns, but to push beyond them: to ask questions that were not part of the training set, to explore regimes where existing tools break down, and to build new concepts when old ones no longer suffice.</p> <p>And yet, something important happens during exam preparation that I think we often overlook. When students work through exercises, they’re not just memorizing solutions, they’re building an internal structure, a way of seeing patterns, a repertoire of moves. This repertoire becomes the foundation for extrapolation. You cannot extrapolate from nothing. The ability to venture beyond known territory requires first having stable ground to stand on, a rich enough internal model that you can recognize when something is genuinely new versus merely unfamiliar.</p> <p>In this sense, the interpolation skills developed through exams are not opposed to extrapolation, they’re prerequisite to it. The student who has deeply internalized enough examples, who has practiced recognizing structure across varied problems, is better equipped to notice when a problem falls outside the known space and requires something different. 
Mastery of interpolation, paradoxically, is what makes extrapolation possible.</p> <h2 id="why-human-extrapolation-still-matters">Why Human Extrapolation Still Matters</h2> <p>In <a href="/blog/2026/extrapolation-interpolation_p1/">Part 1</a>, I argued that human expertise is shifting upward: toward designing the harness, not just writing the code. In <a href="/blog/2026/extrapolation-interpolation_p2/">Part 2</a>, I showed how formal verification systems like Lean provide constraints that might make extrapolation mechanically checkable. But there’s a deeper point I want to make explicit here: we cannot design good harnesses, or ask the right questions of formal systems, without becoming extrapolators ourselves.</p> <p>This matters for a reason that’s easy to miss. When an LLM hallucinates, when it produces plausible but incorrect output, the failure often looks like success. The syntax is correct. The explanation sounds reasonable. The structure mimics what a real solution would look like. Detecting this kind of failure requires exactly the skill that exams are meant to cultivate: the ability to recognize when something has stepped outside the valid space.</p> <p>Even with formal verification, the problem doesn’t disappear, it transforms. Lean can tell you whether a proof is logically valid, but it cannot tell you whether you proved the right theorem. It cannot tell you whether your formalization captured the actual problem you cared about. That judgment requires deep understanding of the domain, the kind that comes from years of working through examples, building intuition, and learning to see when something is subtly wrong.</p> <p>The Anthropic compiler experiment demonstrates this precisely. The success came not from better prompts but from carefully designed tests, reference oracles, and feedback loops that could detect when the system was fooling itself. Someone had to design those constraints. Someone had to know what correctness meant in that context. 
That someone needed to understand compilers well enough to recognize failure modes, edge cases, and the difference between “works on this input” and “works in general”.</p> <p>This is why studying remains essential, not despite the rise of capable AI systems, but because of it. The art is shifting toward asking the right questions, yes, but formulating good questions requires knowledge of how things work. You cannot ask whether a compiler correctly handles type inference if you don’t understand what type inference is. You cannot verify a mathematical proof if you don’t know what the theorem means. The burden of verification intensifies, and verification demands expertise.</p> <p>Skills such as algorithmic thinking, mathematical reasoning, abstraction, and the ability to hold multiple interacting constraints in mind are what need to be cultivated. We should guide students to reason without scaffolding, to operate when the path is not spelled out, and to notice when something is missing or inconsistent. These are the same skills required to guide, question, and correct LLMs rather than blindly trust them.</p> <p>As this trend continues, the scarce skill won’t be typing programs but deciding what should be built, why it should work, and how to tell whether it actually does.</p> <h2 id="are-exams-efficients-">Are Exams Efficient?</h2> <p>Exams are often criticized as artificial, stressful, or disconnected from real-world practice. Some of that criticism is justified. But exams do something I find extremely valuable: they build skills.</p> <p>A well-designed exam doesn’t primarily test whether a student remembers a formula. It tests whether they can recognize structure, choose an approach, reason under constraints, and adapt known ideas to a new situation. In other words, it tests controlled extrapolation. 
Not full research-level discovery, but the ability to go beyond rote interpolation while remaining within a well-defined space.</p> <p>This matters even more today than a decade ago, I think.</p> <p>The interpolative skills exams build, recognizing structure, reasoning under constraints, building mental repertoires, are precisely what enable extrapolation. You cannot venture beyond the map without first learning to read it. The students who develop deep interpolative mastery, who internalize enough examples to see patterns across problems, are the ones who will recognize when they’ve stepped outside known territory and need to think differently.</p> <p>That recognition, that capacity to distinguish familiar from genuinely novel, is itself a form of extrapolation. And it begins with disciplined study, with well-designed exams, with the hard work of building internal structure.</p> <p>That’s not a threat to education. It’s a reminder of what education has always been for.</p> <h2 id="should-we-still-teach-programming">Should We Still Teach Programming?</h2> <p>That LLMs are strong interpolators, and perhaps nascent extrapolators, does not mean programming becomes irrelevant. Quite the opposite.</p> <p>Programming remains a crucial epistemic tool. When studying a theoretical concept, the ability to simulate it, plot it, stress-test it, or explore edge cases computationally is invaluable. Writing small programs forces precision. It exposes hidden assumptions. It turns vague understanding into something concrete and falsifiable.</p> <p>Even if future software production relies heavily on automated agents, programming will still play the same role that experiments play in physics or diagrams play in mathematics: a way to think, not just a way to deliver a product.</p> <p>The skill shift isn’t from programming to theory, but from programming-as-output to programming-as-understanding. 
Exams, theory-heavy courses, and exploratory coding all point in the same direction: training minds that can reason, validate, and guide intelligent tools rather than compete with them.</p> <p>If LLMs are becoming powerful extrapolators in constrained formal environments like Lean, then education must focus even more on something different but complementary: producing humans who can extrapolate in messy, undefined spaces where formal systems don’t yet exist.</p> <p>That includes recognizing which problems need formalization, crafting the right constraints, and building the harnesses that make machine extrapolation possible. These are not mechanical tasks. They require judgment, taste, and deep theoretical understanding, qualities that cannot be outsourced to automation without first being supplied by humans.</p> <h2 id="additional-notes">Additional notes:</h2> <h3 id="the-paradox-of-learning-to-code-in-an-agentic-world">The Paradox of Learning to Code in an Agentic World</h3> <p>I want to address something I find genuinely difficult about the world we’re entering, something that affects students and newcomers more than experienced practitioners.</p> <p>If you’re learning to program today, you face what looks like an exercise in futility. Every coding task you struggle with, an LLM can solve faster and often better. Why spend hours debugging when an agent can iterate through solutions in minutes? Why learn algorithms when the machine has seen millions of implementations? The psychological barrier is real: it’s hard to persist in learning a skill when you’re constantly reminded that a machine outperforms you.</p> <p>And yet, this framing misses what programming actually does for you as a learner.</p> <p>When you write code to explore an idea, to test whether an algorithm behaves as you expect, to visualize a dataset, to simulate a system, you’re not competing with agents. You’re building the mental structures that make it possible to interrogate them effectively. 
Programming forces precision in a way that reading or watching cannot. It exposes the gap between “I understand this in principle” and “I can make this work in practice.” It surfaces hidden assumptions, edge cases, and conceptual confusions that remain invisible otherwise.</p> <p>This is exactly the shift I described earlier: from programming-as-output to programming-as-understanding. The goal is no longer to produce production code. The goal is to build enough internal structure that you can recognize what questions to ask, what tests to run, what failures to look for. An LLM can generate code, but it cannot tell you which idea is worth exploring. It cannot recognize which result is surprising. It cannot know what you’re trying to understand.</p> <p>In <a href="/blog/2026/extrapolation-interpolation_p1/">Part 1</a>, I argued that human value is moving upward, toward theory, specification, and reasoning about correctness. But you cannot reason about correctness without having struggled with incorrectness. You cannot design good specifications without having built broken systems. The process of learning to code, especially when it’s hard, especially when you fail repeatedly, builds exactly the judgment needed to work effectively with autonomous systems.</p> <p>This mirrors something I see in mathematics. Formal proof assistants like Lean can verify theorems, but mathematicians still work through proofs by hand, still struggle with examples, still build intuition through failure. Not because the manual work produces better proofs, but because the process of doing it builds the understanding needed to know what to prove and how to recognize when a result is meaningful.</p> <p>So to newcomers facing this apparent paradox: yes, the machine codes better than you. That’s not the point. 
The point is that learning to code, especially learning through struggle and failure, builds the conceptual scaffolding that will let you see what the machine cannot: which problems matter, which solutions are elegant, which failures are informative, and which successes are accidental.</p> <p>That understanding cannot be outsourced. It has to be built, one failed program at a time.</p> <h3 id="the-use-of-llms-as-calculators">The Use of LLMs as Calculators</h3> <p>Every time we automate a skill, we don’t make that skill obsolete, we make the meta-skill of understanding and judging it more valuable. Calculators didn’t make arithmetic irrelevant; they made numerical reasoning more important. Compilers didn’t make understanding code irrelevant; they made understanding abstractions essential.</p> <p>But there’s a crucial difference with LLMs that changes the nature of this pattern. Calculators and compilers are deterministic machines. We trust them precisely because they produce the same output every time, because their behavior is predictable and verifiable. LLMs, even those enhanced with formal systems, remain fundamentally stochastic at their core. The same prompt can yield different answers, some correct, some plausible, some subtly wrong. This introduces a qualitatively different kind of problem: the output can shift between runs, and there’s no guarantee of consistency even when formal verification catches logical errors.</p> <p>So the pattern still holds, but the meta-skill has evolved. With deterministic tools, we learned to trust the tool and focus on formulating the right question. With stochastic tools, even powerful ones, we need something more demanding: the ability to formulate questions, the judgment to evaluate whether the answer is trustworthy, and the theoretical grounding to verify the result independently. The burden of verification doesn’t disappear, it intensifies. 
And this makes rigorous thinking, the kind that distinguishes plausible from correct, more critical than ever.</p> <hr/> <p><em>This concludes the series. Return to <a href="/blog/2026/extrapolation-interpolation_p1/">Part 1: The Rise of Autonomous Systems</a> or <a href="/blog/2026/extrapolation-interpolation_p2/">Part 2: Formal Verification and the Path to Machine Discovery</a></em></p>]]></content><author><name>Daniele Bonatto</name></author><category term="teaching"/><category term="llm"/><summary type="html"><![CDATA[The distinction between pattern recognition and genuine discovery, and why understanding this gap matters for both students and machines.]]></summary></entry><entry><title type="html">Formal Verification and the Path to Machine Discovery</title><link href="https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p2/" rel="alternate" type="text/html" title="Formal Verification and the Path to Machine Discovery"/><published>2026-02-07T17:15:00+01:00</published><updated>2026-02-07T17:15:00+01:00</updated><id>https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p2</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p2/"><![CDATA[<p><em>This is Part 2 of a series on AI, education, and expertise. See also: <a href="/blog/2026/extrapolation-interpolation_p1/">Part 1: The Rise of Autonomous Systems</a></em></p> <p>In the previous article, we saw how autonomous systems are decoupling correctness from code quality, producing functional but often “ugly” software through massive iteration. But agents generate new software by leveraging existing theory, examples, and structures rather than reasoning from first principles alone. 
This raises a natural question: are machines fundamentally limited to recombining existing patterns (interpolation), or can they genuinely discover new knowledge (extrapolation)?</p> <h1 id="neural-networks-as-interpolators">Neural Networks as Interpolators</h1> <p>Neural networks, as commonly trained today, are extraordinarily good interpolators. Given enough data, they approximate complex functions with impressive accuracy within the region covered by that data. But they are notoriously brittle when asked to extrapolate.</p> <p>I have seen this many times in my research. A simple illustration: the classic PenDigits dataset [1]. The task is handwritten digit recognition, but the test set comes from different writers than the training set. Humans barely notice the shift: we effortlessly generalize across writing styles, strokes, and personal quirks. Neural networks often suffer a significant performance drop. The distribution shift is small in human terms but large in statistical ones: what feels like routine interpolation to us is an extrapolation task for the network.</p> <p>The network has learned <em>what digits look like in the training distribution</em>, not what a digit <em>is</em>.</p> <p>This gap between interpolation and extrapolation appears everywhere: reinforcement learning, control, language, reasoning. Models excel when test data is “more of the same” and struggle when structure changes meaningfully. Humans extrapolate constantly. We reason with sparse data, build mental models, apply them in novel situations. When faced with a new problem, we don’t merely search for a nearby example, we ask what should happen, based on principles, abstractions, and causal understanding.</p> <p>Seen through this lens, many current debates become clearer. Existing benchmarks evaluate interpolation, which is extremely useful on its own: they test generalization within a dataset family. 
But for an LLM to be useful for research, we also need extrapolation.</p> <p>None of this diminishes the value of neural networks or large language models (LLMs). Interpolation at scale is powerful. It changes how we access information, explore ideas, and prototype solutions. But extrapolation, the ability to genuinely extend knowledge, remains a different problem.</p> <h2 id="why-llms-remain-interpolators">Why LLMs Remain Interpolators</h2> <p>LLMs are remarkable interpolators over an enormous and diverse corpus of human text. The interpolation space is vast, which makes their outputs appear surprisingly general. But vast interpolation is still interpolation. The fact that the space is high-dimensional and richly structured doesn’t automatically grant extrapolative ability in the sense research demands.</p> <p>The agent-based coding systems from <a href="/blog/2026/extrapolation-interpolation_p1/">Part 1</a>, despite their impressive capabilities, remain fundamentally constrained to the realm of their training distribution.</p> <p>Of course, for some problems, even if humans don’t have the answer yet, having a big corpus of data allows LLMs to get pretty far in what could be considered extrapolation. But in practice, from the model’s perspective, it’s still interpolation.</p> <p>This explains why LLMs can feel simultaneously impressive and fragile. They answer questions, generate code, and explain concepts fluently, until the problem subtly steps outside the patterns they’ve internalized. Then the cracks show: confident but incorrect answers, brittle reasoning chains, shallow analogies that collapse under scrutiny.</p> <p>I’ve seen this firsthand when I code. When I ask an LLM for generic code, it produces several versions with different strategies, in many languages. When I ask for scientific code, it’s usually full of mistakes, offers only one approach, and the number of languages is restricted.
This suggests to me that they can’t really extrapolate, at least not yet.</p> <p>If we want machines to participate in research rather than merely assist with it, we need to understand this gap clearly, not blur it with impressive demos. And if we want to educate students for research, we should be honest about what our exams actually measure. Because passing an exam means you can interpolate. Doing research means you can go where the map ends.</p> <h1 id="extrapolative-neural-networks">Extrapolative Neural Networks</h1> <p>Recent events complicate that story, and I find them fascinating.</p> <p>Earlier this month (02.2026), the startup Axiom published four original mathematical papers [2-6], each containing a complete and mechanically verified proof of a previously unsolved or partially solved problem. In one striking case, a problem in algebraic geometry that had resisted human effort for five years was solved overnight after being presented to Axiom’s system. The key step wasn’t brute-force computation but a reformulation of the problem, a change of perspective that reduced it to a known identity no one had noticed was relevant.</p> <p>This is precisely the kind of move we associate with extrapolation.</p> <p>What makes this particularly interesting to me is not just that the proofs exist, but how they were produced. The system doesn’t merely generate informal arguments; it translates the entire reasoning process into Lean, a formal proof language in which every logical step is mechanically checked. The result is not a persuasive explanation but an object that can be verified, rerun, and audited by anyone.</p> <p>This matters because it sharply constrains the space in which the system operates.</p> <p>A large language model trained purely on natural language is free to generate fluent text, but it’s also free to hallucinate, gloss over gaps, and smuggle in unjustified steps.
Lean removes that freedom.</p> <p>Lean [7,8] is a proof assistant, software that requires every step of a mathematical proof to be formally verified by a computer. Think of it as a compiler for mathematical reasoning: either your proof type-checks, or it doesn’t. There’s no such thing as “almost correct”. A step either checks or it fails.</p> <p>This is why, in my view, training models with formal systems like Lean is the right direction of progress beyond purely interpolative networks. Lean acts as a forcing function toward extrapolation.</p> <p>What makes Axiom interesting as a case study isn’t that we understand its training methodology (we don’t, and the papers are too recent for proper scrutiny). Rather, it’s that Lean provides something the C compiler example lacked: <strong>a mechanistic test for extrapolation</strong>.</p> <p>The C compiler could be evaluated on correctness (does it compile Linux?) but not on novelty (are the solutions new?). Axiom’s proofs can be evaluated on both: Lean verifies correctness automatically, and the mathematical community can verify novelty by checking whether the theorems were previously proven.</p> <p>This is why formal systems like Lean matter for understanding the interpolation/extrapolation boundary: they don’t just constrain hallucination; they make the distinction between recombination and discovery <em>mechanically checkable</em>.</p> <p>To succeed, the model must construct chains of reasoning that survive outside the statistical comfort zone of plausible text. It must discover structures that actually hold, not just ones that sound right. Reformulation becomes a necessity, not a stylistic flourish.
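</p> <p>To make the “compiler for reasoning” idea concrete, here is a toy Lean 4 snippet (my own illustration, not taken from Axiom’s papers): it compiles only because every step is justified.</p>

```lean
-- Lean accepts this theorem because the proof term type-checks:
-- `Nat.add_comm a b` has exactly the type `a + b = b + a`.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Replacing the proof with an unjustified step such as `rfl` makes the
-- file fail to compile: `a + b` and `b + a` are not definitionally equal
-- for arbitrary `a` and `b`, so there is no "almost correct" proof.
```

<p>Either the file checks as a whole, or Lean rejects it; there is nothing in between.</p> <p>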
The system is pushed away from surface-level interpolation toward something closer to genuine conceptual navigation.</p> <p>I find myself wondering: if we stretch this idea further, by retraining with novel discoveries, and if continual learning [9] one day succeeds, we could directly incorporate theorems that were previously unknown into the training set. This wouldn’t be cheating with train/validation splits because they can be formally verified. Incorporating formal verification as part of the loss function means we can generate many new verifiable examples automatically. So, step by step, training by training, we might enable genuine extrapolation.</p> <p>In mathematics, Lean doesn’t discover proofs in a vacuum. It provides a formal environment where correctness is non-negotiable, where every step must type-check, where ambiguity is eliminated. When LLMs are trained and evaluated inside such environments, something important happens: they stop optimizing for plausibility and start optimizing for truth.</p> <p>For programming, I think we’re converging toward the same idea. Strong test suites, formal specifications, verified compilers, property-based testing, model checking: these aren’t just “engineering best practices” anymore. They’re becoming the interface between human intent and machine interpolation. A “Lean for programming”, or rather, a family of formal, executable specifications, may be the most realistic path forward for reliable autonomous software development.</p> <h3 id="the-structure-of-extrapolation">The Structure of Extrapolation</h3> <p>This changes how we think about machine learning. Perhaps the issue isn’t that machines can’t extrapolate, but that extrapolation requires a space where correctness is rigid, feedback is immediate, and structure can’t be faked. Mathematics, when formalized, provides exactly that environment.</p> <p>Interestingly, this mirrors how humans learn to extrapolate in mathematics.
We don’t acquire that skill by reading polished proofs alone, but by struggling with definitions, making failed attempts, reformulating problems, and checking every step until something finally clicks. Lean externalizes this discipline. It turns extrapolation into a navigable space rather than an act of intuition alone.</p> <p>From this perspective, systems like Axiom show us something important. What they demonstrate is that genuine discovery becomes accessible to machines when the task is embedded in a formal structure that enforces meaning.</p> <p>Large language models trained only on text remain extraordinary interpolators. Large language models trained to reason inside formal systems may become something else entirely.</p> <p>Whether that path leads to AGI is an open question. But if it does, I don’t think it will come from ever-larger benchmarks that reward familiarity. It will come from environments where the map is incomplete, the rules are strict, and the only way forward is to genuinely discover new structure.</p> <p>That, after all, is what research has always been.</p> <h1 id="conclusion">Conclusion</h1> <p>The thread running through this discussion is simple: what a system can do depends critically on the environment it is placed in. Interpolation thrives in open-ended spaces where plausibility is enough. Extrapolation requires something stricter, a setting where correctness is rigid, feedback is immediate, and shortcuts are impossible.</p> <p>Large language models demonstrate how powerful interpolation can be at scale. That power is real and valuable, but mistaking it for discovery creates confusion in research, education, and public debate. Benchmarks overwhelmingly measure competence within a known distribution. Research, by contrast, begins when that distribution no longer applies.</p> <p>Formal systems like Lean change this dynamic. They collapse the distinction between “sounds right” and “is right”.
When machines are forced to operate inside such systems, extrapolation becomes a navigable process rather than an act of stylistic fluency.</p> <p>This has implications beyond AI. If we care about extrapolation, in students or machines, we must build environments that demand it. Mathematics, formal verification, and executable specifications show one viable path: restrict freedom in the right way, and genuine structure emerges. For students, the right path is still to be determined.</p> <h2 id="additional-notes">Additional notes:</h2> <h3 id="verification-the-case-of-axiom">Verification: The Case of Axiom</h3> <p>The recent breakthroughs from Axiom are promising, but we must remain cautious. While the proofs themselves are mechanically verified by Lean, eliminating the standard concern of LLM hallucination, the “how” remains a black box. These are incredibly recent papers, and we do not yet fully understand the implications of the training methodology used to produce them. We don’t know if this approach scales, if it relies on subtle data contamination, or if it represents a sustainable path toward general reasoning.</p> <p>Until these results are fully integrated into the broader mathematical canon and the underlying training paradigms are transparently stress-tested by the research community, they remain “promising artifacts” rather than settled law. But they prove one thing: when we give an interpolator a formal cage to play in, it can occasionally find the key to the door leading outside.</p> <p>Ultimately, these notes reinforce the same conclusion: whether we are navigating an alien codebase or auditing a machine-generated proof, the burden of final judgment remains stubbornly human.</p> <p>These observations about formal systems, machine capabilities, and the importance of theoretical skills raise a deeper question: what fundamental distinction separates pattern recognition from genuine discovery?
In the final part of this series, we’ll explore this conceptual framework and what it means for how we think about learning, research, and the future of human expertise.</p> <p><em>Continue reading: <a href="/blog/2026/extrapolation-interpolation_p3/">Part 3: Interpolation, Extrapolation, and What Exams Really Measure</a> - the core conceptual framework underlying this series</em></p> <h1 id="references">References:</h1> <ol> <li>Alpaydin, E. &amp; Alimoglu, F. (1996). Pen-Based Recognition of Handwritten Digits [Dataset]. UCI Machine Learning Repository. <a href="https://doi.org/10.24432/C5MG6K">10.24432</a>.</li> <li><a href="https://www.lesnumeriques.com/intelligence-artificielle/une-ia-vient-de-resoudre-quatre-enigmes-mathematiques-complexes-que-personne-n-avait-denouees-n251153.html">Axiom theorem proving (French)</a>, <a href="https://web.archive.org/web/20260206182303/https://www.lesnumeriques.com/intelligence-artificielle/une-ia-vient-de-resoudre-quatre-enigmes-mathematiques-complexes-que-personne-n-avait-denouees-n251153.html">Wayback</a></li> <li><a href="https://arxiv.org/pdf/2602.03722">Parity of k-differentials in genus zero and one, 2602.03722</a></li> <li><a href="https://arxiv.org/pdf/2602.03716">Fel’s conjecture on syzygies of numerical semigroups, 2602.03716</a></li> <li><a href="https://arxiv.org/pdf/2602.05095">Dead ends in square-free digit walks, 2602.05095</a></li> <li><a href="https://arxiv.org/pdf/2602.05090">Almost all primes are partially regular, 2602.05090</a></li> <li>de Moura, L., Ullrich, S. (2021). The Lean 4 Theorem Prover and Programming Language. In: Platzer, A., Sutcliffe, G. (eds) Automated Deduction – CADE 28. CADE 2021. Lecture Notes in Computer Science, vol 12699. Springer, Cham.
<a href="https://doi.org/10.1007/978-3-030-79876-5_37">10.1007/978-3-030-79876-5_37</a></li> <li><a href="https://lean-lang.org/">The Lean Theorem Prover</a></li> <li><a href="https://cameronrwolfe.substack.com/p/rl-continual-learning">Continual Learning with RL</a>, <a href="https://web.archive.org/web/20260127125419/https://cameronrwolfe.substack.com/p/rl-continual-learning">Wayback</a></li> </ol>]]></content><author><name>Daniele Bonatto</name></author><category term="teaching"/><category term="llm"/><category term="teaching"/><category term="llm"/><summary type="html"><![CDATA[When machines prove theorems and why exams matter more than ever in an age of increasingly capable AI systems.]]></summary></entry><entry><title type="html">The Rise of Autonomous Systems - When Machines Write Their Own Code</title><link href="https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p1/" rel="alternate" type="text/html" title="The Rise of Autonomous Systems - When Machines Write Their Own Code"/><published>2026-02-07T16:58:10+01:00</published><updated>2026-02-07T16:58:10+01:00</updated><id>https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p1</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2026/extrapolation-interpolation_p1/"><![CDATA[<p><em>This is Part 1 of a series on AI, education, and expertise.</em></p> <p>Something has changed in how companies approach software development [1,2]. The phrase “the art of programming” already hints at something important: programming has an artistic component. There’s care in making code beautiful, readable, extensible, maintainable. That matters, especially to programmers themselves.</p> <p>But historically, programming was never an end in itself. It’s a means to an end. Companies build products for users, not for the intrinsic beauty of the codebase. 
If we view programming primarily as a tool, then it’s natural to want something better at achieving the underlying goal.</p> <p>If the resulting code is ugly underneath but produces correct answers, the goal is achieved; if a client is unhappy and we can ask a system to refine the pipeline without breaking it; if this refinement happens in hours without supervision, this is the grail many companies, and many researchers, have chased for decades.</p> <p>For a long time, large language models simply weren’t reliable enough to write or maintain code at scale. By late 2025, something shifted subtly. With the earlier introduction of agents and skills, and now better-performing models with longer-horizon contexts and thinking capacities, notably Claude Code, coding capabilities changed qualitatively. Instead of producing code that degraded as complexity increased, agents began writing systems that could locally repair themselves, improve iteratively, and (hopefully) remain stable over time.</p> <p>Much of this improvement addresses what I think of as the context problem. When a single model tries to reason about everything at once, its context becomes polluted and output quality drops sharply. Agent-based systems [3] mitigate this by decomposing work into smaller subtasks. A child agent focuses narrowly on a specific problem, then sends a concise summary back to its parent. This keeps the parent’s context clean while preserving essential information for coordination.</p> <p>Another key idea is what we call skills [4], particularly in the Model Context Protocol (MCP) [5] context. The goal: give language models access to the external world through tools. Naively, this would require exposing large APIs directly in the model’s context. Imagine giving an LLM access to all of Chrome’s functionality: the API alone would overwhelm the context window immediately. Skills solve this by acting as minimal specifications of desired interactions.
A small team of agents builds the required tool on the fly, exposing only a narrow interface. The orchestrating agent operates with a minimal API in context, drastically improving reliability.</p> <p>These architectural improvements enabled a remarkable demonstration: Anthropic recently built a complete C compiler using agents, a budget of around $20,000, and only the C language specification [6]. This compiler isn’t yet on par with GCC, it’s slower and less optimized, but it achieved something I find remarkable: it successfully compiled the full Linux kernel.</p> <p>That’s not a toy benchmark. It’s a strong signal we’re approaching a world where fully automatic systems are not just plausible but operational.</p> <h2 id="when-correctness-decouples-from-quality">When Correctness Decouples from Quality</h2> <p>This experiment makes one thing clear to me: goal-directed correctness is now decoupling from code quality.</p> <p>Compiling the Linux kernel is one of the most demanding integration tests in systems programming: enormous codebase, decades of accumulated assumptions, undefined behaviors, edge cases, architecture-specific quirks, implicit contracts. That a compiler written autonomously by LLM agents can get that far already feels extraordinary to me.</p> <p>Even with all the caveats (inefficient generated code, reliance on GCC for some phases, lack of elegance, partial shortcuts), the result surprised me. A 100,000-line clean-room compiler, capable of building the Linux kernel, would have been considered science fiction not long ago. The fact that it’s “ugly” by expert standards doesn’t really matter at this stage. What matters is that the system can navigate toward a precise goal, detect failures, adapt its behavior, and eventually satisfy a brutally strict external constraint.</p> <p>The agents aren’t “inventing” compiler theory. They’re not discovering new abstractions in the sense a human researcher might.
Instead, they’re performing massive, guided search in the space of known ideas, implementations, and failure modes, but at a scale and persistence no human could sustain. Thousands of iterations, relentless testing, endless retries, zero fatigue. The result is not a beautiful solution but a working one.</p> <p>And I think that’s the key shift.</p> <p>For many practical goals, ugly but correct is already sufficient. The traditional human advantage, writing clean, elegant, well-factored code, is no longer the decisive bottleneck for many tasks. What becomes decisive is the ability to define the right goal, design the right tests, and recognize when the system is fooling itself.</p> <p>Notice how much success in the Anthropic experiment comes not from “better prompts” but from better harnesses: carefully designed tests, oracles (GCC as a reference), feedback loops, constraints that make it possible for agents to orient themselves. The intelligence is not just in the model; it’s in the structure surrounding it.</p> <h2 id="what-this-means-for-human-expertise">What This Means for Human Expertise</h2> <p>This reframes human contribution. If LLM agents can already produce large, functional systems through persistent iteration and testing, human expertise shifts upward: toward theory, toward specification, toward reasoning about correctness, complexity, invariants, failure modes.</p> <p>This is exactly why I believe theoretical knowledge, algorithmic thinking, and mathematical maturity matter more, not less, in the age of LLMs. These are the tools needed to design the harness, not just the code (more on this in <a href="/blog/2026/extrapolation-interpolation_p3/">Part 3</a>).</p> <p>The world doesn’t need everyone to be a human GCC. But it desperately needs people who can tell whether a system is correct, understand why it works, see where it might break, and design constraints that prevent silent failure. That kind of judgment can’t be outsourced to automation. 
It’s built through deep understanding, abstraction, and disciplined reasoning, precisely the skills developed through theory-heavy courses and exploratory problem solving.</p> <p>The Anthropic compiler is impressive to me not because it replaces human programmers outright, but because it exposes where human value now lies. We’re moving from writing code to shaping computation. From producing artifacts to defining spaces in which artifacts can be safely generated.</p> <p>In that world, learning to think formally, whether through mathematics, algorithms, or programming-as-experimentation, isn’t an anachronism. It’s preparation.</p> <h2 id="additional-notes">Additional notes:</h2> <h3 id="the-debugging-nightmare-and-the-complexity-wall">The Debugging Nightmare and the “Complexity Wall”</h3> <p>There is a profound difference between a codebase that is “ugly” because of human laziness and one that is “ugly” because it was synthesized through millions of stochastic iterations. Human-written mess usually follows some form of idiosyncratic logic: there is a “ghost in the machine” you can eventually reason with. Machine-generated mess is often structurally alien.</p> <p>As we move toward a world of “Post-Elegance Engineering,” we face a terrifying debugging nightmare: the Complexity Wall. If a team of agents builds a 100,000-line system that functions today, what happens when it fails tomorrow in a way the agents cannot self-repair? We risk creating “digital black boxes”, systems that are functionally correct but cognitively impenetrable. Humans generally dislike working in environments where they lack “conceptual ownership”. If we cannot navigate the code, we cannot truly trust it. The psychological toll of maintaining a system one does not understand is a variable we haven’t yet factored into the future of work.
We don’t yet know if the efficiency of automated generation will eventually be canceled out by the sheer cognitive load of human oversight.</p> <h3 id="the-economics-of-the-agentic-employee">The Economics of the Agentic Employee</h3> <p>To put the Anthropic experiment in a sustainability perspective, consider the cost [2]. Building a functional C compiler for $20,000 in two weeks is an incredible feat of efficiency when compared to traditional labor.</p> <p>In 2026, a mid-level software engineer carries a total compensation package of roughly $150,000 to $400,000 when factoring in benefits and corporate overhead. A junior engineer might cost $90,000. If an AI agent burns $1,000 a day in API credits, its cost is roughly equivalent to the gross salary of a single full-time employee, but with zero “hidden” costs like healthcare, office space, or management latency. Crucially, the agent doesn’t sleep; it provides the output of an entire 16-person “team” for the price of one human specialist. We are no longer just buying a tool; we are leasing a workforce.</p> <p>
<em>Continue reading:</em></p> <ul> <li><em>Part 2: <a href="/blog/2026/extrapolation-interpolation_p2/">Formal Verification and the Path to Machine Discovery</a> - examining mathematical proof systems and machine extrapolation</em></li> <li><em>Part 3: <a href="/blog/2026/extrapolation-interpolation_p3/">Interpolation, Extrapolation, and What Exams Really Measure</a> - the fundamental distinction and what it means for education</em></li> </ul> <h1 id="references">References:</h1> <ol> <li><a href="https://factory.strongdm.ai/">strongDM</a>, <a href="https://web.archive.org/web/20260207182116/https://factory.strongdm.ai/">Wayback</a></li> <li><a href="https://simonwillison.net/2026/Feb/7/software-factory/">Simon Willison’s critique of StrongDM</a>, <a href="https://web.archive.org/web/20260207155128/https://simonwillison.net/2026/Feb/7/software-factory/">Wayback</a></li> <li><a href="https://www.anthropic.com/engineering/building-effective-agents">Agents</a>, <a href="https://web.archive.org/web/20260122170652/https://www.anthropic.com/engineering/building-effective-agents">Wayback</a></li> <li><a href="https://code.claude.com/docs/en/skills">Skills</a>, <a href="https://web.archive.org/web/20260207050926/https://code.claude.com/docs/en/skills">Wayback</a></li> <li><a href="https://modelcontextprotocol.io/">Model Context Protocol Specification</a>, <a href="https://archive.is/oT9GA">Wayback</a></li> <li><a href="https://www.anthropic.com/engineering/building-c-compiler">Anthropic C Compiler using Agents</a>, <a href="https://web.archive.org/web/20260207021716/https://www.anthropic.com/engineering/building-c-compiler">Wayback</a></li> </ol>]]></content><author><name>Daniele Bonatto</name></author><category term="teaching"/><category term="llm"/><category term="teaching"/><category term="llm"/><summary type="html"><![CDATA[How LLM agents are decoupling correctness from elegance, and what this means for the future of software development and human
expertise.]]></summary></entry><entry><title type="html">Teaching a Graph of Knowledge as a Story</title><link href="https://dbonattoj.github.io/blog/2025/knowledge-graph/" rel="alternate" type="text/html" title="Teaching a Graph of Knowledge as a Story"/><published>2025-12-28T10:49:35+01:00</published><updated>2025-12-28T10:49:35+01:00</updated><id>https://dbonattoj.github.io/blog/2025/knowledge-graph</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2025/knowledge-graph/"><![CDATA[<p>I have the chance to teach several courses: computer graphics, CUDA programming, and machine learning. One of the main difficulties I constantly run into is that the content of these courses does not form a linear sequence. Instead, it forms a graph of knowledge.</p> <p>This is especially true for machine learning. The field evolves quickly, new ideas appear constantly, and many concepts depend on each other in non-trivial ways. There is no obvious “Chapter 1 → Chapter 2 → Chapter 3” path that feels natural or complete. CUDA programming is no different: it assumes parallel thinking long before students have fully developed it. Computer graphics is no better, as it requires understanding the full pipeline that produces an image on a computer before knowing how and why each part works.</p> <p>And yet, teaching happens in time. Slides advance one after the other, and lectures unfold sequentially. Teaching is linear by necessity. This tension, between a graph of concepts and a linear presentation, is at the core of how I think about teaching. The question, then, is not <em>whether</em> we impose a linear order, but <em>how</em>.</p> <p>Modern technical education often assumes that knowledge can be cleanly reduced to outlines, tables of contents, and dependency trees. If the structure is correct, the thinking will follow, or so the assumption goes. This idea is not new. 
It can be traced back at least to Petrus Ramus, who believed that all knowledge should be reduced to orderly diagrams and linear schematics. His influence is still visible today: chapters, subsections, bullet points, and curricula that promise clarity through structure alone.</p> <p>But there has always been another tradition. Ramon Llull, and others like him, understood that humans do not learn primarily by traversing diagrams. They learn through <strong>stories, images, and meaning</strong>, even when the underlying structure is complex.</p> <p>This way of thinking also connects naturally to the idea of threshold concepts [1][2] in education: concepts that, once understood, fundamentally transform how a student sees a subject. I think the insights I try to create are a sort of threshold concept. Threshold concepts are often troublesome, irreversible, and integrative; they reorganize the mental landscape rather than adding another node to it. The advantage of a threshold concept is that, when you finally grasp it, you cannot ever forget or unsee it. In my experience, many of the hardest moments for students occur precisely when they are approaching such thresholds. A strictly structural or chapter-driven approach tends to hide these moments, while a narrative, insight-driven approach can bring them to the foreground.</p> <p>I sometimes refer to this approach as insight-based teaching. It is not opposed to structure, but it does not start from it. It is influenced by my view on <a href="/blog/2024/inquiry-based-learning/">inquiry-based learning</a>, in the sense that understanding is built through exploration, questioning, and reframing, rather than through passive traversal of a predefined outline. The structure emerges after the insight, not before.</p> <h2 id="why-i-dont-teach-strictly-by-chapters">Why I Don’t Teach Strictly by Chapters</h2> <p>I do have chapters. I do have sections. But they are not the backbone of my courses.
Instead, the core structure follows something closer to a story. Ideas are introduced when they become meaningful, not necessarily when they would be considered “correct” according to a strict table of contents.</p> <p>In my slides, I focus on maintaining a clear linear path of understanding, without relying on future knowledge to make sense of the present. This warps the underlying graph in unexpected ways and can sometimes look like unstructured content. But make no mistake: this does not mean the material is unstructured.</p> <p>On the contrary, I put a great deal of effort into linearizing the concepts, arranging them in an order that makes the story feel coherent, causal, and logical, even if that order does not match the true structure of the conceptual graph. I prioritize insights over formal structure. The goal is that, at any given moment, students feel that what comes next makes sense given what they already know, and that, at each step, they understand <em>why</em> the next idea appears.</p> <p>This is where I consciously diverge from the Ramist impulse to reduce everything to schematic order. Structure matters, but it should serve understanding, not replace it.</p> <h2 id="a-parallel-with-how-mathematics-is-often-taught">A Parallel with How Mathematics Is Often Taught</h2> <p>This way of teaching is not as unusual as it may seem. In many universities, pure mathematics has long been taught in a similar spirit. A lecturer may enter the room without slides, start with a question or a mathematical object, and then proceed by exploring its properties, constraints, and consequences. The lecture unfolds as an investigation rather than as the execution of a prewritten script.</p> <p>In such settings, the direction of the lecture often depends on what students ask, what confuses them, and which paths seem worth following. The structure is not fully specified in advance.
The implicit chapter might be “we should understand this object”, but the exact route, and sometimes even the intermediate results, are not always known beforehand. What matters is that the object is explored thoroughly and meaningfully by the end.</p> <p>This style of teaching is inherently unstructured in appearance, yet deeply structured in intent. Its coherence does not come from a predefined list of topics, but from the internal logic of the mathematical object itself. The exploration continues until the object has been seen from enough angles to become stable in the students’ minds.</p> <p>Once this global map has been built, later and more advanced lectures can afford to focus on specific parts of it. At that point, the teaching can become more technical, more specialized, and sometimes even drier. But this dryness is no longer a problem: the underlying structure is already understood, the key insights are in place, and new details have somewhere to attach. Precision can replace exploration precisely because exploration has already done its job.</p> <p>Another characteristic of this exploratory style is how prerequisites are handled. While investigating an object, it may become clear that students are missing some necessary background. Rather than treating this as a failure of preparation, the lecture naturally opens a parenthesis: the missing concept is introduced on the spot, just deeply enough to allow the exploration to continue. This detour is itself exploratory, motivated by an immediate need rather than by an abstract curriculum requirement.</p> <p>A good example of this approach can be seen in Po-Shen Loh’s lectures on discrete mathematics, such as his <a href="https://www.youtube.com/watch?v=0K540qqyJJU&amp;list=PLgTkKBA6LRqYuuQ-LboerRblBoD_q_eUM">CMU course</a>. Loh, the American IMO coach, regularly pauses the main line of exploration to develop just enough background for the argument to move forward.
These moments are not digressions; they are integral to the process. The prerequisite is learned because it is needed, and its purpose becomes immediately clear. This style is one of the reasons his teaching, both in lectures and through platforms like Expii, is so widely appreciated by students.</p> <p>My own approach is strongly influenced by this tradition, even if the medium is different. I prefer using slides not because I want tighter control over content, but because I do not want students to spend their time transcribing what I say. Slides allow them to focus on the ideas as they unfold, while providing carefully designed figures and notation that would be cumbersome to recreate by hand.</p> <p>Despite the presence of slides, I try to teach as if the lecture were an exploration. The sequence is guided, but not rigid. The goal is not to cover a checklist of results, but to investigate an idea until it becomes clear. In that sense, the slides act less as a script and more as a shared workspace: a place where questions, visualizations, and partial insights can accumulate and connect.</p> <p>This is also why the content of a “chapter” is not always fully determined in advance. From time to time, I add verbal or blackboard explorations in response to students’ questions. What must be explored is known; how that exploration unfolds depends on the students, the questions that arise, and the conceptual obstacles encountered along the way. As long as the object has been genuinely explored, the chapter has done its job.</p> <h2 id="the-necessary-strain-of-learning">The Necessary Strain of Learning</h2> <p>This approach inevitably creates some strain for students. They are not used to material that is not presented strictly as a table of contents. When they study, they need to revisit the material and form their own structure, one that, at that point, begins to resemble the true underlying graph. 
This is demanding work.</p> <p>But even if I followed a strict table-of-contents path, that strain would exist anyway. If I strictly followed chapters and subsections, the same problems would appear: chapters would constantly refer to material further ahead in the course, or assume background knowledge students do not yet fully have. The strain is unavoidable, the question is whether it is meaningful.</p> <p>I believe it is. Learning is not passive consumption, it is active reconstruction. One of the most important parts of learning is the student’s own work of restructuring knowledge into something that makes sense to them. That effort, reorganizing, connecting, revisiting, is where real understanding forms.</p> <p>Once concepts have been understood through a linear narrative, it becomes much easier to mentally reconstruct the original graph: to move freely between ideas, to apply them out of order, and to form higher-level insights. But that reconstruction must happen internally. No outline can do it for them.</p> <h2 id="why-books-often-feel-hard-to-follow">Why Books Often Feel Hard to Follow</h2> <p>If students want pure reference material, they can easily pick a book; many excellent ones are freely available. Books often present knowledge closer to its graph structure, supported by a table of contents. In machine learning, I often see three recurring patterns, usually mixed together.</p> <p>The first is the <strong>grand overview</strong>: a long introductory chapter that covers the entire field in a hand-wavy way, followed by chapters that zoom into details while silently assuming the introduction is now fully internalized. The reader is expected to rework those initial concepts in the context of later chapters, a task that is difficult and often one students are reluctant to undertake.</p> <p>The second is the <strong>assumed background</strong>. 
Books freely use tools like convex optimization, probability theory, or linear algebra, assuming readers have seen them before, without making those assumptions explicit. While it is understandable that authors cannot explain everything for every audience, it is unlikely that the author’s background and the student’s background align closely enough for this to work seamlessly.</p> <p>The third is the <strong>forward-reference loop</strong>. Chapters repeatedly refer to concepts “explained later,” forcing the reader to constantly jump ahead or trust that things will eventually make sense. This requires continuous mental bookmarking and frequent back-and-forth through the material.</p> <p>None of these approaches are wrong. But they require significant effort from the reader, and they are often difficult precisely because they expose the graph of knowledge directly. The strain on the student would be there anyway, just in a different form.</p> <h2 id="reconstructing-a-graph-not-the-graph">Reconstructing a Graph, Not the Graph</h2> <p>Over time, I have seen several students independently adopt tools like Obsidian, often using its knowledge graph view to organize their notes. What matters to me is not the tool itself, but what it reveals about how understanding evolves.</p> <div style="text-align: center;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/obsidian-480.webp 480w,/assets/img/posts/obsidian-800.webp 800w,/assets/img/posts/obsidian-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/posts/obsidian.png" class="img-fluid rounded z-depth-1" width="50%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> <figcaption class="caption">Figure 1. Example of obsidian graph of 5900 curated notes. 
Credits <a href="https://www.reddit.com/r/ObsidianMD/comments/1iaqa5d/my_adhd_digital_brain_obsidian_graph_after_15/">reddit</a>.</figcaption> </figure> </div> <p>These graphs do not mirror the underlying structure of the material. They are the student’s own graph: partial, uneven, sometimes messy, and constantly evolving. And that is precisely the point.</p> <p>A beginner typically experiences a subject linearly. Concepts arrive one after another, and understanding is local: “this follows that”. As learning progresses, that linear view slowly breaks down. Ideas begin to connect across lectures, across chapters, and sometimes across courses. The material stops feeling like a path and starts feeling like a space.</p> <p>I believe that reaching this stage is a strong marker of real understanding. A true expert does not hold a subject as a sequence, but as a richly interconnected graph of ideas. From that graph, one can see not only what is known, but also what is missing: gaps in understanding, weak connections, and unexplored paths. Questions arise naturally from the structure itself.</p> <p>Graph-based note-taking tools make this transition visible. They externalize a process that normally happens only in the mind: the gradual construction of a personal knowledge graph. The goal is not to match some canonical structure, but to build a coherent internal model that can be navigated flexibly.</p> <p>In that sense, the linear narrative of a course is a temporary scaffold. Mastery begins when students no longer rely on that scaffold, and can move freely within their own graph; seeing connections, noticing absences, and extending the structure as new ideas are encountered.</p> <h2 id="my-teaching-goal">My Teaching Goal</h2> <p>I know that reference material exists, and I want students to be able to use it effectively. My focus, however, is different. 
I aim to provide core insights: the main ideas that make everything else click once students encounter it again in a book or a paper. I want them to build mental images, intuitions, and internal structure.</p> <p>That is why I teach more like a story than like a table of contents. If students walk away with the right insights, then when they later read material that is denser, drier, and more complete (as reference material often must be), it suddenly becomes clear. The formulas stop being symbols on a page and start representing something meaningful. Those well-interconnected core insights form the beginning of their internal graph, which they can expand with other lectures and references.</p> <p>This does ask something of the students: they need to reorganize the material for themselves, even if I try to make that as easy as possible. But that effort is not a drawback, it is the point.</p> <p>My purpose as a teacher is to teach, not to throw a book at students and ask them to recite parts of it. Teaching, to me, is not about transmitting a perfectly ordered diagram of knowledge. It is about guiding students toward understanding. Otherwise, what would be the point of universities, if one could simply read enough books?</p> <p>I find my time, and theirs, best spent when I give core insights that come from experience and long engagement with the material, insights that are not readily accessible in the books. Not because they are absent from books, but because they are often buried beneath layers of formalism.</p> <p>One important difference today, compared to even a few years ago, is that students are no longer alone when trying to reconstruct structure. In the age of large language models (LLMs), students can actively interrogate the material: asking questions about slides, requesting alternative explanations, generating plots to explore how equations behave, or testing “what happens if” variations that would have been tedious to do by hand. 
Used properly, these tools can help compensate for the lack of an explicit table of contents by supporting exploration and sense-making.</p> <p>But this only works if students practice the right skills. Asking good questions, forming hypotheses, building and refining mental images, and checking intuition against computation are themselves part of learning. Insight does not come from the tool, it comes from the process. LLMs can accelerate that process, but they cannot replace the internal work of forming understanding.</p> <h2 id="the-role-of-visual-thinking">The Role of Visual Thinking</h2> <p>Finally, I put a lot of effort into figures. Many of the subjects I teach rely on abstract mathematics and symbolic reasoning, and visual representations help anchor these abstractions. They connect formulas to geometry, computation to space, and algorithms to intuition.</p> <p>This is not aesthetic embellishment. It is a deliberate pedagogical choice, closer to Llull’s illustrated reasoning than to Ramus’s bare schematics. Equations are essential, but without mental images they remain fragile. Visuals help students truly own the concepts, not just manipulate them.</p> <p>This emphasis on visual and conceptual thinking is not new. It echoes ideas found in works such as Engel’s book on programming mathematics, recently being <a href="https://coe.psu.ac.th/ad/explore/">updated to python</a>, where programming is presented not merely as implementation, but as a way of thinking mathematically, externalizing structure, testing intuition, and refining insight through concrete experimentation. Writing code, drawing figures, and manipulating equations are all ways of thinking, not just producing results.</p> <p>In that sense, building mental images and insights is a craft. 
It can be practiced, refined, and taught, even if it cannot be fully reduced to a checklist or syllabus.</p> <h2 id="in-the-end">In the End</h2> <p>Teaching a graph of knowledge in linear time is inherently difficult. There is no perfect ordering, no universal outline. But I believe that a carefully constructed narrative that embraces intuition, one that respects structure without being dominated by it, helps students do more than follow along. It helps them think.</p> <p>And once they can think with the material, they can navigate the graph on their own.</p> <p>References:</p> <ol> <li>Meyer, J. H. F., &amp; Land, R. (2005). Threshold Concepts and Troublesome Knowledge (2): Epistemological Considerations and a Conceptual Framework for Teaching and Learning. Higher Education, 49(3), 373–388. <a href="http://www.jstor.org/stable/25068074">25068074</a></li> <li>Breen, S., &amp; O’Shea, A. (2016). Threshold Concepts and Undergraduate Mathematics Teaching. PRIMUS, 26(9), 837–847. <a href="https://doi.org/10.1080/10511970.2016.1191573">10.1080/10511970.2016.1191573</a></li> </ol>]]></content><author><name>Daniele Bonatto</name></author><category term="teaching"/><summary type="html"><![CDATA[A reflection on teaching complex technical subjects as graphs of knowledge, drawing on narrative structure, threshold concepts, inquiry-based learning, and visual reasoning.]]></summary></entry><entry><title type="html">NHST vs. Bayesian Data Analysis, and what about machine learning?</title><link href="https://dbonattoj.github.io/blog/2025/nhst-bayesian-data-analysis/" rel="alternate" type="text/html" title="NHST vs.
Bayesian Data Analysis, and what about machine learning?"/><published>2025-10-01T02:30:42+02:00</published><updated>2025-10-01T02:30:42+02:00</updated><id>https://dbonattoj.github.io/blog/2025/nhst-bayesian-data-analysis</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2025/nhst-bayesian-data-analysis/"><![CDATA[<h1 id="why-i-prefer-bayesian-data-analysis-over-nhst">Why I Prefer Bayesian Data Analysis Over NHST</h1> <p>I’m by no means a statistician, and what follows reflects only my view. Still, I’ve spent enough time wrestling with statistical tools in research to have formed some opinions on why <strong>Null Hypothesis Significance Testing (NHST)</strong> often feels unsatisfying, and why I find <strong>Bayesian data analysis</strong> a better fit for how we actually think about uncertainty.</p> <p>Most of us were first introduced to NHST in our early stats classes. It goes something like this:</p> <ul> <li>Assume a null hypothesis.</li> <li>Compute a test statistic from your data.</li> <li>Get a <strong>p-value</strong>, the probability of observing something at least as extreme as your result under the null.</li> <li>If the p-value is below a magic threshold (often 0.05), you “reject” the null.</li> </ul> <p>It sounds neat and tidy, but the interpretation is tricky. A p-value is <em>not</em> the probability that your hypothesis is true. It is not the probability of being wrong. It’s only the probability of seeing your data (or something more extreme) if the null were correct - a subtlety that often leads to confusion.</p> <hr/> <h3 id="the-infinite-space-of-statistical-tests">The Infinite Space of Statistical Tests</h3> <p>Choosing the “right” test is harder than it looks. Imagine the design space:</p> <ul> <li>Sample size \(S\)</li> <li>Parametric vs.
non-parametric assumptions \(PNP\)</li> <li>Distribution families and their parameters \(F_p\)</li> <li>Single or multiple hypotheses \(H\)</li> <li>Dependence structures \(D\)</li> <li>And countless other details \(O\)</li> </ul> <p>This space</p> \[\Omega = (S, PNP, F_p, H, D, O)\] <p>is effectively infinite.</p> <p>Throughout history, tests were developed as answers to very specific (often industrial) problems: Student’s <em>t</em>-test, Fisher’s exact test, Wilcoxon rank-sum, ANOVA… Each of them is just one <strong>point in the infinite space of possible tests</strong>. Some are more useful and general, so they make it into textbooks. The hope is: if you can correctly map your problem to one of these points, you can apply the corresponding test.</p> <p>But navigating this space is not straightforward. Different textbooks, authors, or software packages organize it differently. One might start with the number of samples, another with distributional assumptions, another with independence. There is no universal decision tree that always lands you on the right test.</p> <hr/> <h3 id="the-problem-of-rigidity">The Problem of Rigidity</h3> <p>One challenge in teaching NHST is that we rarely show how to <strong>extend tests</strong>. Suppose you’ve identified the “correct” test - but your problem has a slight twist: maybe the null hypothesis isn’t standard, or one assumption doesn’t quite hold. Do you invent a whole new test? Or can you adapt the existing one? This question is often left unanswered.</p> <p>In reality, many of our assumptions are hidden or unknown. We might not even realize what we’re assuming about independence, variance, or distributions until things go wrong.</p> <p>And even when people do their best: A striking demonstration of this problem comes from a recent study in social science [1], where 73 independent research teams analyzed the same dataset to test whether immigration affects support for social policies. 
Despite identical data, teams’ conclusions ranged from strongly negative to strongly positive effects. The researchers found that the analytical choices themselves explained very little of the variation — most of the differences came from a “hidden universe of uncertainty”. This perfectly illustrates how, in NHST, even skilled analysts can reach opposite conclusions because the framework doesn’t make assumptions explicit or robust. Bayesian modeling, by contrast, encourages transparency: priors, likelihoods, and hierarchical structure make the uncertainty visible rather than hidden behind a p-value.</p> <p>Worse, academic publishing often reinforces rigidity. Papers sometimes get rejected not because the analysis was flawed, but because it didn’t fit into a standard NHST interval or template. A Bayesian analysis may provide clear and valid evidence, yet remain “uninterpretable” to reviewers trained only in p-values.</p> <hr/> <h3 id="the-bayesian-alternative">The Bayesian Alternative</h3> <p>Bayesian data analysis takes a different route. Instead of navigating the infinite map of pre-built tests, you <strong>build a model</strong>. You write down your assumptions explicitly (priors, likelihoods), combine them with your data, and then compute the posterior distribution. From there, you can answer the questions that actually matter: how likely is a parameter to fall in a certain range? 
What is the probability that one hypothesis is more supported than another?</p> <p>Mathematically:</p> \[p(\theta \mid X) \propto p(X \mid \theta) \, p(\theta)\] <ul> <li>\(p(\theta)\): your prior, the assumptions you bring.</li> <li>\(p(X \mid \theta)\): the likelihood, describing how data arise under the model</li> <li>\(p(\theta \mid X)\): the posterior, what you learn after seeing the data</li> </ul> <p>From the posterior, you can compute directly:</p> <ul> <li>\(\Pr(\theta &gt; 0 \mid X)\), the probability that an effect is positive</li> <li>Credible intervals for parameters</li> <li>Predictions for future data</li> </ul> <p>In this view, inference is not about finding the right test in \(\Omega\). It’s about writing down a plausible model and letting Bayes’ theorem do the work. The model <em>is</em> the test.</p> <p>This approach is more flexible. It’s essentially <strong>test-free</strong>. The model is the test. If you can specify it, you can run it. Instead of memorizing dozens of special-purpose procedures, you work in one unified framework. The quality of the results depends on the quality of the model - but at least the assumptions are visible, not hidden.</p> <p>This is only widely feasible now that we have the computational power to sample from posteriors. For most of the 20th century, NHST made sense as the practical choice. But today, Bayesian methods are gaining traction, even if they’re not yet mainstream. Many good Bayesian papers are still rejected by journals locked into NHST norms - but the field is shifting.</p> <p>Of course, this doesn’t mean Bayesian inference is truly “test-free”. Instead of choosing among a fixed menu of NHST procedures, you’re choosing among an effectively infinite space of models: priors, likelihood forms, hierarchical structures, approximations. Those choices can be just as contentious as picking the “right” statistical test. The key difference is transparency. 
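To make the “model is the test” idea concrete, here is a minimal sketch that approximates a posterior on a grid; the data (7 successes in 10 trials) and the flat Beta(1,1) prior are invented purely for illustration:

```python
# Grid approximation of a posterior, mirroring p(theta | X) ∝ p(X | theta) p(theta).
# Hypothetical data for illustration: 7 successes in 10 trials.
successes, trials = 7, 10
grid = [i / 1000 for i in range(1, 1000)]       # candidate values of theta
prior = [1.0 for _ in grid]                     # flat Beta(1, 1) prior
like = [t**successes * (1 - t)**(trials - successes) for t in grid]
unnorm = [l * p for l, p in zip(like, prior)]
z = sum(unnorm)
post = [u / z for u in unnorm]                  # normalized posterior

# A direct probability statement, which a p-value cannot give you:
p_gt_half = sum(p for t, p in zip(grid, post) if t > 0.5)
print(f"Pr(theta > 0.5 | X) = {p_gt_half:.3f}")

# A 90% central credible interval, read off the posterior CDF
cdf, lo, hi = 0.0, None, None
for t, p in zip(grid, post):
    cdf += p
    if lo is None and cdf >= 0.05:
        lo = t
    if hi is None and cdf >= 0.95:
        hi = t
print(f"90% credible interval: [{lo:.2f}, {hi:.2f}]")
```

One posterior answers several questions at once (direction of the effect, a credible range for its size), each of which would need its own procedure in the NHST toolbox.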
In Bayesian analysis, assumptions are visible and debatable, rather than hidden inside the machinery of a pre-packaged test.</p> <hr/> <h3 id="a-note-on-machine-learning">A Note on Machine Learning</h3> <p>Some might ask: why not just use machine learning (ML) instead? After all, ML also seems test-free. The difference is data regime and goals. In machine learning, we often have massive datasets. The model is “learned” automatically, with fewer explicit assumptions, and optimized for prediction, not interpretability.</p> <p>Bayesian analysis sits in a different niche: smaller datasets, richer models, and a need for interpretable uncertainty. Instead of black-box predictions, you get probabilities and insights grounded in your domain knowledge.</p> <p>In some sense, ML lets the data find its own place in \(\Omega\), but without making explicit which assumptions are being chosen. Bayesian modeling, by contrast, forces you to declare your assumptions and gives you transparent probabilities rather than opaque predictions.</p> <p>Do we still need to learn Bayesian Data Analysis, you might ask? My answer is yes - there is still value. If you don’t have enough data to apply machine learning, Bayesian analysis remains one of the most powerful ways to obtain a satisfying result. 
But even if you never directly apply it because you <em>do</em> have enough data, many of the methods that improved deep learning in recent years can be understood as essentially Bayesian ideas in disguise.</p> <p>Take <strong>fine-tuning</strong> for example: starting from a pre-trained model is nothing more than using a strong prior \(p(\theta)\), and then updating it with new data \(X_{\text{new}}\) via Bayes’ rule,</p> \[p(\theta \mid X_{\text{new}}) \propto p(X_{\text{new}} \mid \theta) \, p(\theta).\] <p>What practitioners call “adapting weights to a new dataset” is simply posterior updating.</p> <p>Or consider <strong>dropout</strong>: at training time, we randomly mask neurons, which in a Bayesian interpretation corresponds to integrating over a distribution of thinned networks. This can be formalized as an approximation to Bayesian model averaging,</p> \[p(y \mid x, X) \approx \tfrac{1}{T} \sum_{t=1}^T p(y \mid x, \theta_t),\] <p>where each \(\theta_t\) is a sampled subnetwork under dropout.</p> <p><strong>Style transfer</strong> can also be read through a Bayesian lens, though here the interpretation is more metaphorical than standard. The process of generating an image \(I\) that balances fidelity to a content image \(I_c\) with similarity to the distribution of a style image \(I_s\) looks like a posterior tradeoff:</p> \[p(I \mid I_c, I_s) \propto p(I_c \mid I)^{\alpha} \, p(I \mid I_s)^{\beta},\] <p>where the exponents \(\alpha\) and \(\beta\) act like prior strengths. In practice, ML papers frame this as an optimization of loss functions, not as Bayesian inference. Still, the analogy is useful: what looks like a balancing of competing objectives can often be reframed as a Bayesian updating problem.</p> <p>Even <strong>regularization</strong> in its most basic form has a Bayesian interpretation.
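The simplest of these correspondences is easy to check numerically. Below is a sketch with invented toy data, assuming a unit-variance Gaussian likelihood and a zero-mean Gaussian prior whose precision equals the penalty strength; under those assumptions, gradient descent on the L2-penalized loss recovers the closed-form MAP estimate:

```python
# Toy numeric check (data invented for illustration): minimizing a squared loss
# with an L2 penalty gives the same answer as MAP estimation under a Gaussian
# prior on the weight.
xs = [0.5, 1.0, 1.5, 2.0, 3.0]
ys = [1.1, 1.9, 3.2, 4.1, 5.8]   # roughly y = 2x plus noise
lam = 2.0                         # penalty strength = prior precision

# (1) Gradient descent on the penalized loss L(w) = sum((w*x - y)^2) + lam * w^2
w = 0.0
for _ in range(5000):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) + 2 * lam * w
    w -= 0.01 * grad

# (2) Closed-form MAP estimate for the model y ~ N(w*x, 1), w ~ N(0, 1/lam):
#     the posterior mode is sum(x*y) / (sum(x*x) + lam)
w_map = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

print(w, w_map)  # the two estimates coincide
```

The agreement is exact because both objectives are the same quadratic up to a constant; changing the prior variance is equivalent to changing the penalty strength.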
For instance, L2 regularization (weight decay) corresponds to placing a Gaussian prior on the parameters,</p> \[p(\theta) \propto \exp\left(-\tfrac{\lambda}{2} |\theta|^2\right).\] <p>Training with a penalty is just maximum a posteriori (MAP) estimation.</p> <p>So, even in a world dominated by machine learning, Bayesian analysis has enduring value: it provides a language and framework that helps us see the hidden logic behind many of the tools we already use.</p> <hr/> <h3 id="in-the-end">In the End</h3> <p>NHST has given us a toolbox of powerful, historically useful procedures. But it’s a toolbox built from scattered points in an infinite design space. Bayesian data analysis offers a more direct route: start from your model, combine it with your data, and let inference follow naturally.</p> <p>It doesn’t eliminate the need for judgment, but it does make the process clearer, more flexible, and - in my view - closer to how science should reason under uncertainty.</p> <h1 id="references">References</h1> <p>[1] <a href="https://www.pnas.org/doi/10.1073/pnas.2203150119">10.1073/pnas.2203150119</a></p>]]></content><author><name>Daniele Bonatto</name></author><category term="research,"/><category term="methods"/><category term="statistics,"/><category term="bayesian,"/><category term="nhst,"/><category term="machine-learning"/><summary type="html"><![CDATA[A personal view on why Bayesian analysis feels more natural than NHST, and how it connects to machine learning.]]></summary></entry><entry><title type="html">Better Scientific Writing Prompts</title><link href="https://dbonattoj.github.io/blog/2025/better-prompt-scientific-writing/" rel="alternate" type="text/html" title="Better Scientific Writing Prompts"/><published>2025-05-31T23:38:42+02:00</published><updated>2025-05-31T23:38:42+02:00</updated><id>https://dbonattoj.github.io/blog/2025/better-prompt-scientific-writing</id><content type="html" 
xml:base="https://dbonattoj.github.io/blog/2025/better-prompt-scientific-writing/"><![CDATA[<p>Large Language Models (LLMs) have significantly changed how we write. For instance, <a href="https://www.theguardian.com/technology/2024/apr/16/techscape-ai-gadgest-humane-ai-pin-chatgpt">until recently</a>, the word “delve” was a clear signal that a text had been generated by ChatGPT. This was traced back to training data curated by a team - many based in Nigeria - where “delve” appeared far more frequently than in American English. Over time, however, this kind of signal has become less reliable, as people have unconsciously adopted LLM writing patterns.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/posts/delve-480.webp 480w,/assets/img/posts/delve-800.webp 800w,/assets/img/posts/delve-1400.webp 1400w," sizes="95vw" type="image/webp"/> <img src="/assets/img/posts/delve.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Personally, LLMs have reshaped my workflow. As a non-native English speaker, I use them to improve sentence clarity. I still write the content myself, and use models mainly to clean and polish the output. But I also supervise students who now submit reports heavily reliant on LLMs - often without proper editing. The results are longer, more repetitive documents, filled with vague or empty phrasing. 
It’s increasingly common to find multiple paragraphs repeating the same idea with only minor word changes.</p> <p>But I’ve found a small secret weapon to improve my writing - and I secretly hope my students, or others, will come across this post and use it too.</p> <p>The trick is this: In most of my prompts, I include the phrase “<strong>without fluff</strong>.” Just those two words often make the output noticeably better - more direct, more precise, and easier to read.</p> <p>I believe I came up with this on my own, but it’s possible I read something similar before and it stuck with me.</p> <p>So, I’m sharing this tip here.</p> <p>It’s important to note that while adding “without fluff” effectively reduces verbosity and improves clarity, it does not address all challenges of AI-generated writing—such as ensuring factual accuracy or deeper logical coherence.</p> <p><strong>Bonus</strong>: it removes excessive praise like “Great question!” or “Interesting point!”.</p> <h2 id="examples">Examples</h2> <p>To demonstrate how much this changes the result, here are some side-by-side comparisons using a typical scientific writing prompt:</p> <h3 id="rewriting">Rewriting</h3> <p><strong>Prompt</strong>: But, I found a secret weapon to improve further my writing and I (secretly) hope that my students or other students around the world will find this post and use it.</p> <p><strong>Without “without fluff”</strong>: But along the way, I stumbled upon a surprisingly powerful little secret that has significantly elevated the quality of my writing - and I can’t help but quietly hope that one day, perhaps even by chance, some of my students or aspiring writers around the world will discover this very post and take advantage of it too.</p> <p><strong>With “without fluff”</strong>: But I’ve found a small secret weapon to improve my writing - and I secretly hope my students, or others, will come across this post and use it too.</p> <h3 id="text-summarization">Text summarization</h3> 
<p><strong>Prompt</strong>: “Summarize the contribution of the paper in one paragraph.”</p> <p><strong>Without “without fluff”</strong>: The paper delves into an exploration of recent advances in transformer architectures, particularly focusing on their applicability to natural language understanding tasks. It offers a comprehensive overview of the model design, training strategies, and benchmark results, showcasing its promising potential in improving downstream performance. The authors highlight the novelty of their contribution while emphasizing the implications for future research.</p> <p><strong>With “without fluff”</strong>: The paper introduces a modified transformer architecture for natural language tasks. It details the model design, training method, and benchmark results. The main contribution is a performance improvement on standard datasets using fewer parameters.</p>]]></content><author><name>Daniele Bonatto</name></author><category term="productivity"/><category term="writing,"/><category term="LLMs,"/><category term="education"/><summary type="html"><![CDATA[A practical writing prompt trick to reduce LLM-generated random words in scientific writing.]]></summary></entry><entry><title type="html">Strong Types in Scientific Software - Safety and Pitfalls</title><link href="https://dbonattoj.github.io/blog/2025/Inheritance-like-problem-with-types-programming/" rel="alternate" type="text/html" title="Strong Types in Scientific Software - Safety and Pitfalls"/><published>2025-05-31T23:38:42+02:00</published><updated>2025-05-31T23:38:42+02:00</updated><id>https://dbonattoj.github.io/blog/2025/Inheritance-like-problem-with-types-programming</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2025/Inheritance-like-problem-with-types-programming/"><![CDATA[<h2 id="why-i-use-rich-types-in-my-code--even-for-temperature">Why I Use Rich Types in My Code — Even for Temperature</h2> <p>I’ve seen many interesting posts online about programming with 
types, especially around the idea of avoiding <em>naked primitives</em> like <code class="language-plaintext highlighter-rouge">int</code>, <code class="language-plaintext highlighter-rouge">float</code>, or <code class="language-plaintext highlighter-rouge">size_t</code> for domain-specific values. Instead, we should define real types that carry meaning.</p> <h3 id="a-classic-problem">A Classic Problem</h3> <p>Here’s an example of what can go wrong when everything is just a number:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">get_money</span><span class="p">(</span><span class="n">an</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">f64</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">account</span> <span class="o">=</span> <span class="nf">get_account</span><span class="p">(</span><span class="n">an</span><span class="p">);</span>
    <span class="n">account</span><span class="nf">.get_balance</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">get_birthday</span><span class="p">(</span><span class="n">un</span><span class="p">:</span> <span class="nb">usize</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Date</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">user</span> <span class="o">=</span> <span class="nf">get_user</span><span class="p">(</span><span class="n">un</span><span class="p">);</span>
    <span class="n">user</span><span class="nf">.get_birthday</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">un</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="mi">123456</span><span class="p">;</span> <span class="c1">// user number  </span>
<span class="k">let</span> <span class="n">an</span><span class="p">:</span> <span class="nb">usize</span> <span class="o">=</span> <span class="mi">147258</span><span class="p">;</span> <span class="c1">// account number</span>

<span class="nf">get_money</span><span class="p">(</span><span class="n">un</span><span class="p">);</span>    <span class="c1">// no compile error!</span>
<span class="nf">get_birthday</span><span class="p">(</span><span class="n">an</span><span class="p">);</span> <span class="c1">// still compiles!</span>
</code></pre></div></div> <p>There’s no type distinction between a user ID and an account number. Swapping them leads to nonsense, but the compiler can’t help us.</p> <p>Now let’s add some real types:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="nf">UserNumber</span><span class="p">(</span><span class="nb">usize</span><span class="p">);</span>
<span class="k">struct</span> <span class="nf">AccountNumber</span><span class="p">(</span><span class="nb">usize</span><span class="p">);</span>

<span class="k">fn</span> <span class="nf">get_money</span><span class="p">(</span><span class="n">an</span><span class="p">:</span> <span class="n">AccountNumber</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">f64</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">account</span> <span class="o">=</span> <span class="nf">get_account</span><span class="p">(</span><span class="n">an</span><span class="p">);</span>
    <span class="n">account</span><span class="nf">.get_balance</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">fn</span> <span class="nf">get_birthday</span><span class="p">(</span><span class="n">un</span><span class="p">:</span> <span class="n">UserNumber</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Date</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">user</span> <span class="o">=</span> <span class="nf">get_user</span><span class="p">(</span><span class="n">un</span><span class="p">);</span>
    <span class="n">user</span><span class="nf">.get_birthday</span><span class="p">()</span>
<span class="p">}</span>

<span class="k">let</span> <span class="n">un</span> <span class="o">=</span> <span class="nf">UserNumber</span><span class="p">(</span><span class="mi">123456</span><span class="p">);</span>
<span class="k">let</span> <span class="n">an</span> <span class="o">=</span> <span class="nf">AccountNumber</span><span class="p">(</span><span class="mi">147258</span><span class="p">);</span>

<span class="nf">get_money</span><span class="p">(</span><span class="n">un</span><span class="p">);</span>    <span class="c1">// compile error!</span>
<span class="nf">get_birthday</span><span class="p">(</span><span class="n">an</span><span class="p">);</span> <span class="c1">// compile error!</span>
</code></pre></div></div> <p>Much better. The compiler now guards us from mixing up concepts that don’t belong together.</p> <h3 id="encoding-domain-logic-in-types">Encoding Domain Logic in Types</h3> <p>Strong types don’t just prevent mistakes—they let you encode logic directly. Consider temperature conversion between degrees Celsius and Kelvin:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">ops</span><span class="p">::</span><span class="n">Add</span><span class="p">;</span> <span class="c1">// needed for the Add impl below</span>

<span class="k">struct</span> <span class="nf">Kelvin</span><span class="p">(</span><span class="nb">f64</span><span class="p">);</span>
<span class="k">struct</span> <span class="nf">Degrees</span><span class="p">(</span><span class="nb">f64</span><span class="p">);</span>

<span class="k">impl</span> <span class="nb">From</span><span class="o">&lt;</span><span class="n">Degrees</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Kelvin</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">from</span><span class="p">(</span><span class="n">d</span><span class="p">:</span> <span class="n">Degrees</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span> <span class="p">{</span>
        <span class="nf">Kelvin</span><span class="p">(</span><span class="n">d</span><span class="na">.0</span> <span class="o">+</span> <span class="mf">273.15</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="nb">From</span><span class="o">&lt;</span><span class="n">Kelvin</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Degrees</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">from</span><span class="p">(</span><span class="n">k</span><span class="p">:</span> <span class="n">Kelvin</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span> <span class="p">{</span>
        <span class="nf">Degrees</span><span class="p">(</span><span class="n">k</span><span class="na">.0</span> <span class="o">-</span> <span class="mf">273.15</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="nb">Add</span><span class="o">&lt;</span><span class="n">Degrees</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">Kelvin</span> <span class="p">{</span>
    <span class="k">type</span> <span class="n">Output</span> <span class="o">=</span> <span class="n">Kelvin</span><span class="p">;</span>

    <span class="k">fn</span> <span class="nf">add</span><span class="p">(</span><span class="k">self</span><span class="p">,</span> <span class="n">d</span><span class="p">:</span> <span class="n">Degrees</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Kelvin</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">d_in_k</span> <span class="o">=</span> <span class="nn">Kelvin</span><span class="p">::</span><span class="nf">from</span><span class="p">(</span><span class="n">d</span><span class="p">);</span>
        <span class="nf">Kelvin</span><span class="p">(</span><span class="k">self</span><span class="na">.0</span> <span class="o">+</span> <span class="n">d_in_k</span><span class="na">.0</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div> <p>Now we can write something like:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">update_temperature</span><span class="p">(</span><span class="n">k</span><span class="p">:</span> <span class="n">Kelvin</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">Kelvin</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">delta</span><span class="p">:</span> <span class="n">Degrees</span> <span class="o">=</span> <span class="nf">some_sensor_reading</span><span class="p">();</span>
    <span class="n">k</span> <span class="o">+</span> <span class="n">delta</span>
<span class="p">}</span>
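
<span class="c1">// Unit mistakes are now caught at compile time, e.g.:</span>
<span class="c1">// update_temperature(300.0);          // error[E0308]: mismatched types</span>
<span class="c1">// update_temperature(Degrees(26.85)); // error[E0308]: mismatched types</span>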
</code></pre></div></div> <p>The domain logic is now compiler-enforced. You can’t accidentally add two Kelvin values together (no such impl exists), and you can’t mix raw floats into the arithmetic. The code becomes safer and easier to reason about.</p> <h3 id="when-the-cost-shows-up">When the Cost Shows Up</h3> <p>But this type safety can come with a price. Imagine a loop doing conversions repeatedly:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="k">mut</span> <span class="n">d</span> <span class="o">=</span> <span class="nf">Degrees</span><span class="p">(</span><span class="mf">10.0</span><span class="p">);</span>

<span class="k">for</span> <span class="n">_</span> <span class="k">in</span> <span class="mi">0</span><span class="o">..</span><span class="mi">100_000</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">k</span><span class="p">:</span> <span class="n">Kelvin</span> <span class="o">=</span> <span class="n">d</span><span class="nf">.into</span><span class="p">();</span>
    <span class="n">d</span> <span class="o">=</span> <span class="n">k</span><span class="nf">.into</span><span class="p">();</span>
<span class="p">}</span>
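
<span class="c1">// 100_000 iterations pay for 200_000 conversions. Harmless when each is</span>
<span class="c1">// a single addition, costly if a From impl hides an allocation or a matrix op.</span>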
</code></pre></div></div> <p>The conversions themselves are trivial in this case, but this pattern could appear in more complex forms. In domains like graphics, robotics, or signal processing, conversions might involve expensive matrix operations, non-linear transforms, or interpolations. If those happen repeatedly inside a tight loop, performance can degrade without obvious symptoms—especially if each type’s logic is abstracted behind constructors or <code class="language-plaintext highlighter-rouge">From</code> traits.</p> <p>Types help you write correct code. But they can also hide inefficient behavior if you’re not careful.</p> <h3 id="when-types-multiply">When Types Multiply</h3> <p>There’s a subtle but dangerous pattern I’ve encountered in real systems — a kind of <em>diamond-shaped overload problem</em>, reminiscent of the classic multiple inheritance issue.</p> <p>It usually starts with good intentions: someone introduces a new type because it makes certain operations faster, cleaner, or more expressive. For example, a researcher working on large-scale graph computations might use <strong>sparse matrices</strong> — they’re ideal for representing structures like adjacency graphs or document-term matrices, where most entries are zero. Later, another part of the codebase needs to feed that data into a deep learning model — which requires <strong>dense tensors</strong>, especially if it’s running on a GPU where sparse operations aren’t well supported.</p> <p>So now, both representations coexist in the code. Each serves a specific set of algorithms well: the sparse matrix unlocks efficient traversal and storage for linear solvers; the dense tensor enables convolutional layers and matrix multiplications to run efficiently on hardware accelerators.</p> <p>But then someone writes a tool that calls into both modules — maybe a loop that runs a preprocessing step optimized for sparse data, followed by a model pass that expects dense input. 
Suddenly, the data is <strong>silently bouncing back and forth between formats</strong>. What started as two justifiable abstractions begins to trigger a chain of <strong>invisible and expensive conversions</strong>.</p> <p>A common workaround is to <strong>cache both formats</strong>, converting once and storing the result. But this creates another problem: what if one version is updated? For instance, the dense tensor is normalized or augmented after a model pass — but now the sparse matrix is stale. Keeping both in sync becomes tricky, especially if multiple functions touch different representations in different orders. You can’t just “store both” unless you also <strong>coordinate updates and ownership</strong> — which is a major source of bugs and performance regressions in real-world systems.</p> <p>Even worse, these issues rarely show up as compiler errors. They creep in as <em>slowdowns</em>, inconsistent results, or strange numerical mismatches — and by the time you notice them, the conversion logic is buried under layers of calls.</p> <p>One might argue this whole problem could be avoided with a solid roadmap and better planning. And in principle, yes — if the software had clearly defined data flow boundaries and strict architectural guidelines, these type collisions wouldn’t occur.</p> <p>But in practice, <strong>scientific software is rarely built this way</strong>.</p> <p>Much of the work is exploratory by nature: researchers try out new algorithms, adapt to new data, or prototype pipelines under time pressure. Dozens of contributors might touch the same codebase — each with different goals. One person adds a faster solver. Another integrates a new model. A third tweaks the data loader to support a new format. 
No one sets out to break the abstraction — but slowly, unintentionally, the code becomes a patchwork of overlapping types and assumptions.</p> <p>That’s when the silent costs appear.</p> <h3 id="in-the-end">In the End</h3> <p>Using expressive, domain-specific types helps encode meaning, prevent bugs, and clarify intent. They serve as a form of documentation the compiler can enforce. And in most applications, the cost is negligible compared to the clarity and safety they offer.</p> <p>Still, it’s important to be realistic about what they solve. Rich types reduce misuse and ambiguity, but they won’t prevent every issue: incorrect assumptions, stale representations, or inefficient structure can all survive the type checker. Just as telling an LLM to write “without fluff” won’t make it factually accurate, adding types won’t guarantee deeper correctness in your system. But both are simple, useful tools that help avoid common pitfalls—and that’s often more than enough to justify them.</p>]]></content><author><name>Daniele Bonatto</name></author><category term="productivity,"/><category term="software-development,"/><category term="research"/><category term="programming,"/><category term="scientific-software,"/><category term="rust,"/><category term="types,"/><category term="productivity"/><summary type="html"><![CDATA[How rich, domain-specific types in scientific software prevent subtle bugs and clarify logic, but can sometimes introduce hidden inefficiencies.]]></summary></entry><entry><title type="html">A Tribute to Ramanujan’s Genius</title><link href="https://dbonattoj.github.io/blog/2025/Ramanujan/" rel="alternate" type="text/html" title="A Tribute to Ramanujan’s Genius"/><published>2025-01-05T03:30:23+01:00</published><updated>2025-01-05T03:30:23+01:00</updated><id>https://dbonattoj.github.io/blog/2025/Ramanujan</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2025/Ramanujan/"><![CDATA[<p><strong>Srinivasa Ramanujan (1887–1920) is one of the 
greatest mathematicians in history.</strong></p> <p>Ramanujan’s life and achievements are nothing short of extraordinary. Born in modest circumstances in Erode, India, he rose to global acclaim through his unmatched mathematical brilliance. His insights, many of which came to him intuitively, have profoundly influenced number theory, infinite series, and continued fractions, among other fields. Sadly, he died far too young, at only 32.</p> <p>A year ago, I came across this <a href="https://creativityjournals.com/the-mathematical-creativity-of-ramanujan-frs-1887-1920/">detailed biography of Ramanujan</a> and felt compelled to share it; I am finally doing so now. The article beautifully captures the essence of his life and contributions. For those unfamiliar with his work or story, this is an excellent starting point.</p> <p>One particularly fascinating episode from the article describes Ramanujan’s first contact with the Cambridge mathematicians G. H. Hardy and J. E. Littlewood: pages filled with wild and fantastic theorems, all without proofs!</p> <p>Reading about Ramanujan’s life is both inspiring and humbling, a reminder of the power of relentless perseverance. From his early struggles to his remarkable collaboration at Cambridge, Ramanujan’s journey is a testament to the enduring impact of intellectual curiosity and creativity.</p> <p>To ensure I never lose access to this treasure, I’ve saved a <a href="/assets/pdf/The-mathematical-creativity-of-Ramanujan-1-1-1.pdf">PDF</a> of the article. If you’re as fascinated by Ramanujan’s genius as I am, take a moment to explore this tribute to his legacy. 
You will not regret it.</p>]]></content><author><name>Daniele Bonatto</name></author><category term="mathematics"/><category term="readings"/><summary type="html"><![CDATA[A tribute to Srinivasa Ramanujan, one of history's greatest mathematicians.]]></summary></entry><entry><title type="html">Rust Workspaces</title><link href="https://dbonattoj.github.io/blog/2025/rust-workspaces/" rel="alternate" type="text/html" title="Rust Workspaces"/><published>2025-01-02T04:05:00+01:00</published><updated>2025-01-02T04:05:00+01:00</updated><id>https://dbonattoj.github.io/blog/2025/rust-workspaces</id><content type="html" xml:base="https://dbonattoj.github.io/blog/2025/rust-workspaces/"><![CDATA[<p>While working on a Rust project involving multiple libraries and applications, I explored Rust workspaces to manage interconnected components. I found the process somewhat unintuitive and poorly documented, especially when it came to importing libraries into applications. To streamline future efforts, I’ve summarized the steps I followed here.</p> <h3 id="why-use-workspaces">Why Use Workspaces?</h3> <p>For small projects, a single binary or library is sufficient, created with a simple command like:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo new app <span class="nt">--bin</span>  
</code></pre></div></div> <p>However, for larger projects involving multiple libraries and applications that need to interact with each other, Rust workspaces provide a convenient way to organize the structure.</p> <h3 id="workspace-structure">Workspace Structure</h3> <p>Here’s an example structure I used for my project:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.  
├── Cargo.lock  
├── Cargo.toml  
├── app/  
│   ├── app1/  
│   │   ├── Cargo.toml  
│   │   └── src/  
│   │       └── main.rs  
│   └── app2/  
│       ├── Cargo.toml  
│       └── src/  
│           └── main.rs  
├── lib/  
│   ├── lib1/  
│   │   ├── Cargo.toml  
│   │   └── src/  
│   │       └── lib.rs  
│   └── lib2/  
│       ├── Cargo.toml  
│       └── src/  
│           └── lib.rs  
└── tests/  
    ├── some-integration-tests.rs  
    └── multi-file-test/  
        ├── main.rs  
        └── test_module.rs  
</code></pre></div></div> <h3 id="setting-up-a-workspace">Setting Up a Workspace</h3> <ol> <li> <p><strong>Create the Workspace Root</strong></p> <p>Start with an empty directory containing a <code class="language-plaintext highlighter-rouge">Cargo.toml</code> file:</p> <div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">[</span><span class="n">workspace</span><span class="k">]</span>  
<span class="n">members</span> <span class="o">=</span><span class="w"> </span><span class="p">[]</span>  
</code></pre></div> </div> <p>It would be nice to have a <code class="language-plaintext highlighter-rouge">cargo new --workspace workspaceName</code> command, but apparently, it is <a href="https://github.com/rust-lang/cargo/issues/8365">not implemented</a>.</p> </li> <li> <p><strong>Create Applications and Libraries</strong></p> <p>In the same folder, run the following commands to create your applications and libraries:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo new app/app1 <span class="nt">--vcs</span> none <span class="nt">--bin</span>  
cargo new app/app2 <span class="nt">--vcs</span> none <span class="nt">--bin</span>  
cargo new lib/lib1 <span class="nt">--vcs</span> none <span class="nt">--lib</span>  
cargo new lib/lib2 <span class="nt">--vcs</span> none <span class="nt">--lib</span>  
</code></pre></div> </div> <p>The <code class="language-plaintext highlighter-rouge">--vcs none</code> flag prevents each subproject from initializing its own Git repository.</p> <p>These commands will automatically update the <code class="language-plaintext highlighter-rouge">members</code> section in the workspace’s <code class="language-plaintext highlighter-rouge">./Cargo.toml</code> file, located in the root directory. The resulting file should look like this:</p> <div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">[</span><span class="n">workspace</span><span class="k">]</span>  
<span class="n">members</span> <span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="s">"app/app1"</span><span class="p">,</span> <span class="s">"app/app2"</span><span class="p">,</span> <span class="s">"lib/lib1"</span><span class="p">,</span> <span class="s">"lib/lib2"</span><span class="p">]</span>  
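
# Globs also work and avoid editing this list for every new member:
# members = ["app/*", "lib/*"]
# For 2021-edition crates, also set the resolver explicitly: resolver = "2"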
</code></pre></div> </div> </li> <li> <p><strong>Adding Dependencies in the Workspace</strong></p> <p>To make library dependencies available to all applications, add them to the workspace’s <code class="language-plaintext highlighter-rouge">./Cargo.toml</code>:</p> <div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">[</span><span class="n">workspace</span><span class="k">.</span><span class="n">dependencies</span><span class="k">]</span>
<span class="n">lib1</span> <span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s">"lib/lib1"</span><span class="w"> </span><span class="p">}</span>
<span class="n">lib2</span> <span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">path</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s">"lib/lib2"</span><span class="w"> </span><span class="p">}</span>
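
# Shared external crates can be centralized here too, so every member
# uses the same version (the crate below is only an illustration):
# serde = { version = "1.0", features = ["derive"] }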
</code></pre></div> </div> </li> </ol> <h3 id="adding-dependencies">Adding Dependencies</h3> <p>To use <code class="language-plaintext highlighter-rouge">lib1</code> in <code class="language-plaintext highlighter-rouge">app1</code>, add it as a dependency in <code class="language-plaintext highlighter-rouge">./app/app1/Cargo.toml</code>:</p> <div class="language-toml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">[</span><span class="n">dependencies</span><span class="k">]</span>
<span class="n">lib1</span> <span class="o">=</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="n">workspace</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">}</span>
</code></pre></div></div> <p>Then, in <code class="language-plaintext highlighter-rouge">./app/app1/src/main.rs</code>, you can reference it like this:</p> <div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="n">lib1</span><span class="p">;</span>  

<span class="k">fn</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>  
    <span class="nd">println!</span><span class="p">(</span><span class="s">"Hello, world!"</span><span class="p">);</span>  
    <span class="nd">println!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="nn">lib1</span><span class="p">::</span><span class="nf">add</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">));</span>  
<span class="p">}</span>  
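
<span class="c1">// This assumes lib1 exports an `add` function; recent Cargo versions</span>
<span class="c1">// generate one in the default --lib template:</span>
<span class="c1">//   pub fn add(left: usize, right: usize) -&gt; usize { left + right }</span>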
</code></pre></div></div> <h3 id="building-and-running">Building and Running</h3> <p>To build the entire workspace, use:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo build  
</code></pre></div></div> <p>To execute a specific application (e.g., <code class="language-plaintext highlighter-rouge">app1</code>):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cargo run <span class="nt">--bin</span> app1  
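
# Or target one member, or test all members at once:
cargo build -p app1
cargo test --workspace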
</code></pre></div></div> <p>By following this structure, you can effectively manage complex Rust projects while maintaining a clear organization of your codebase.</p>]]></content><author><name>Daniele Bonatto</name></author><category term="programming"/><category term="rust"/><summary type="html"><![CDATA[An exploration of organizing Rust projects using workspaces, with a practical guide for structuring and managing dependencies.]]></summary></entry></feed>