The Quixotic Quest of Interpretability

In the ongoing effort to understand artificial intelligence, few pursuits have taken on such a romantic air as the quest for interpretability. Researchers enter the dense interior of deep neural networks with the conviction that meaning can be found somewhere inside, provided one decomposes the system far enough and with enough ingenuity. They peer into neurons and circuits, inventing ever more refined tools to make sense of the hidden layers. The dream is that one day the black box will yield its secrets. It is hard not to see a touch of Don Quixote in this endeavor, charging at vast assemblies of weight matrices as if they were giants.

The project has genuine accomplishments. Some researchers have identified circuits that detect edges or shapes or fragments of familiar objects in image classifiers. Others have reverse-engineered tiny networks trained on modular arithmetic, showing how the weights encode a surprisingly intricate algorithm built from trigonometric identities. These discoveries suggest that interpretability can produce real scientific insight. And the moral urgency is obvious. If we are to rely on neural networks in areas like medicine, finance, or criminal justice, then surely we must understand why they make the decisions they do. Few would willingly entrust a parole hearing or a cancer diagnosis to a system whose reasoning remains entirely opaque.

Yet at the very heart of this work lies a stubborn problem. Neurons do not behave like neat containers of meaning. They are polysemantic, responding to multiple and often unrelated features. A single unit might activate for a citation, a snippet of Korean, and an HTTP request all at once. Faced with this confusion, researchers try to disentangle features using sparse autoencoders or clever probing techniques. They multiply their methods in the hope of isolating simpler building blocks. Each method produces a measure of clarity, but the overall picture remains strangely elusive. The more one tries to decompose, the more the giant seems to dissolve into mist.
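To make the disentangling step concrete, here is a minimal sketch of a sparse autoencoder of the kind used for this purpose, written in PyTorch. The dimensions, names, and loss coefficient are illustrative assumptions, not drawn from any particular published implementation; the essential idea is an overcomplete dictionary trained to reconstruct activations while an L1 penalty keeps only a few features active at a time.

```python
# Minimal sparse autoencoder sketch (PyTorch). Dimensions and names are
# illustrative, not taken from any specific published system.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Overcomplete dictionary: d_hidden >> d_model, so each learned
        # feature can claim its own direction even when neurons do not.
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty
        # below pushes most of them to zero for any given input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction term: the dictionary must still explain the activations.
    mse = (reconstruction - x).pow(2).mean()
    # Sparsity term: prefer explanations that use few features at once.
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Toy usage: decompose a batch of hidden-layer activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)  # stand-in for real transformer activations
recon, feats = sae(acts)
loss = sae_loss(acts, recon, feats)
loss.backward()
```

The hope behind this design is that each decoder column becomes a monosemantic direction even when individual neurons are not. Whether the recovered features are genuine building blocks or artifacts of the method is precisely the open question the essay describes.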

Here philosophy can offer a different perspective. Heidegger argued that meaning does not reside in isolated parts of a system, whether neurons or gears or atoms, but arises in the disclosure of a world. Things become intelligible not through the intrinsic properties of their components but through the way they show up within a horizon of use and significance. A hammer is not defined by its molecules but by its role in a world of building and repairing.

If we take this seriously, then the interpretability project is hunting in the wrong place. Transformers do not contain meaning in their hidden layers. They produce meaning only in their output, when the text they generate forms a simulacrum of worldhood. It is in the flow of language that words begin to hang together in ways that disclose possibilities. What we find inside the network are correlations, superpositions, and circuits that implement statistical tricks. These are the ontic mechanisms. The ontological phenomenon of meaning arises only when the network outputs coherent text that can be received as part of a world.
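The superposition invoked here admits a simple geometric illustration. The NumPy sketch below (all numbers are illustrative assumptions, and no trained model is involved) shows that far more candidate feature directions than dimensions can coexist with only mild interference, which is why single neurons, read as basis directions, come out polysemantic.

```python
# Toy illustration of superposition: many more "feature" directions than
# dimensions can coexist with modest interference. Purely geometric sketch.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 2000  # 2000 feature directions packed into 128 dimensions

# Random unit vectors in R^d serve as stand-ins for feature directions.
features = rng.normal(size=(n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Pairwise cosine similarities measure interference between features.
cos = features @ features.T
np.fill_diagonal(cos, 0.0)

print(f"max |cos| between distinct features: {np.abs(cos).max():.3f}")
print(f"mean |cos|: {np.abs(cos).mean():.3f}")
# Typically the maximum is well below 1 and the mean is under 0.1:
# the directions are nowhere near axis-aligned, yet mostly
# distinguishable, so any single basis neuron mixes many features.
```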

This does not make interpretability useless. It can still improve safety by identifying circuits that correlate with deception or bias. It can serve science by extracting the patterns behind a model's predictions, as researchers have attempted with AlphaFold's grasp of protein folding. It can help engineers by providing tools for debugging and refining large models. But if the hope is to find atoms of meaning buried in the parameters, then that hope is quixotic. The noble achievements of interpretability will likely be pragmatic and provisional rather than metaphysical.

Perhaps this is the real lesson of Don Quixote. His quest was not without value, even if he mistook windmills for giants. Interpretability research may likewise prove invaluable, even if its central dream is misguided. Meaning will not be uncovered by cataloguing neurons. It will only appear in the larger phenomenon of simulated worldhood. If we wish to understand how these systems speak to us, then we must look not into the gears of the mechanism but at the way their words disclose possibilities. The search for meaning in the parts may never end, but the meaning itself is already before us, written in the text they generate.