KV Cache Compression Is The Symptom, Not The Disease
KV-cache compression has become one of the central practical problems in language-model inference, which is not surprising once you leave toy demos behind. Once context length moves into the regime people actually want to use, the cache stops being an implementation detail and becomes the dominant systems object. It sets the memory bill, constrains throughput, and increasingly dictates which architectural ideas even look feasible.
So a lot of current work has converged on the same surface objective — quantize the cache, evict it more intelligently, summarize it, sparsify it, predict which entries matter and skip the rest. Google’s recent TurboQuant is a good example of the impulse. If KV storage is the active bottleneck, hammering down its precision buys real capacity, throughput, and deployment headroom. That work matters; it is not cosmetic.
Still, all of that work sits on top of a more basic question that I find more interesting: why is the object we are caching so large to begin with?
The standard answer is that attention needs keys and values for a long history, and those vectors live in a big continuous space. That is true, but it does not really explain the scale of the asymmetry. In my own runs, each query head generally has between 4 and 20 active supports, which is the part of the story that ought to make you squirm — we preserve enormous per-layer continuous state so that a single head can consult a couple handfuls of positions. Operationally you respond with better compression or sparsity; intellectually, the asymmetry ought to force a different question. If the active support is usually that small, why is the cached object itself so bloated? What are we warehousing per token, per layer, that makes extreme compression both necessary and, increasingly, possible?
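To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, fp16, a 32,768-token context) is a hypothetical configuration chosen only for illustration; the only figure taken from the text above is the 4-to-20 support range.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per position, layer, and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def consulted_fraction(active_support, seq_len):
    """Fraction of cached positions a single query head actually reads."""
    return active_support / seq_len


# Hypothetical configuration, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
SEQ_LEN, BYTES_FP16 = 32_768, 2

total = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN, BYTES_FP16)
per_token = total // SEQ_LEN

print(f"cache per sequence: {total / 2**30:.1f} GiB")
print(f"cache per token:    {per_token / 2**10:.0f} KiB")
for support in (4, 20):
    frac = consulted_fraction(support, SEQ_LEN)
    print(f"support {support:>2}: head reads {frac:.4%} of cached positions")
```

Under these assumed numbers the cache costs 4 GiB per sequence, or 128 KiB per token, while a head with 20 active supports touches well under a tenth of a percent of the cached positions.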
My answer is that the KV cache is expensive because it is not storing a clean memory object — it is storing snapshots of an overloaded residual stream.
That is a stronger claim than the usual “memory of the past” story, so I want to say it plainly. The residual stream in a modern transformer is not merely shuttling token semantics forward. It is also carrying partial context inference, intralayer communication signals, routing residue, intermediate scaffolding, output-aligned corrections, and the accumulated detritus of earlier writes, and we cache those layerwise states as though they were well-factored units of memory. They are not; they are multiplexed state containers.
If you adopt that frame, a bunch of observations that usually get discussed separately start to collapse into one account. KV-cache compression stops looking like an isolated systems patch and starts looking like downstream damage control for a representational problem. Norm growth starts to look like compensation, late-layer amplification starts to look like forcing the representation back into something the LM head can read, and even optimizer choices start to look different — AdamW as signal recovery on top of a bad transport layer, not just the boring default everyone reaches for out of habit.
None of this is an argument against KV compression. Compression work is useful precisely because the present architecture is so profligate with state. The claim I am making is narrower: the memory problem is downstream of a deeper architectural decision. We ask one additive state container to do too many jobs, then cache the result as if it were an irreducible semantic object. The first step toward clarity is to be more honest about what the cache is actually storing.
What The KV Cache Is Actually Storing
At the level of implementation, the answer sounds simple. For each layer and each prior position, we cache keys and values so future queries do not recompute them. Phrased that way, the KV cache sounds like a straightforward memory of the past — but that description only holds if you decline to look at what is actually inside the tensors. A key and a value are not primitive memories; they are projections of the layer state:
$$ k_{\ell,t} = W^K_{\ell} h_{\ell,t}, \qquad v_{\ell,t} = W^V_{\ell} h_{\ell,t}. $$
So the cache inherits whatever structure, ambiguity, and contamination already lives inside $h_{\ell,t}$. If the residual at layer $\ell$ were a clean representation of token semantics at position $t$, the story would be relatively benign: each layer would expose different queryable views of a well-factored object. That is the innocent version of the narrative, and it is not the version I believe.
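Because the projection is linear, anything superposed in the layer state is superposed in the cached key as well. The sketch below shows this with a toy 2x4 matrix standing in for $W^K_{\ell}$. The split of the state into a "token" part and a "scaffold" part is a hypothetical decomposition for illustration, and the sketch ignores the pre-attention normalization and positional encoding that many real implementations apply before caching.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Elementwise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


# Toy key projection: 2-dimensional keys from a 4-dimensional residual.
W_K = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, -0.5],
]

# Hypothetical decomposition of one layer state into two roles.
token_part = [1.0, 2.0, 0.0, 0.0]     # "clean" token content
scaffold_part = [0.0, 0.0, 4.0, 2.0]  # routing / scaffolding residue
h = add(token_part, scaffold_part)

k_cached = matvec(W_K, h)
k_token_only = matvec(W_K, token_part)
k_scaffold_only = matvec(W_K, scaffold_part)

# Linearity: the cached key is exactly the sum of the projected parts,
# so the scaffold contribution is stored whether or not any query wants it.
assert k_cached == add(k_token_only, k_scaffold_only)
print("cached key:      ", k_cached)
print("token-only key:  ", k_token_only)
print("scaffold leakage:", k_scaffold_only)
```

The cache has no way to store the token-only key on its own; it only ever sees the sum.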
In a residual transformer, the state at depth $\ell$ is not a pristine token representation. More importantly, it does not remain in any strong sense an embedding-space object for very long. The input embedding supplies an initial condition, but the useful linearity rapidly shifts to the residual-stream readout through the LM head. After the first few layers, hidden states are better described by how they project through the output map than by how they entered through the token embedding table. The null space shifts with it. Almost immediately, the model is no longer operating in the null space of the embedding geometry; it is operating in what is better described as the LM-head or middle-thought universe, where directions matter according to their downstream readout or their utility to later computation.
Formally, the state evolves as a cumulative series of writes:
$$ h_{\ell,t} \approx e_t + \sum_{j < \ell} \Delta_{j,t}. $$
That sum is already doing too much work. Some terms are carrying lexical or semantic content forward. Some are resolving underspecification the tokenizer left behind. Some are partial context inferences. Some are coordination signals meant to be useful two or six layers later. Some are output-aligned pushes that matter for the readout. Some are cleanup terms that exist only because earlier writes polluted the state. And some are, frankly, just the residue of having transacted too many incompatible computations on one additive container.
This is why I do not like describing the KV cache as a memory of tokens; it is more accurate to call it a memory of rewritten states, and that wording is not pedantic — a rewritten state can carry information in at least two very different geometric modes. It can carry directions that are immediately legible to the current readout, and it can carry directions that are comparatively silent under that readout but still legible to downstream layers. This is the first place where null space becomes more than a metaphor. Once the embedding-transition has occurred, the relevant null space is not best understood relative to the input embedding table. It is relative to the output-side readout and to the later computations that continue to write into the residual stream. If a component of the state has little immediate effect on the current logits, it has not thereby become irrelevant. It may still be functioning as deferred computation, side-channel state, or a coordination signal for later blocks. In other words, the cache is preserving not just what the model is saying to the LM head, but also what the stack is saying to itself.
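A toy version of the two geometric modes, as a minimal sketch: one matrix stands in for the current readout, another for a later block's read. Both matrices and both vectors are invented for illustration; real readouts do not have axis-aligned null spaces.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Toy readout (stand-in for the LM head): sees only the first two coordinates.
W_READOUT = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
# Toy later-layer projection: reads only the third coordinate.
W_LATER = [
    [0.0, 0.0, 1.0],
]

visible = [0.3, -0.7, 0.0]  # legible to the current readout
silent = [0.0, 0.0, 5.0]    # lies in the null space of W_READOUT
h = [a + b for a, b in zip(visible, silent)]

logits_with = matvec(W_READOUT, h)
logits_without = matvec(W_READOUT, visible)
later_signal = matvec(W_LATER, h)

# The silent component changes nothing the readout can see...
assert logits_with == logits_without
# ...but a later projection reads it at full strength.
assert later_signal == [5.0]
```

The silent component is the largest thing in the state by norm, yet the readout cannot tell it is there. That is the sense in which the null space can act as working storage.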
That is one reason the support-size asymmetry is so revealing. A given query head may generally draw on under 20 prior positions, but the positions it is reading are not simple token records. They are layer-specific mixtures of token identity, inferred context, positional structure, latent routing cues, and whatever other features survived earlier additions. So the apparent contradiction is resolved in the worst possible way: the model reads from only a few locations, but each of those locations contains far more mixed state than a sane architecture would want to preserve.
This also clarifies why compression papers can keep finding headroom. If the cached object were already close to an irreducible semantic record, the room for compression would be modest and fragile. Instead, the object is highly structured, highly redundant, and full of quantities whose importance is conditional. Some components matter to the current query. Some matter only to specific later layers. Some live mostly in the null space of the current readout and become salient only after further transformation. Some exist because the model never had a separate channel in which to store them cleanly.
When I say the KV cache is bloated, I do not only mean the tensors are big. I mean the architecture has failed to factor what it preserves. Token identity, context repair, layer-to-layer communication, and output preparation are braided together before the cache is ever written, and the rest of the argument follows from that. If the cached object is already a multiplexed residual state, the next question is not really about compression. It is why the residual stream became such an overloaded bus in the first place.
The Residual Stream Is Pure Addition
The answer starts with a fact about transformer mechanics that is so familiar people stop hearing it: the residual stream is additive. Residual addition gets sold as an optimization convenience or as the generic trick that makes deep nets train, and both are true — but in a transformer, additive composition is not just convenient. It is the central constraint everything else has to live inside.
To first order, the state at depth $\ell$ is obtained by repeatedly writing additive deltas into a shared container:
$$ h_{\ell+1,t} = h_{\ell,t} + \Delta^{\text{attn}}_{\ell,t} + \Delta^{\text{mlp}}_{\ell,t}. $$
Unrolled over the stack, this yields the familiar decomposition:
$$ h_{\text{final},t} = h_{0,t} + \sum_{\ell} \Delta^{\text{attn}}_{\ell,t} + \sum_{\ell} \Delta^{\text{mlp}}_{\ell,t}. $$
There is a consequence here that is easy to miss if you only think about expressivity. Addition is cheap, but it is not selective. A residual write does not carve out a protected subregion, isolate a channel, or mark lexical state versus routing state, and it does not keep provenance; it just accumulates, which means every block writes into the same settlement layer.
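The loss of provenance is easy to demonstrate in a few lines. In this minimal sketch, two invented write histories (hypothetical stand-ins for different attention and MLP contributions) produce exactly the same final state, so nothing downstream can tell them apart.

```python
def residual_forward(h0, writes):
    """Apply a sequence of additive writes to an initial state."""
    h = list(h0)
    for delta in writes:
        h = [hi + di for hi, di in zip(h, delta)]
    return h


h0 = [1.0, 0.0, 0.0]

# Two hypothetical histories: different writes, made by different "blocks".
history_a = [[0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
history_b = [[0.0, 0.5, 3.0], [0.0, 1.5, 0.0]]

final_a = residual_forward(h0, history_a)
final_b = residual_forward(h0, history_b)

# A later layer sees only the sum, so the two histories are indistinguishable.
assert final_a == final_b == [1.0, 2.0, 3.0]
print("final state:", final_a)
```

Any attribution method that works from the final state alone has to choose between these histories without evidence.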
If the architecture had separately factored token transport, deferred computation, context features, and layer-to-layer coordination, then addition would be relatively harmless. Each write would enter a channel whose semantics were already constrained. But that is not what the modern residual stream is asked to do. The same state container must carry lexical content, the repair work forced by the crimes of the tokenizer, partial context inference, attention-routing side information, and later output preparation. The additive mechanism does not distinguish among them; it merely superposes them.
This is why I keep using the word multiplexed. The residual stream is not overloaded in some vague rhetorical sense. It is overloaded in the literal systems sense that several distinct signal classes are being transported through one shared medium without a principled factorization scheme.
There is a further inference here that I think is stronger than it may initially sound. Attention and MLP writes do not appear to have the same signature on the residual stream. The traces are not consistent with two interchangeable sources of additive update. Attention seems to behave more like selective routing and retrieval: low-support, query-conditioned, often structurally local or sparsely long-range, and relatively constrained by where the head can read. MLP writes, by contrast, look denser, higher-energy, and more directly involved in geometric recoding of the state. The extreme early and late magnitudes show up in MLP contributions, not in some symmetric attention/MLP pattern, and in my intervention work MLP edits are consistently more damaging than attention edits. The natural reading is that attention mostly decides what to bring into play, while the MLP is more often the mechanism that writes heavily into residual space, repairs distortions, and pushes the state into output-legible directions.
I do not mean this as a universal theorem about every model family and every depth. I mean it as the strongest explanation of the evidence I have. If that explanation is right, then the residual stream is not merely a sum of many writes. It is a sum of qualitatively different write classes, with attention acting more like sparse selection and transport, and MLPs acting more like high-energy rewriting operators. Once both are forced into the same additive container, the geometry becomes even less separable than the standard textbook picture suggests.
Once you accept that, two corollaries are hard to dodge. Contamination is the default. If a layer writes something useful for a later block but quiet under the present readout, that write still occupies geometric real estate; cleanup terms are just more writes into the same pot; features living in the null space of the current readout are still there for every later projection — nothing is really gone, it is overlaid. Provenance is also lost almost immediately, because a later layer sees the sum of prior decisions, not an annotated list. That is great for the forward pass and lousy for clean credit-assignment.
At that point a lot of “mysterious” transformer behavior starts to look like bookkeeping. Early mistakes stick around in weird ways because they are not deleted, only buried under imperfectly aligned additions. Late layers look like they are doing cleanup as often as reasoning because correction has to be yet another write on the same bus. There is headroom for aggressive cache compression because the cached object is not a minimal semantic record — it is an additive superposition whose pieces matter conditionally on layer, head, and query. The geometry goes sideways because one state has to hold both what matters for the current readout and latent scaffolding for later writes; the null space is not a harmless leftover, it becomes working storage.
If the stream were merely additive but every write were small and aligned, the design would still be inelegant but maybe tolerable. In practice the model has to make the shared container do real work across depth, which is where norm growth shows up — the accumulation is not passive, later layers need leverage over an increasingly burdened state, and the cheapest lever in an additive system is often to write harder. So the interesting question is not whether norm growth exists, but why you should expect it.
Norm Growth Is Not A Bug, It Is Compensation
Once you treat the residual stream as a shared additive bus, norm growth stops being a curiosity and starts looking like the default for a system that cannot cleanly separate state but still has to carry useful influence down the stack.
The naive picture is that each layer adds a modest refinement and everything composes gracefully — if that were true, hidden norms might wobble but would not structurally drift upward, and later layers would not need to shout to be heard. That is not the regime we are in. Later layers inherit a state already burdened by lexical transport, context repair, routing residue, scaffolding, and partial output shaping; some of that mass is useful, some is latent, some is noise from the vantage of the present block, and all of it is still there because the stream is additive and nobody gets a clean slate — only the running total.
Another way to picture this is that the stack is building what I like to refer to as a polytope tower. Each layer carves a new geometric object out of the state inherited from below, but the result is not stored in a fresh typed channel. It is written back into the same residual medium that must continue serving transport, coordination, and readout preparation. So the tower does not rise in a clean latent substrate. It is continuously folded back into the same overloaded buffer from which the next layer must work. And because the attention mechanism has to remain trainable, those polytopal selections are not left with fully jagged combinatorial edges. Softmax rounds them just enough for signal to pass upstream. That rounding is useful for optimization, but it also means that the architecture pays for differentiability with additional mixing.
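The cost of that rounding can be seen in a small sketch that compares a hard one-hot selection with softmax over the same scores. The scores and temperatures are made-up values for one hypothetical query, not measurements.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax with an optional temperature; lower temperature is closer to argmax."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_select(scores):
    """One-hot argmax: a jagged selection that passes no gradient to the losers."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


scores = [4.0, 2.0, 1.0, -1.0]  # hypothetical attention scores for one query

hard = hard_select(scores)
soft = softmax(scores)
leaked = 1.0 - max(soft)  # weight that lands on non-selected positions

print("hard:", hard)
print("soft:", [round(w, 3) for w in soft])
print(f"weight mixed in from other positions: {leaked:.3f}")
```

The hard selection reads one position cleanly but gives the optimizer nothing to adjust. The soft one is trainable, and in exchange it blends a fraction of every other position into the write.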
Under those conditions, leverage is the real problem. Suppose a late block needs to move the state in a direction that is strongly legible to the LM head. It is not writing into an empty vector. It is writing into the accumulated sum of everything the model has already chosen not to separate. If its write is too small, it is simply one more term in the pile. If its write is well aligned but weak, it may remain trapped among earlier components that occupy adjacent directions or dilute its effect after normalization. If the model has learned to store deferred computation or side-channel state in the null space of the present readout, that state does not disappear merely because the current block would prefer a cleaner trajectory.
So the architecture rewards force. In an additive system, one of the simplest ways to recover influence over a burdened state is to increase the norm of later writes, especially along output-relevant directions. You do not need to erase the earlier state in any principled sense. You need only make the current signal sufficiently dominant that, after the next normalization and projection, the model reads the direction you care about more strongly than the accumulated residue surrounding it.
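A minimal sketch of why force works: cosine similarity with a target direction is what survives a later normalization, and it rises with write strength even though the earlier state is never erased. The burdened state and the target direction are hypothetical vectors chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def write(state, direction, strength):
    """Additive residual write of `strength * direction` into `state`."""
    return [s + strength * d for s, d in zip(state, direction)]


# Hypothetical burdened state: mass spread over directions unrelated to the target.
burdened = [0.0, 3.0, -2.0, 4.0]
target = [1.0, 0.0, 0.0, 0.0]  # unit vector standing in for an output-legible direction

alignments = []
for strength in (0.5, 2.0, 8.0, 32.0):
    c = cosine(write(burdened, target, strength), target)
    alignments.append(c)
    print(f"write strength {strength:>5}: cosine with target = {c:.3f}")
```

The residue is still present at every strength. The write simply grows until the residue stops mattering to the direction of the sum.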
This is why I think norm growth should be interpreted as compensation rather than pathology in the narrow sense. The pathology lies earlier, in the failure to factor the state. The norm growth is what a competent optimizer discovers after that failure has already been built into the transport layer.
Put differently, once you force many roles through one additive container, norm growth is one of the few degrees of freedom left for later layers to reclaim agency.
There is a useful connection here to the recent Neural Thickets result by Gan and Isola, which argues that large pretrained models live in neighborhoods dense with diverse task-improving specialists rather than isolated single-task optima. I do not read that paper as proving my claim, but its result gets easier to picture from this vantage. If later computation is organized around output-side leverage rather than around preservation of a globally clean internal state, then “task vectors” should not be expected to behave like lonely axes in weight space. They should instead be surrounded by nearby specialist perturbations that redistribute force across neighboring output-relevant directions.
More strongly, those specialists are forced to carry their own redundancy. In a better-factored architecture, some of the support required for a task would live in explicit channels: separate context state, cleaner transport, less polluted intermediate memory, less need to repeatedly restate what matters. But in the present design, much of that redundancy has to be rebuilt inside the perturbation itself. A local specialist cannot rely on the architecture to carry clean task support forward for free. It has to push enough aligned structure into the same overloaded residual bus to survive the surrounding contamination, normalization, and readout. That makes it much easier to understand why a pretrained neighborhood can be dense with specialists. They are not tiny, perfectly isolated task needles. They are nearby redundant ways of forcing the model into sufficiently similar output-side behavior.
Once useful linearity concentrates at the residual-to-LM-head interface and later layers lean on amplitude to punch through a burdened state, it stops sounding pathological that local Gaussian neighborhoods around pretrained weights are full of nearby “experts” instead of one canonical solution. The task manifold near a pretrained model probably should look thick.
The growth does not even need to be uniform to matter. The important issue is not whether every layer monotonically increases the global norm by the same amount. The important issue is that the stack develops regions in which stronger writes become functionally necessary. My own traces already point in that direction. The dramatic magnitudes are not evenly distributed. They show up where the model appears to be doing expensive representational work: early repair, late alignment, and selective internal recoding. That is exactly where one would expect the additive bus to become most burdensome.
At this point, an objection usually appears. Perhaps the growing norm is simply a benign artifact of depth, or a byproduct of optimization that does not itself perform meaningful computational work. I do not think that objection survives contact with the rest of the evidence. If the norms were drifting upward in a way that was geometrically neutral, then one would not expect the late stack to become so strongly entangled with output preparation, nor would one expect final-layer interventions and output-space rotations to matter so much. But they do matter. They matter precisely because the model is using the residual stream directionally at readout, and because strength in those directions buys it practical leverage over what came before.
This is where the LM-head perspective becomes indispensable. Once the embedding-transition has occurred, the question is no longer whether the hidden state still resembles the input embedding in some global sense. The question is how strongly the current residual projects into output-relevant subspaces. A larger norm is not automatically meaningful, but a larger norm aligned with readout-legible directions is an effective way of making the later computation count.
That puts the null space back in the middle of the story. When the model needs a decisive output-side move against accumulated null-space residue, it can either factor that mass out cleanly or overpower it, and the present architecture makes the second option much easier than the first.
So when we observe norm growth across depth, I do not think the right reaction is simply “something unstable happened.” A more serious reading is that the model is paying an architectural tax. It is using amplitude to compensate for the fact that the residual stream has become an overloaded medium in which earlier contamination remains present, later control is expensive, and clean separation was never offered in the first place.
If norm growth is compensation, then one should expect the last part of the stack to use that leverage in a very particular way. Not by cleaning the state in any idealized sense, but by writing so forcefully into output-aligned directions that the LM head can still make sense of the result.
Late Layers Learn To Shout
Up to here the picture has been a bit abstract; this is where it gets concrete. If late layers inherit a residual already crowded with transport, repair, routing residue, deferred computation, and side-channel structure that partly lives in the null space of the current readout, they face a blunt problem: how do you make the final representation legible for unembedding when nobody ever gave you a clean workspace?
My working answer is that they learn to shout — not as a flourish, but as a description. Late layers seem to use amplitude, alignment, and leverage in residual space to force the state into output-legible directions strongly enough that the LM head can still read a coherent answer through the junk underneath.
That is also where the tidy distinction between “reasoning” and “cleanup” starts to wobble. In the later stack, those two activities are entangled. A layer may genuinely be carrying the computation forward, but if it is doing so in a residual stream already occupied by accumulated scaffolding, then the act of making the next step count is also an act of representational domination. The block is not merely adding another thoughtful refinement. It is writing in such a way that its preferred directions survive the rest of the state.
That is exactly the setting where null-space interaction stops being academic. Earlier layers can leave behind structure that is quiet under the present readout but still geometrically present. Later blocks do not have the luxury of pretending that structure is absent. If they cannot factor it out cleanly, they must route around it, exploit it, or overpower it. A strong output-aligned write does not need to annihilate the older state everywhere. It only needs to dominate enough of the readout-relevant geometry that, after normalization, the LM head sees what the late stack needs it to see.
This also clarifies why late-layer effects often look strangely disproportionate. If one imagines that each layer is contributing a roughly equal semantic increment, then a large terminal write appears pathological. But if one instead views the late stack as the point where burdened internal state must be made legible to a linear readout, then disproportion is exactly what one should expect. The architecture has deferred too much representational bookkeeping into a shared bus. By the end, force is cheaper than elegance.
It also explains a pattern that shows up repeatedly in refusal and abliteration work: most of the actionable energy tends to concentrate in the later layers. That concentration is not accidental, and it is not merely an artifact of where people happened to look first. If refusal, compliance, or any other strongly output-legible behavior must ultimately be expressed at the residual-to-LM-head interface, then the later stack is exactly where one should expect the clearest, most intervention-sensitive geometry to live. Earlier layers may be necessary for setting the route, gathering context, and preparing the transport. But the later layers are where the model is forced to cash the computation out into directions the readout can actually use. Of course most of the “juice” is there. That is where the stack finally has to stop implying and start saying.
My own notes already point in this direction. The large late MLP corrections do not read like delicate semantic refinements. They read like final-stage geometric coercion. In my intervention and probing work, the useful linearity is consistently strongest at the residual-to-LM-head interface, not in the raw input embedding geometry. That is a crucial asymmetry. It means the late stack is not merely handing off whatever the model “thought” in some abstract sense. It is actively shaping the state into a basis the readout can consume.
Once you seat the story there, the LM head itself starts to look different. A linear unembedding does not need the entire residual state to be pristine. It needs sufficient directional dominance in the relevant output subspaces. If the late stack can concentrate enough aligned mass into those subspaces, the readout works. Everything else can remain in the background as long as it does not interfere too destructively after the final normalization. That is one reason the architecture tolerates so much internal pollution. It is not solving the stronger problem of maintaining a clean global representation. It is solving the weaker but still useful problem of making the final readout locally decisive.
This is also why late-layer output behavior can coexist with a remarkably dirty backward story. The forward pass only asks that the final state be interpretable enough for the LM head. It does not demand that the route by which that state was assembled be easy to invert, easy to attribute, or easy to train against cleanly. A model can become extremely competent at producing a final output-aligned shove while remaining terrible at preserving a tidy credit-assignment structure for the optimizer.
That is the real sense in which late layers learn to shout. It’s not mere volume, but compensation for being handed one burdened additive medium, told to preserve everything useful that came before, and then ordered to deliver something sharply legible to a linear readout. The last piece on the forward pass is what happens right before that readout — if late layers win by building output-aligned directional dominance, the final normalization step matters enormously because it rescales that dominance on the doorstep of unembedding.
RMSNorm Snaps The Whole Thing Onto A Hypersphere
Right before unembedding, the model performs one final act of geometric discipline: it normalizes the residual state.
That normalization matters far more than is usually acknowledged in casual discussions of transformer behavior. If the late stack has spent its effort constructing a residual whose decisive feature is directional alignment with output-relevant subspaces, then the final norm operation is not a minor cleanup. It is the gate through which all of that accumulated structure must pass before the LM head reads it.
In models with RMSNorm, the effect is conceptually simple. Magnitude is suppressed as an independent degree of freedom, and the state is rescaled according to its root-mean-square amplitude. What survives most faithfully is directional structure. In geometric terms, the model takes a vector assembled through many additive writes and snaps it back toward a hyperspherical regime immediately before unembedding.
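In code, the operation is small. The sketch below is RMSNorm without the learned per-coordinate gain and with the stabilizing epsilon set to zero, both simplifications made here so that the geometry is exact. Production implementations include a gain and a small positive epsilon, which make the scale invariance approximate.

```python
import math


def rms_norm(x, eps=0.0):
    """RMSNorm without a learned gain: divide by the root-mean-square amplitude.

    eps=0.0 keeps the demonstration exact; it would divide by zero on an all-zero input.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def l2(x):
    """Euclidean norm."""
    return math.sqrt(sum(v * v for v in x))


x = [3.0, -4.0, 12.0, 0.0]
quiet = rms_norm(x)
loud = rms_norm([100.0 * v for v in x])

# Scale is discarded: the loud and quiet states normalize to the same vector.
assert all(math.isclose(a, b) for a, b in zip(quiet, loud))
# The output sits on a sphere of radius sqrt(d), here sqrt(4) = 2.
assert math.isclose(l2(quiet), math.sqrt(len(x)))
print("normalized:", [round(v, 4) for v in quiet])
```

A hundredfold difference in amplitude vanishes at this step. Only the direction of the state reaches the readout.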
That does two things at once. On the forward pass it is extraordinarily useful in that it prevents raw amplitude from exploding uncontrollably into the readout. It regularizes the interface between the late residual stream and the linear LM head. It allows the model to exploit norm growth internally without having the final logits become a direct referendum on absolute hidden-state scale. In that sense, RMSNorm is part of why the architecture works as well as it does. It rescues the readout from some of the excesses of the transport layer beneath it.
On the backward pass the story gets uglier. By the time the loss is computed, the final residual has already been through a normalization that redistributes magnitude across coordinates and emphasizes direction at the last possible moment. The optimizer is therefore not sending blame backward through the same geometry the forward stack used while it was building leverage. It is sending blame through a representation that has just been reconditioned at the interface to the LM head.
This is what I mean when I say the gradient becomes disjoint right at the end.
The late stack may have won forward control partly through amplitude and partly through alignment. But the loss is taken after a final operation that suppresses the autonomy of raw magnitude and re-expresses the state in a normalized form. The optimizer then has to infer, from the post-normalized error signal, which earlier writes, which amplitudes, which cleanup terms, and which latent null-space interactions were actually responsible for the final behavior.
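One concrete piece of this can be checked numerically. If the loss depends on the state only through an ungained, epsilon-free RMSNorm, then rescaling the state leaves the loss unchanged, so the gradient has no component along the state itself. The sketch below verifies that with finite differences. The loss, the readout row, and the state are toy assumptions, and a real final norm with gain and epsilon satisfies this only approximately.

```python
import math


def rms_norm(x):
    """RMSNorm without gain or epsilon."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


def loss(x, readout):
    """Toy loss: squared error of one linear readout of the normalized state against 1.0."""
    y = rms_norm(x)
    logit = sum(w * v for w, v in zip(readout, y))
    return (logit - 1.0) ** 2


def numerical_grad(f, x, h=1e-6):
    """Central-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


x = [2.0, -1.0, 0.5, 3.0]
readout = [0.2, 0.4, -0.1, 0.3]  # hypothetical readout row

g = numerical_grad(lambda z: loss(z, readout), x)
radial = sum(gi * xi for gi, xi in zip(g, x))  # gradient component along x
grad_norm = math.sqrt(sum(gi * gi for gi in g))

print(f"gradient norm:    {grad_norm:.4f}")
print(f"radial component: {radial:.2e}")
```

The gradient is far from zero, but its radial part is numerically nil. Whatever amplitude the forward pass used to win, the loss sends back no direct signal about it.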
So you do not get a clean inversion problem — you get a badly conditioned attribution problem, which also helps explain why norm-preserving interventions matter so much when you edit late-layer behavior. If the final interface is sensitive primarily to directional structure after RMS normalization, then a crude magnitude-changing intervention will not behave locally. It will bleed into unrelated coordinates through the normalization step and create collateral movement in output space that has little to do with the intended edit. By contrast, an isometric or near-isometric modification can alter which direction the state points toward while leaving the final normalization geometry comparatively intact. The architecture itself is telling you what kind of edit it prefers.
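A small sketch of that difference, assuming an ungained RMSNorm and an invented four-dimensional state. One edit rotates the state within a two-coordinate plane, which preserves its norm. The other adds mass to a single coordinate. The comparison is how far the untouched coordinates move after normalization.

```python
import math


def rms_norm(x):
    """RMSNorm without gain or epsilon."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


def rotate_plane(x, i, j, angle):
    """Rotate x in the (i, j) coordinate plane; an isometry, so the norm is unchanged."""
    out = list(x)
    c, s = math.cos(angle), math.sin(angle)
    out[i] = c * x[i] - s * x[j]
    out[j] = s * x[i] + c * x[j]
    return out


def add_along(x, i, amount):
    """Crude edit: add `amount` to coordinate i, which changes the norm."""
    out = list(x)
    out[i] += amount
    return out


x = [1.0, 2.0, -1.0, 0.5]
base = rms_norm(x)
rotated = rms_norm(rotate_plane(x, 0, 1, math.pi / 3))
pushed = rms_norm(add_along(x, 0, 6.0))

UNTOUCHED = (2, 3)  # coordinates that neither edit targets
drift_rotated = max(abs(rotated[k] - base[k]) for k in UNTOUCHED)
drift_pushed = max(abs(pushed[k] - base[k]) for k in UNTOUCHED)

print(f"collateral drift, rotation:      {drift_rotated:.2e}")
print(f"collateral drift, additive push: {drift_pushed:.3f}")
```

The rotation leaves the untouched coordinates where they were, up to floating-point rounding. The additive push changes the RMS, and the normalization then rescales every coordinate, including ones the edit never touched.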
The hypersphere picture is therefore not decorative. It clarifies the bargain the model has made with itself.
Internally, the transformer tolerates a residual stream that is burdened, additive, and full of mixed-purpose state. Late layers recover control by writing in output-legible directions with enough force to matter. Then RMSNorm imposes a last-minute geometric treaty: whatever internal amplitude games were played, the final readout will happen on normalized terms.
The bargain is good enough for inference and lousy for sending a clean signal back. The optimizer never sees a transparent trail from final logits back to the individual writes that produced them. It sees the state after it has been accumulated, entangled, partially dominated by late-layer pushes, and then renormalized just before unembedding. The final signal is perfectly usable for learning in the broad empirical sense. But usable is not the same as well-engineered.
This is why I think it is a mistake to treat the final norm as a harmless detail. RMSNorm is one of the key reasons the forward pass remains legible at all, and one of the key reasons the backward pass inherits such a distorted assignment problem — the architecture solves its readout problem by creating a training problem. Once you sit with that, the optimizer stops looking accidental. If the stream is overloaded, later layers compensate by writing hard into output-aligned directions, and the final readout goes through a last-minute normalized projection, then of course the gradient coming back is noisy, partial, and badly mixed. At that point I am less interested in whether “the optimizer matters” and more interested in how much of modern transformer training is really optimizer-assisted recovery from a transport layer that should have been designed differently.
Why The Gradient Signal Is Garbage
By now the optimizer story should look less like taste and more like coping with bad architecture. Backprop is being asked to recover a usable signal from a state that has already been through additive accumulation, null-space mess, late-layer coercion, and final normalization — the gradient is not “wrong” in any formal sense, but it comes back badly mixed, without the clean factorized blame you would actually design for. That is why I say the noise is deeper than minibatch variance. One residual container has been used for too many incompatible jobs, and then the optimizer is supposed to look at post-readout error and somehow sort transport from routing from deferred computation from cleanup from the few directions that actually deserved the credit.
This is also where the optimizer split — Muon-ish behavior in the middle of the stack versus AdamW at the interfaces — starts to make sense as more than a Twitter argument. Muon works on hidden layers because its geometry is about directions in matrix space. The Newton-Schulz style update is a directional object, happy to respect structured direction without fussing over every coordinate’s scalar story, which is a natural fit for the middle of the network where the model is mostly learning transport, routing, and rewriting inside the residual-stream “universe.”
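To see what "a directional object" means in practice, here is a minimal sketch of orthogonalizing an update matrix. It uses the classic cubic Newton-Schulz iteration, which is an assumption made for clarity: the Muon recipe referenced at the end of this post uses a tuned polynomial and applies the step to momentum, neither of which is reproduced here. The test matrix is synthetic, built with known singular values so the effect is easy to read.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=20):
    """Approximate the orthogonal polar factor of G (classic cubic iteration)."""
    X = G / np.linalg.norm(G)            # Frobenius scaling keeps singular values in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # drives every singular value toward 1
    return X

# Synthetic "gradient" with known, uneven singular values.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
singular_values = np.array([1.0, 0.7, 0.4, 0.2])
G = U @ np.diag(singular_values) @ V.T

update = newton_schulz_orthogonalize(G)
after = np.linalg.svd(update, compute_uv=False)  # singular values of the update
```

The result keeps the singular directions of `G` and flattens its singular values, so the per-direction magnitude information is deliberately thrown away. That is tolerable for a hidden matrix and, on the argument of this section, much less so for an interface matrix.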
The embedding and the LM head are not like that; they are interface objects, the ingress and egress. Those are exactly the places where sign, basis, and coordinatewise agreement matter most, because these layers do not merely move information around internally. They define how internal state attaches to discrete symbols on the way in and how it is cashed out into logits on the way out.
That makes them much more fragile under updates that preserve only direction in the broad matrix sense while discarding the finer coordinatewise structure.
This is why I think the Muon/AdamW split is not an arbitrary implementation detail. Muon is effective in the intermediate layers because the intermediate layers can tolerate, and often benefit from, a geometry-aware optimizer that mainly respects update direction. AdamW remains necessary at the embedding and LM head because those boundary objects are sign-sensitive and coordinate-sensitive in a way the middle of the network is not.
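In training code the split usually shows up as a parameter-routing rule. The sketch below is illustrative only: the parameter names, shapes, and the `choose_optimizer` helper are hypothetical, and the rule shown (hidden 2D matrices to Muon, everything else to AdamW) is one common reading of the recipe rather than a fixed standard.

```python
def choose_optimizer(name, shape):
    """Route a parameter to 'muon' or 'adamw' by role and shape (hypothetical rule)."""
    is_interface = name.startswith(("embed", "lm_head"))  # ingress and egress
    is_matrix = len(shape) == 2                            # gains and biases are not
    return "muon" if (is_matrix and not is_interface) else "adamw"

# Hypothetical parameter table for a tiny model: name -> shape.
params = {
    "embed.weight": (32000, 512),
    "blocks.0.attn.qkv.weight": (1536, 512),
    "blocks.0.mlp.up.weight": (2048, 512),
    "blocks.0.norm.gain": (512,),
    "lm_head.weight": (32000, 512),
}

groups = {name: choose_optimizer(name, shape) for name, shape in params.items()}
```

The embedding and LM head are matrices too, so shape alone does not decide the routing. The rule has to know which matrices are interfaces, which is the point of the paragraph above.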
Model merging makes the same point from another angle. TIES-style merging only works when there is enough sign agreement that multiple deltas are reinforcing the same local direction. Without that agreement, averaging is not synthesis; it is cancellation. DARE-TIES therefore needs sign agreement not as a cosmetic heuristic but because the merge is trying to preserve a coherent local update in a basis where opposite signs often mean genuinely incompatible edits.
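The cancellation point is easy to show with two made-up deltas. This is a simplified sketch of the sign-election step only: it omits the trimming step of TIES and the random dropping of DARE, and the function name `sign_elect_merge` is hypothetical.

```python
import numpy as np

def sign_elect_merge(deltas):
    """Average task deltas, keeping only entries that match the elected sign."""
    deltas = np.asarray(deltas, dtype=float)
    elected = np.sign(deltas.sum(axis=0))          # per-coordinate sign with the most mass
    agrees = (np.sign(deltas) == elected) & (elected != 0)
    counts = np.maximum(agrees.sum(axis=0), 1)     # avoid dividing by zero
    return (deltas * agrees).sum(axis=0) / counts

# Two made-up task deltas over three coordinates.
deltas = [
    [2.0,  1.0,  1.0],
    [2.0, -3.0, -1.0],
]

naive = np.mean(deltas, axis=0)    # plain averaging
merged = sign_elect_merge(deltas)  # averaging among sign-agreeing entries only
```

On the first coordinate the deltas agree and both methods return the same value. On the second, plain averaging dilutes the stronger edit while sign election keeps it whole. On the third, the edits are equal and opposite, and nothing survives either way.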
That is extremely close to what happens at the embedding and readout interfaces. If two updates disagree only at a high level but fight in sign on the coordinates that actually attach tokens or project logits, you should not expect a clean merge. The same underlying issue shows up in optimization. A hidden matrix can often absorb a directionally good update even if some local coordinate details are washed out, because the downstream network can continue to rewrite and re-express that transport. The embedding and LM head do not have that luxury. They are the basis-setting and basis-reading surfaces. At those surfaces, sign agreement is not incidental. It is part of what makes the update meaningful at all.
This also explains why interface layers feel disproportionately brittle in practice. Once the model’s useful linearity has migrated to the residual-to-lm_head side, and once RMSNorm has made the final readout primarily directional, the optimizer no longer has a generous margin for sloppy interface edits. A bad hidden-layer update may be recoverable because later layers can route around it, overpower it, or rewrite it. A bad readout-side update corrupts the very basis in which the model is trying to express the final answer.
So when people say AdamW is still required at the embedding and LM head, I do not hear that as an arbitrary recipe. I hear it as a confession about where the architecture remains least factorized. Those are the places where the optimizer still needs per-coordinate memory, sign sensitivity, and magnitude adaptivity because the model has not provided a cleaner representational contract.
This is why I think the usual description of AdamW as a boring default undersells what it is doing. In the present architecture, AdamW is not merely smoothing stochastic gradients. It is often acting as a recovery mechanism for a credit-assignment problem that has already been geometrically distorted before the gradient even arrives.
Once you admit that, the rest is hard to dodge. The KV-cache mess, the late-layer concentration of intervention energy, the odd optimizer split, brittle interface edits, the need for sign agreement in merges, and the general sense that “clean gradients” in transformers are more prayed for than engineered — those are all different camera angles on the same mistake. We made one additive state container carry far too much of the model’s internal life.
Why The Crimes Of The Tokenizer And Sparse Attention Make This Worse
At this point the architecture is already in trouble even before we say anything specific about tokenization or attention. But those choices do not merely add independent inefficiencies. They intensify the burden on the same residual transport layer.
Start with what I keep calling the crimes of the tokenizer. A BPE token is usually not a semantically complete unit. It is a compressed surface fragment that arrives with missing information: whether it is functioning as part of a larger word, what language it belongs to, what domain it is embedded in, what syntactic role it is playing, and often what semantic sense is actually active. Earlier I described this as underspecification. What matters here is the consequence. The model must repair that underspecification somewhere, and in the present design it performs much of that repair by writing inferred context back into the same residual stream that is already responsible for transporting meaning forward.
So the crimes of the tokenizer are not a separate preprocessing inconvenience. They are one of the first sources of residual pollution. More than that, they are the most natural explanation for why the first major norm spike appears so early. The earliest layers are not yet doing deep semantic synthesis in any grand sense; they are trying to turn underspecified fragments into usable state. If the input arrives as a broken or partial semantic unit, the model has to inject missing role, boundary, language, and domain information immediately. In an additive architecture, that repair appears as a large write. The first big norm event is therefore not surprising. It is the system paying its tokenization debt up front.
The support-size story in attention compounds the mess instead of relieving it — if each query head generally has a limited number of active supports, attention is already telling you something loud. It says most prior positions do not matter to the present computation except insofar as a small minority of them provide diagnostic context, structural anchors, or semantically decisive retrieval. That should have been an opportunity for architectural economy. Instead, because the relevant state is bundled into layerwise continuous representations, the model must preserve an enormous cache of mixed-purpose objects so that each head can read a few highly diagnostic locations from it.
There is also a geometric subtlety worth naming. The attention selection problem is, in an important sense, jagged. A head wants to identify a small support and form a convex combination over a few genuinely relevant vertices. Left to itself, that geometry looks more like a polytope with hard facets than a smooth diffuse cloud. But softmax is the relaxation that makes the whole thing trainable. Its job is to round those jagged edges enough that gradient can pass upstream through the score path. That is not a side detail. It is one of the reasons mixing becomes so pervasive. The architecture buys differentiability by replacing a sharper combinatorial selection with a smoother weighted blend, so even before the residual stream re-entangles everything downstream, the attention mechanism has already been encouraged to transact in softened mixtures rather than crisp support boundaries.
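The relaxation can be seen by sharpening a single attention row. The sketch below assumes made-up logits for one query and a hypothetical definition of "active support" (the smallest set of positions covering 90% of the attention mass). That definition is an assumption for illustration, not necessarily the one behind the support counts quoted earlier.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()                       # stabilize the exponentials
    w = np.exp(z)
    return w / w.sum()

def active_support(weights, mass=0.9):
    """Smallest number of positions whose weights cover `mass` (assumed definition)."""
    ordered = np.sort(weights)[::-1]
    return int(np.searchsorted(np.cumsum(ordered), mass) + 1)

# Made-up logits for one query over eight positions.
scores = np.array([2.0, 1.8, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0])

soft = softmax(scores, temperature=1.0)    # the smooth blend used in training
sharp = softmax(scores, temperature=0.05)  # close to a hard selection

support_soft = active_support(soft)
support_sharp = active_support(sharp)
```

As the temperature drops, the weights collapse toward a single vertex of the simplex and the support shrinks. At ordinary temperature the same scores spread mass over several positions, which is the softened mixture the paragraph describes.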
Worse — and this one stings in training — the same sparsity that helps the forward pass hurts the backward one. Only a tiny fraction of positions receive substantial attention, so only a tiny fraction participate strongly in the score-path gradient for any given query. That sparse selection would already create a brittle training geometry on its own. But in a residual transformer the sparse attention signal is then coupled to an additive stream that has to absorb both the retrieved information and the side effects of having retrieved it. So the architecture gets the worst of both worlds: sparse, selective reads on the attention side, and globally mixed writes on the residual side.
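The backward-pass cost of sparse reads follows directly from the softmax Jacobian. In the sketch below the logits and the upstream gradient are made up; the only formula used is the standard one, in which the gradient on score j is the attention weight p_j times a centered upstream term.

```python
import numpy as np

def softmax(scores):
    z = scores - scores.max()
    w = np.exp(z)
    return w / w.sum()

def score_gradient(scores, grad_out):
    """Backprop through softmax: dL/ds_j = p_j * (g_j - sum_i p_i * g_i)."""
    p = softmax(scores)
    return p * (grad_out - p @ grad_out)

# Made-up logits: two positions dominate, six are nearly ignored.
scores = np.array([6.0, 5.0, 0.0, -1.0, -1.5, -2.0, -2.5, -3.0])
# Made-up upstream gradient dL/dp, deliberately larger on the ignored positions.
grad_out = np.array([0.5, -1.0, 2.0, -2.0, 1.5, -0.5, 1.0, -1.5])

p = softmax(scores)
g = score_gradient(scores, grad_out)

# Fraction of the total score-path gradient carried by the two attended positions.
share_on_top2 = np.abs(g[:2]).sum() / np.abs(g).sum()
```

Every score gradient is multiplied by its own attention weight, so positions the head barely reads receive almost no learning signal through the score path, no matter how large the upstream gradient on them is.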
The result is a cascading tax. The tokenizer under-specifies the input, so early layers infer and rewrite missing structure; attention only lights up a small active support, which is an indirect admission that most historical state did not need to be kept in such rich detail for this query; but because that support lives inside a polluted additive medium, the useful signal gets written back into the same overloaded stream alongside prior repairs, hints, latent scaffolding, and output prep. By the time you reach the late layers you are not following a clean semantic trajectory — you are reading a settlement ledger of local repairs with almost no global guarantees.
This is why I do not think you can understand long-context systems work in isolation from representational design. Sparse attention by itself is not the villain. Tokenization by itself is not the villain. The villain is that both of them force additional work into the one part of the architecture that least deserves it: the shared additive residual bus.
Seen this way, the cache-compression literature looks a little different. Methods succeed partly because hardware likes smaller tensors, but also because the cache is full of state that is contingent, redundant, weakly read, or only useful to particular downstream phases — the architecture keeps forcing local fixes into the transport layer, and compression work keeps discovering how much of that layer was never a clean semantic necessity. That is why the current wave of cache papers is both impressive and, to my eye, slightly damning. It shows the system can survive brutal reductions in what it stores, and it also shows we have been storing the wrong object all along.
The Cache Is Compressible Because The State Is Structured
By this point the compression result ought to feel almost obvious. If the KV cache were an irreducible semantic record, aggressive compression should be brittle — tiny distortions should break exactly the bits the model needs, and shaving bits should cost you immediately in behavior. That does not look like the world we live in. Quantization, sparsification, truncation, “random guess” neighborhoods that stay task-rich, late-layer intervention energy staying concentrated in a comparatively narrow slice of the stack — compression keeps working. In my own activation work the hidden state is not an incompressible dense tensor either; it is spectrally structured, and a fairly small frequency-domain summary can preserve meaningful behavioral geometry after you throw away most of the raw tensor. That pattern should have been a warning sign about how redundant and basis-robust the internal state really is.
The TurboQuant result makes the same point from a different angle. Their first move is not to protect some sacred coordinate system. It is to rotate the vectors before quantization so the geometry becomes easier to compress coordinatewise. That is an extremely revealing choice. A random rotation is only this useful if the signal is not attached to one fragile canonical basis. The method works because what matters survives geometric remapping. PolarQuant then goes further by separating radius from angle, which is another way of saying that the useful content of the vector is structured enough to survive a rather dramatic change of coordinates.
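Two properties make a random rotation a reasonable first move, and both can be checked on synthetic vectors. The sketch below is not TurboQuant's algorithm and does not quantize anything. It assumes a made-up key with one dominant outlier channel and a rotation drawn from the QR factorization of a Gaussian matrix, and it only illustrates that attention scores survive the rotation while the coordinate spread gets flatter.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Made-up key with one dominant outlier channel, plus a made-up query.
key = 0.1 * rng.standard_normal(dim)
key[0] = 10.0
query = rng.standard_normal(dim)

# Random orthogonal matrix from the QR factorization of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

def peak_to_rms(v):
    """Ratio of the largest coordinate magnitude to the root-mean-square."""
    return np.max(np.abs(v)) / np.sqrt(np.mean(v ** 2))

# Rotating query and key by the same matrix leaves the dot product unchanged.
score_before = query @ key
score_after = (Q @ query) @ (Q @ key)

# The rotation spreads the outlier's energy across all coordinates.
spread_before = peak_to_rms(key)
spread_after = peak_to_rms(Q @ key)
```

A flatter coordinate spread is what a coordinatewise quantizer can exploit, and the preserved dot product is why the basis can be changed at all. Whether a given quantizer actually gains from the rotation depends on its design, which this sketch does not model.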
That is exactly what you should expect if the cached object is a multiplexed residual state rather than a minimal semantic atom. Some of what gets stored is low-frequency or cross-layer structure, some is output-aligned direction, some is local repair work whose exact coordinates are negotiable as long as the downstream effect survives, and some lives in latent scaffolding that is query- or phase-conditional instead of universally necessary. Once you see that, compression successes stop looking miraculous — they read like outsiders noticing that the architecture has been preserving a richly structured object in an unnecessarily literal form.
That also helps with a tension that otherwise looks confusing. The model is touchy about late-layer geometry — small interventions in the wrong place can do a lot of damage — and yet the cache often tolerates far harsher compression than you would expect if every coordinate were equally precious. Both can be true because the decisive geometry is concentrated and structured, not smeared uniformly across the raw tensor; the problem is not “no important geometry,” it is important geometry braided together with contingent transport residue. Which is why I treat compressibility as a diagnostic, not just a systems trophy.
When a model tolerates spectral truncation, rotational remapping, low-bit quantization, and support sparsity far better than a naive dense-state story would predict, it is telling you the cache is not the smallest faithful record of what the model knows — it is a swollen transport artifact that still has enough regularity for heuristics to recover the signal. Put differently, compression works so well because the cache carries too much of the wrong kind of information, not because the right information was absent. That brings us back to the question that keeps orbiting. If the stored state is structured, redundant, and basis-robust, why are we still building models as if the only move is to transact everything through one additive residual medium? There is an architectural answer to that, but it helps to be clear first about what a transformer model is actually for.
LLMs Are Compression
I do not think the right reading of Yann LeCun’s position is that language models learn nothing, or that they are empty surface machinery. His own February 2026 paper with Hai Huang and Randall Balestriero, Semantic Tube Prediction, states the geometric claim directly. Token sequences “trace geodesics on a smooth semantic manifold.” The paper then argues that constraining hidden-state trajectories to a tube around that path can improve signal-to-noise and reduce trajectory collisions. That is not a story about hollow imitation. It is a story about real internal structure.
So I am not disagreeing with the “there is real geometry inside” story. The stronger clarification, for me, is that language models are neither the whole system nor fake. They are compression, and more specifically they are crystals.
What they hold is not the whole physical world in any direct sense — it is a compressed model of mind as it has been entangled with understanding: concepts, roles, analogies, stories, explanations, symbolic habits, social regularities, and the countless ways humans have learned to carve the world into usable distinctions. When the residuals decode into organized bags of related words, I do not see an embarrassment. I see facets.
That is also why these bags are not merely bags. In categorical language, they are closer to coends than to dictionary entries. A concept inside the model is not one rigid atom with one privileged presentation. It is a glued object assembled from many local uses and many contextual transformations, with the unstable details quotiented away and the transportable structure retained. The model learns what survives recombination. It learns which distinctions remain meaningful after repeated passage through paraphrase, analogy, syntax, task, and explanation. That is already a kind of world model, but it is the world of mind.
This is why I do not buy the easy dismissal that LLMs are just stochastic parrots. Parrots do not induce semantically coherent manifolds of transport. They do not produce hidden states that remain so structured under spectral truncation, rotational remapping, and aggressive quantization. They do not populate local neighborhoods with dense families of nearby specialists. Something real is there.
Grant that, and the architectural indictment gets sharper: the transformer took a real compressed cognitive object and then made it do far too much additional labor. It asked the same crystal to serve as semantic store, routing substrate, scratchpad, cleanup medium, and final output actuator. It made the same residual bus carry both the crystal and the machining debris produced while working on it.
That is why these systems can be so absurdly large and still feel wasteful. If one wants to say they are 90% too large, I think that is substantially closer to the truth than the claim that they contain nothing of value. The excess is not evidence of vacuity. It is evidence of architectural misuse. The model may be carrying something real, but it is carrying it in a form burdened by contamination, redundancy, and badly factored control.
So no, I do not think LLMs are the end of the road. But neither do I think they are a dead end. They are the first large compression crystals of mind we have managed to grow. The mistake is not believing in the crystal. The mistake is continuing to use it as the whole machine.
What a less wasteful architecture looks like is not a question I will get into here. The general shape is legible — separate channels for semantics and control, explicit memory distinct from transport, mechanisms that permit clean state separation rather than additive superposition — but that is a different post. The point for now is narrower. The transformer reached its empirical ceiling by building the largest possible compression object inside the most wasteful possible bus. A cleaner bus does not obviously require a smaller object.
References
- Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, et al. “Circuit Tracing: Revealing Computational Graphs in Language Models.” Anthropic / Transformer Circuits, March 27, 2025. https://transformer-circuits.pub/2025/attribution-graphs/methods.html
- Hai Huang, Yann LeCun, and Randall Balestriero. “Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA.” arXiv:2602.22617, February 26, 2026. https://arxiv.org/abs/2602.22617
- Yulu Gan and Phillip Isola. “Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights.” arXiv:2603.12228, March 12, 2026. https://arxiv.org/abs/2603.12228
- Google Research. “TurboQuant: Redefining AI Efficiency with Extreme Compression.” March 24, 2026. https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
- Biao Zhang and Rico Sennrich. “Root Mean Square Layer Normalization.” arXiv:1910.07467, October 16, 2019. https://arxiv.org/abs/1910.07467
- Chen Fan, Mark Schmidt, and Christos Thrampoulidis. “Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data.” arXiv:2502.04664, February 7, 2025 (revised December 5, 2025). https://arxiv.org/abs/2502.04664. Muon appears there as a spectral-norm steepest-descent special case; for the orthogonalized-momentum recipe and Newton–Schulz polar iteration used in training codes (including PyTorch’s Muon), see Keller Jordan, “Muon: An optimizer for hidden layers in neural networks,” blog post, 2024. https://kellerjordan.github.io/posts/muon/
- Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. “TIES-Merging: Resolving Interference When Merging Models.” arXiv:2306.01708, June 2, 2023; revised October 27, 2023. https://arxiv.org/abs/2306.01708