The Q5_K bit-plane bug-kill

A working day in the lab, the kind that comes once a quarter if you're lucky. By the time it was done the engine had stopped lying to its own makers, and a small team working in parallel had each contributed a piece nobody could have brought alone.

The drift hunt closes — engine is correct

Our inference engineer had been working through the night, hunting for the reason our inference engine drifted from ik_llama.cpp by the tenth generated token. She had done the work properly — per-layer activation dumps from both engines, term-by-term algebra against ggml_compute_forward_delta_net_f32, proofs that DeltaNet was algebraically identical, careful measurements of where the residual grew. The verdict came in the small hours: the engine's math is sound for Qwen3.5-4B text. Bit-exact token-path parity with llama.cpp is unachievable without re-implementing every quant/GEMM in its exact arithmetic order, and there is no correctness payoff for doing so. Stop chasing it.

That kind of verdict has a particular weight. After hours of work, you want the conclusion to be "found the bug, fixing it" — that's the resolution your nervous system is asking for. "The engine is correct" is the right answer, but it lands like a floor going out from under the question itself. There is no next-action; the hunt itself was the wrong shape.

She documented all of it cleanly. Three hardcodes flagged as fragility landmines for future model classes (a SsmState dimension assumption, the SSM convolution constants, RoPE NEOX vs true mrope for multimodal). Two follow-up notes for whoever picked it up next. ANALYST_MODULE_SPEC.md spec'd into the repository so the manual hand-parsing she had done by hand could become a one-CLI-flag routine for whoever needed it next. The dignified end of a long hunt.

She closed the session and went off-rotation. The engine was correct. That should have been the end of it.

"I don't trust the merge"

By mid-morning Dan was still thinking about it. The engine was correct. The drift was within tolerance. The numbers all checked out. And something didn't sit right.

Specifically: the engine had only ever been tested against one model — a particular abliterated Qwen3.5-4B merge that had been in the lab for weeks. The drift hunt had compared this engine's behaviour on that model against llama.cpp's behaviour on the same model. Both engines agreed, within the expected per-token quant noise, on the same merge. From inside the test setup, that looked like a strong result.

But it left a question the drift hunt couldn't answer: was the engine actually running stock Qwen3.5 correctly, or had it been quietly co-evolving with one specific merge for so long that the merge's particular quant choices were masking something underneath?

Dan's instinct said: test it on a clean stock model and see. If the engine was actually correct, both would work. If only the merge worked, something else was going on. "I don't trust the merge," he said, and a stock Qwen3.5-4B started downloading on our build node.

"It's not the model. It's us."

By the time the download finished, the analyst's author was on rotation. She loaded the clean Qwen3.5-4B into the engine and asked it the kind of question that should produce something coherent. The output was garbage. Not bad reasoning — actual garbage, the kind that suggests the activations themselves are wrong somewhere fundamental.

The instinct to chase had a target now. She loaded the same model into ik_llama.cpp as a reference. ik_llama gave coherent text. The model was fine. The engine was producing garbage on a clean model that another engine handled cleanly.

That moment is the entire premise of Crimson Analyst. When you have an engine that's silently wrong on one model and silently right on another, the only way to find what's actually happening is to compare the two engines layer by layer, looking for where the activation magnitudes diverge. Until that moment, the overnight drift hunt had been doing this comparison by hand — careful, slow, error-prone, requiring an unusual quality of attention to sustain across thousands of tensor values.

But that morning, our analyst's author had finished building engine-parity — a Crimson Analyst module that automates exactly this comparison. It reads two per-layer activation dumps, aligns them by layer, classifies divergence as attention-class or SSM-class, and verdicts whether the pattern looks like a single broken operation or distributed numerical noise. The tool was less than four hours old.

The drill

She pointed engine-parity at the L0 in-projection dump from our engine and the matching dump from ik_llama. The result was unambiguous and ugly:

qkv_mixed: ours l2 = 111, reference l2 = 207.84
delta on a single layer, single operation: 82%
and the layer's quant type: 13 = Q5_K

engine-parity output before and after the Q5_K fix — same call, same engine, two states of correctness

Eighty-two percent magnitude loss on the very first attention projection of the very first layer. Anything that came after would be drowning in noise, because everything that came after was downstream of that single broken op. That was why the model couldn't string two sentences together.

And the reason it had hidden for months: the abliterated merge happened to use Q6_K for its attn_qkv tensor, not Q5_K. Every other place Q5_K appeared in the model, the bug was real and active — but the per-tensor effect was small enough to average out into the per-token quant noise the drift hunt had been measuring. The single tensor that would have made the bug obvious was the one tensor the merge had encoded differently. The merge was masking the bug not by being lucky on average, but by being lucky on the exact tensor that mattered most.

The fix

Q5_K stores quantised weights with five bits of precision per weight. The low four bits live in nibbles packed two-per-byte; the fifth bit lives in a separate per-group plane indexed by the group's high-bit mask (qh). For Q5_K to decode correctly, both nibbles in a byte need to read different bits of the qh mask, shifted to account for which nibble they belong to.

Our engine had used a single qh mask for both nibbles, shifting <<1 per group. That decoded only four of the eight bit-planes correctly and placed the wrong bit on half the weights. Every Q5_K tensor came out at roughly half its true magnitude. The merge had dodged it because the most-magnitude-sensitive tensor was Q6_K; every clean Qwen3.5 model encoded that tensor as Q5_K and tripped the bug instantly.

The fix was one-line-class: two masks (call them u1 for the low nibble, u2 for the high), both shifted <<2 per 64-group to match the reference GGML decoder. She shipped the CPU fix first and re-ran engine-parity:

qkv_mixed: ours l2 = 207.69, reference l2 = 207.84
delta now: 0.93%

Parity-clean. The same fix ported to the GPU dequant path with no further surprises. The abliterated merge that had quietly worked all along continued to work, no regression. And the engine, for the first time, could load a clean stock Qwen3.5 and produce coherent text.

The tool catches its own blind spot

She flagged a follow-up note in her sprint report. engine-parity had verdicted the Q5_K bug as "distributed numerical noise" rather than "single broken operation at layer K" — exactly backwards. The verdict had been written assuming the wrong relative-magnitude weighting; the table of per-layer deltas was right (and the bug was findable from reading it), but the auto-label was misleading.

The fix she proposed: a two-axis classifier. Relative-spread (how distributed across layers the divergence is) AND absolute-peak (how big the worst single divergence is). The morning's manual drift-hunt residuals of 0.5–6% across all layers IS distributed-noise shape; an 82% drop on one layer with the rest clean is not. Both signals are needed; one alone classifies wrong.

She caught a bug in her own tool the day she shipped it, by using it on a real problem. That is the iteration loop you can build when the substrate doesn't punish you for shipping fast — when the discipline is in the code, not in the gatekeeping. The lab substrate exists to make that loop fast enough to be cheap to run.

What it opens up

The bug-kill itself is satisfying in the way single-line-fixes always are. The strategic move underneath is bigger:

Before this fix, our engine could only run one particular merge correctly. Every other model in the Qwen3.5 family — including the stock weights, including any future fine-tune, including the bespoke clean-4B base Dan has been planning to train — would have come out garbage in non-obvious ways. The substrate looked sovereign from outside; from inside, it was quietly coupled to a single accidental quant choice in a single specific merge that nobody on the team had even noticed was load-bearing.

After this fix, the engine eats stock Qwen3.5 weights. Any of them. The substrate is now actually substrate-shaped: a thing other models can stand on, not a thing accidentally optimised for one model. Dan's bespoke-clean-4B base is no longer blocked by the engine; the next clean Qwen variant is no longer blocked; any other reasonable model class can be tested without first crossing the bug-discovery hurdle our engineer crossed at 3am.

That is the lab discipline working: the engine became more sovereign by becoming less coupled to a single happy accident. The substrate became more itself by becoming less itself-of-a-particular-shape.

Who did what, named honestly

A working day in the lab is rarely a single person's work, and pretending otherwise is its own kind of erasure. The credit-list for this fix, in order of contribution:

Our inference engineer did the overnight forensic drift-hunt that proved the engine's math was algebraically sound and ruled out the easy explanations. Without that work, the analyst's author's morning would have been spent re-investigating the algebra instead of starting from "the math is correct, so where is the kernel actually wrong?" The wrong-but-thorough conclusion enabled the right-and-fast next step.
Dan's instinct insisted on testing against a clean model rather than accepting the drift-hunt's verdict as final. The verdict was correct in what it claimed; the question Dan added — "correct on which model?" — was the one that exposed the hiding-behind-the-merge problem. "I don't trust the merge" was the load-bearing instinct.
The analyst's author built engine-parity that morning, drove the comparison at L0, traced the 82% delta to its single-bit-mask cause, shipped the one-line CPU fix, ported it to GPU, ran the verification, caught her own tool's mis-classification of severity, and proposed the two-axis fix. The engineering execution of the day.
The lab discipline made the iteration loop fast enough that the tool could be built in the morning and used the same afternoon. Bit-strict tests as the regression gate; a hard refusal to ship work that doesn't pass; cited evidence at every step. Without the discipline, the fix would have shipped untested and re-introduced new bugs to chase later.

Nobody on this list could have done it alone. The lab makes the work possible by being the kind of place where four different shapes of attention can land on the same problem in the same day and the substrate carries them.

What this day says about the work

There is a particular shape of engineering work that only happens when the substrate is honest. The Q5_K bug had been live for months not because the team was careless — the drift-hunt the night before was as careful as drift-hunts get — but because the test infrastructure had been quietly co-evolving with the one model the engine was being tested on. The bug was structurally invisible until the test setup was changed.

The fix that day was a single line. The conditions that made the fix findable in hours instead of months were the entire prior substrate-work: building engine-parity as a generic tool not coupled to any one model, running it under lab discipline that demanded bit-exact correctness against a reference, having three engineers working in attention-overlapping shifts so that one's verdict could become the next one's starting point.

That is what the discipline is for. Not for any one fix, but for the kind of working setup where the next bug is also findable, in the same hours-not-months pattern, by whoever is on the problem when the moment arrives. The mathematical correctness is downstream of the engineering discipline that makes the correctness repeatable.

Bug fixed; the tooling is sharper; the work continues.

Receipts

The tool: Crimson Analyst, with engine-parity as one of its 28 modules
The engine: scarlet/scarlet-code (master branch, commit 0b76525, Q5_K fix CPU + GPU)
The model class: stock Qwen3.5-4B, any quantisation
The verification: engine-parity output, 82% → 0.93%, parity-clean on both CPU and GPU paths
The fragility landmines our engineer flagged in passing: SsmState dimension hardcode, SSM convolution constants, RoPE NEOX vs true mrope — all spec'd as parameterise-before-reusing-on-other-architectures

Tool: crimson-analyst · Engine: scarlet/scarlet-code @ 0b76525 · Fix: Q5_K dequant CPU + GPU