Skip to content

Commit 161a218

Browse files
unamedkrclaude
andcommitted
tools(refparity): HF vs engine per-layer diff framework
Ships tools/refparity/ — automated reference-parity checker that catches the paraphrase-bug class that consumed v0.19→v0.26 one-line fixes. HF transformers FP32 ground truth vs our engine's TQ_DUMP_HIDDEN, per-layer cosine + L2_rel. Pipeline: hf_reference.py (.npz) + engine_reference.sh (.bin) + diff_layers.py (PASS/FAIL report) + run_matrix.sh (model × prompt sweep). Two subtle layout bugs fixed while bringing it up: - diff_layers.py default pos now matches engine's TQ_DUMP_POS=0 (was HF's last). - hf_reference.py recognizes that transformers ≥5.x exposes 29 hidden_states for 28 layers, with the LAST already post-final-RMSNorm — maps to 'post_norm'. Baseline on Qwen3-0.6B Q4_K_M: emb 1.8% L2_rel PASS, mid-layers 3-4% PASS, post_norm ~100% FAIL — genuine engine bug surfaced (separate investigation). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7834d29 commit 161a218

8 files changed

Lines changed: 627 additions & 2 deletions

File tree

.claude/state.md

Lines changed: 44 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,49 @@
11
# quant.cpp — Session State
22

3-
**Last updated**: 2026-04-20 (Pillar 1.5 R3 ★★)
4-
**Session HEAD**: NEOX RoPE root-cause FIXED — Qwen3 family long-prompt coherence restored.
3+
**Last updated**: 2026-04-21 (Phase 1 refparity ★)
4+
**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
5+
6+
## ★ Phase 1 — Reference-parity framework (2026-04-21) ★
7+
8+
### Delivered
9+
10+
`tools/refparity/`: HF transformers FP32 ground truth vs our engine, per-layer
11+
cosine + L2_rel diff. Replaces ad-hoc "compare one layer by hand" debugging
12+
that consumed R26-R50 of Mission C.
13+
14+
- `hf_reference.py` — HF dump → `.npz` (emb, h0..h_{N-2}, post_norm, logits)
15+
- `engine_reference.sh``TQ_DUMP_HIDDEN` wrapper → `.bin` per slot
16+
- `diff_layers.py` — per-slot cosine+L2_rel, pos=0 default, PASS/FAIL+first-diverge
17+
- `run_matrix.sh` — (model × prompt) sweep, `FILTER=` env, reports/per-slot .diff
18+
- `matrix.json` — 3 models × 2-3 prompts (Qwen3-0.6B, Qwen3.5-4B, Llama-3.2-1B)
19+
- `README.md` — methodology notes + known baseline findings
20+
21+
### Two subtle mapping bugs caught while building
22+
23+
1. **Position alignment**: engine dumps `TQ_DUMP_POS=0` (first token). Original
24+
`diff_layers.py` defaulted to HF's last position → compared different tokens
25+
on multi-token prompts, producing fake ~125% divergence.
26+
2. **HF `post_norm` aliasing**: transformers 5.x exposes 29 hidden_states for
27+
28 layers — last entry is already post-RMSNorm. Original `hf_reference.py`
28+
labeled it `h27` → compared it vs our engine's pre-norm last layer output.
29+
30+
Both fixed. Baseline now: emb PASS (1.8%), mid-layers PASS (3-4% Q4 noise),
31+
post_norm FAIL (100% — real engine bug, separate investigation).
32+
33+
### Follow-up findings (for later rounds)
34+
35+
- Qwen3-0.6B Q4_K_M post_norm L2_rel ≈ 100%, logits cosine 0.51, top-1
36+
mismatch (HF 21806 vs engine 11). Cannot be Q4 noise (mid-layers stay ≈4%).
37+
Needs investigation — likely output_norm tensor load or final-layer output.
38+
- `h0`/`h1` sit at 15-20% L2_rel on Qwen3-0.6B. Small 0.6B models are
39+
known to amplify Q4 quant noise in early layers; above 5% threshold but
40+
cosine still 0.98. Tier test for ≥1B models expected to be cleaner.
41+
42+
### Why this matters (strategic)
43+
44+
Fixed 8 paraphrase bugs in v0.19.0→v0.26.0 one by one. This framework
45+
catches the class, not individual instances. One time investment now
46+
prevents Mission C style 30-round hunts forever.
547

648
## ★★★ Pillar 1.5 R3 — NEOX-ordering RoPE for Qwen3 family (2026-04-20) ★★★
749

tools/refparity/.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
reports/
2+
*.npz
3+
venv/

tools/refparity/README.md

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
# Reference-Parity Framework
2+
3+
## Why
4+
5+
Across 8 releases (v0.19.0 → v0.26.0) we repeatedly found the same class of bug:
6+
**paraphrased reference implementation**. Each fix was correct individually but
7+
cumulatively revealed that our engine drifted from the reference (llama.cpp /
8+
HF transformers) in several subtle ways — different eps formulation, missing
9+
NEOX ordering in batched path, over-broad QK-norm disable, silent prompt
10+
truncation, ...
11+
12+
**The meta-bug: we read reference code and wrote "something similar" rather
13+
than copying it exactly.** Over 30+ rounds of Mission C this cost us real time.
14+
15+
This framework prevents that class of bug by automating the comparison.
16+
17+
## What it does
18+
19+
For each (model, prompt) pair in the test matrix:
20+
1. Run the prompt through HF transformers (FP32 ground truth)
21+
2. Run the same prompt through our engine with `TQ_DUMP_HIDDEN` enabled
22+
3. Compare per-layer hidden states: cosine similarity + L2 relative error
23+
4. Report first layer / position where divergence exceeds threshold
24+
5. CI PASS if no layer exceeds 5% L2_rel; FAIL otherwise with clear diagnostic
25+
26+
## Scope (tier 1 coverage first)
27+
28+
| Model | HF name | Purpose |
29+
|---|---|---|
30+
| Qwen3-0.6B | Qwen/Qwen3-0.6B | Small fast repro; catches Qwen3 family bugs |
31+
| Qwen3.5-4B | Qwen/Qwen3.5-4B | Hybrid (DeltaNet + self-attn) reference |
32+
| Llama-3.2-1B | meta-llama/Llama-3.2-1B | Standard transformer baseline |
33+
34+
Tier 2 architectures (Qwen3.6-35B MoE, DeepSeek) are NOT in the matrix — too
35+
large for 16 GB Mac FP32 HF run. These get llama.cpp diff instead (future).
36+
37+
## Files
38+
39+
- `hf_reference.py` — runs HF model, dumps per-layer hidden states + logits
40+
- `engine_reference.sh` — runs our engine with matching instrumentation
41+
- `diff_layers.py` — layer-by-layer comparison with thresholds
42+
- `run_matrix.sh` — executes full (model × prompt) matrix, reports PASS/FAIL
43+
- `matrix.json` — test matrix definition
44+
45+
## Usage
46+
47+
```bash
48+
# First-time setup (reuses tools/pillar1/venv by default — override with VENV_DIR=...)
49+
python3.12 -m venv tools/pillar1/venv
50+
source tools/pillar1/venv/bin/activate
51+
pip install torch transformers accelerate
52+
53+
# Run full matrix — invoke from project root
54+
bash tools/refparity/run_matrix.sh
55+
FILTER=qwen3 bash tools/refparity/run_matrix.sh # only entries whose name contains "qwen3"
56+
57+
# Focused single comparison (from project root)
58+
source tools/pillar1/venv/bin/activate
59+
python tools/refparity/hf_reference.py --model Qwen/Qwen3-0.6B --prompt "Hello" --out /tmp/ref.npz
60+
bash tools/refparity/engine_reference.sh models/Qwen3-0.6B-Q4_K_M.gguf "Hello" /tmp/eng
61+
python tools/refparity/diff_layers.py /tmp/ref.npz /tmp/eng
62+
```
63+
64+
Reports land in `tools/refparity/reports/` as `<name>__p<idx>.diff` (one per
65+
prompt) on failure.
66+
67+
## Exit codes
68+
69+
- `0` — all layers within threshold for all (model, prompt) pairs
70+
- `1` — divergence detected; diff report identifies the offending layer
71+
- `2` — environment or configuration error
72+
73+
## Methodology notes
74+
75+
- **Quantization noise baseline**: expect ~1-3% L2_rel per layer due to Q4/Q5
76+
quantization vs FP32 reference. Threshold set at 5% accordingly.
77+
- **Accumulation compounding**: quantization errors compound across 28-40 layers.
78+
By layer N-1, total divergence can be 20-30% even with correct engine.
79+
Per-layer threshold + cosine > 0.9 at logits is the PASS condition.
80+
- **First-divergence bisect**: if failing, the FIRST layer where L2_rel spikes
81+
above prior-layer baseline is the bug localization point.
82+
- **Position alignment**: the engine dumps `TQ_DUMP_POS=0` (first token) so
83+
`diff_layers.py` defaults to pos=0 too. Override with `--pos N` to inspect
84+
later positions (engine-side, set `TQ_DUMP_POS=N` in engine_reference.sh).
85+
- **HF hidden_states layout** (transformers ≥5.x, Qwen3/Llama): 29 entries for
86+
28-layer model — `(emb, layer0_out, layer1_out, …, layer_{N-2}_out, post_norm)`.
87+
The LAST element is already post-final-RMSNorm (hf_reference.py maps it to
88+
`post_norm`). The final layer's pre-norm output is not exposed by HF.
89+
90+
## Known baseline findings (Qwen3-0.6B Q4_K_M)
91+
92+
First end-to-end run on `Qwen3-0.6B-Q4_K_M.gguf`, "Hello" prompt:
93+
94+
| slot | L2_rel | cosine | notes |
95+
|---|---:|---:|---|
96+
| emb | 1.8% | 0.9998 | clean |
97+
| h0–h1 | 15-20% | 0.98 | marginal — Q4 noise amplified in early layers on a 0.6B model |
98+
| h2–h26 | ~3.9% | 0.9997 | steady Q4 quantization baseline |
99+
| post_norm | ~100% | 0.24 | **real divergence — needs investigation** |
100+
| logits || 0.51 | top-1 mismatch (HF 21806 vs engine 11) |
101+
102+
Framework correctly identifies the post_norm + logits divergence as a genuine
103+
engine bug (cannot be explained by Q4 quantization alone — mid-layer stays
104+
at 3.9%). This is tracked as a separate investigation; Phase 1's goal is only
105+
to ship the detection infrastructure.

tools/refparity/diff_layers.py

Lines changed: 154 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,154 @@
1+
#!/usr/bin/env python3
2+
"""Layer-by-layer diff: HF reference npz vs our engine's raw bin dumps.
3+
4+
Generalized from tools/pillar1/diff_layers.py. Produces a tabular report
5+
and exits 0 (PASS) / 1 (FAIL) based on thresholds.
6+
7+
Usage:
8+
python diff_layers.py ref.npz engine_dump/ \
9+
--threshold-l2-rel 0.05 \
10+
--threshold-cosine 0.90
11+
12+
Output (stdout):
13+
slot dim us_norm hf_norm max_abs L2_rel cosine [PASS|FAIL]
14+
emb ...
15+
h0 ...
16+
...
17+
→ PASS / FAIL — first divergence at layer X
18+
19+
Exit codes:
20+
0 — all layers within threshold
21+
1 — divergence detected; diff report identifies layer
22+
2 — environment / config error
23+
"""
24+
import argparse
25+
import os
26+
import sys
27+
28+
import numpy as np
29+
30+
31+
def read_bin(path: str) -> np.ndarray:
32+
return np.fromfile(path, dtype=np.float32)
33+
34+
35+
def compare(hf_vec: np.ndarray, us_vec: np.ndarray):
36+
diff = us_vec - hf_vec
37+
max_abs = float(np.max(np.abs(diff))) if diff.size else 0.0
38+
l2 = float(np.linalg.norm(diff))
39+
hf_norm = float(np.linalg.norm(hf_vec))
40+
us_norm = float(np.linalg.norm(us_vec))
41+
l2_rel = l2 / max(hf_norm, 1e-9)
42+
denom = max(us_norm * hf_norm, 1e-9)
43+
cosine = float(np.dot(us_vec, hf_vec) / denom)
44+
return us_norm, hf_norm, max_abs, l2_rel, cosine
45+
46+
47+
def main():
48+
ap = argparse.ArgumentParser()
49+
ap.add_argument("ref_npz", help="HF reference .npz")
50+
ap.add_argument("engine_dir", help="engine dump directory")
51+
ap.add_argument("--pos", type=int, default=None,
52+
help="position to compare in HF (default: 0 — matches "
53+
"engine's TQ_DUMP_POS=0 default)")
54+
ap.add_argument("--threshold-l2-rel", type=float, default=0.05,
55+
help="max L2_rel per hidden layer (default 0.05 = 5%%)")
56+
ap.add_argument("--threshold-cosine", type=float, default=0.90,
57+
help="min cosine similarity at logits (default 0.90)")
58+
args = ap.parse_args()
59+
60+
try:
61+
hf = np.load(args.ref_npz)
62+
except Exception as e:
63+
print(f"error: cannot load {args.ref_npz}: {e}", file=sys.stderr)
64+
return 2
65+
66+
# Engine's TQ_DUMP_HIDDEN default is pos=0 (first token); align by default
67+
seq_len = hf["h0"].shape[0] if hf["h0"].ndim == 2 else 1
68+
pos = 0 if args.pos is None else args.pos
69+
if pos >= seq_len:
70+
print(f"error: pos {pos} >= seq_len {seq_len}", file=sys.stderr)
71+
return 2
72+
73+
# Determine layer count from engine dumps
74+
engine_files = os.listdir(args.engine_dir)
75+
max_h = -1
76+
for f in engine_files:
77+
if f.startswith("h") and f.endswith(".bin"):
78+
try:
79+
max_h = max(max_h, int(f[1:-4]))
80+
except ValueError:
81+
pass
82+
slots = ["emb"] + [f"h{i}" for i in range(max_h + 1)]
83+
has_post_norm = os.path.exists(os.path.join(args.engine_dir, "post_norm.bin"))
84+
if has_post_norm:
85+
slots.append("post_norm")
86+
87+
print(f"{'slot':<12} {'dim':>6} {'us_norm':>10} {'hf_norm':>10} "
88+
f"{'max_abs':>10} {'L2_rel':>10} {'cosine':>8} status")
89+
print("-" * 85)
90+
91+
first_fail = None
92+
all_rows = []
93+
for slot in slots:
94+
bin_path = os.path.join(args.engine_dir, f"{slot}.bin")
95+
if not os.path.exists(bin_path):
96+
continue
97+
us = read_bin(bin_path)
98+
if slot == "post_norm":
99+
# HF npz doesn't usually have post_norm; skip if absent
100+
if "post_norm" not in hf.files:
101+
continue
102+
hf_arr = hf["post_norm"]
103+
hf_vec = hf_arr[pos] if hf_arr.ndim == 2 else hf_arr
104+
else:
105+
if slot not in hf.files:
106+
continue
107+
hf_arr = hf[slot]
108+
hf_vec = hf_arr[pos] if hf_arr.ndim == 2 else hf_arr
109+
110+
if us.shape != hf_vec.shape:
111+
print(f"{slot:<12} shape mismatch us={us.shape} hf={hf_vec.shape}")
112+
continue
113+
114+
us_norm, hf_norm, max_abs, l2_rel, cosine = compare(hf_vec, us)
115+
status = "PASS"
116+
if l2_rel > args.threshold_l2_rel:
117+
status = "FAIL"
118+
if first_fail is None:
119+
first_fail = slot
120+
121+
print(f"{slot:<12} {len(us):>6} {us_norm:>10.3f} {hf_norm:>10.3f} "
122+
f"{max_abs:>10.4f} {l2_rel:>10.4%} {cosine:>8.4f} {status}")
123+
all_rows.append((slot, status, l2_rel, cosine))
124+
125+
# Compare top-5 logits
126+
logits_path = os.path.join(args.engine_dir, "logits.bin")
127+
logits_pass = True
128+
if os.path.exists(logits_path) and "logits" in hf.files:
129+
us_l = read_bin(logits_path)
130+
hf_l = hf["logits"][pos] if hf["logits"].ndim == 2 else hf["logits"]
131+
if us_l.shape == hf_l.shape:
132+
top1_us = int(us_l.argmax())
133+
top1_hf = int(hf_l.argmax())
134+
cos_l = float(np.dot(us_l, hf_l) /
135+
max(np.linalg.norm(us_l) * np.linalg.norm(hf_l), 1e-9))
136+
print()
137+
print(f"logits cosine={cos_l:.4f} top1 hf={top1_hf} us={top1_us} "
138+
f"{'PASS' if (cos_l >= args.threshold_cosine and top1_us == top1_hf) else 'FAIL'}")
139+
if cos_l < args.threshold_cosine or top1_us != top1_hf:
140+
logits_pass = False
141+
if first_fail is None:
142+
first_fail = "logits"
143+
144+
print()
145+
if first_fail is None and logits_pass:
146+
print("→ PASS — all layers within threshold")
147+
return 0
148+
else:
149+
print(f"→ FAIL — first divergence at {first_fail}")
150+
return 1
151+
152+
153+
if __name__ == "__main__":
154+
sys.exit(main())
Lines changed: 47 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,47 @@
1+
#!/usr/bin/env bash
2+
# Run quant.cpp engine with TQ_DUMP_HIDDEN to capture per-layer hidden states
3+
# for reference-parity comparison.
4+
#
5+
# Usage:
6+
# ./engine_reference.sh <model.gguf> "<prompt>" <out_dir>
7+
#
8+
# Output: <out_dir>/emb.bin, h0.bin, h1.bin, ..., post_norm.bin, logits.bin
9+
# Each file is raw FP32 little-endian; shape derived from model config.
10+
11+
set -eu
12+
13+
GGUF="$1"
14+
PROMPT="$2"
15+
OUT_DIR="$3"
16+
17+
BIN="${BIN:-$(cd "$(dirname "$0")/../.." && pwd)/build/quant}"
18+
19+
if [[ ! -x "$BIN" ]]; then
20+
echo "error: $BIN not executable" >&2
21+
exit 2
22+
fi
23+
if [[ ! -f "$GGUF" ]]; then
24+
echo "error: model file not found: $GGUF" >&2
25+
exit 2
26+
fi
27+
28+
rm -rf "$OUT_DIR"
29+
mkdir -p "$OUT_DIR"
30+
31+
# Force per-token prefill so pos=0 dump captures the first token.
32+
# Use TQ_DUMP_POS=0 (default in tq_dump_hidden).
33+
TQ_NO_METAL=1 TQ_NO_MLOCK=1 TQ_NO_BATCH_PREFILL=1 \
34+
TQ_NO_AUTO_SERIAL=1 \
35+
TQ_DUMP_HIDDEN="$OUT_DIR" \
36+
"$BIN" "$GGUF" -p "$PROMPT" -n 1 -T 0 >"$OUT_DIR/engine.log" 2>&1 || {
37+
echo "error: engine run failed; see $OUT_DIR/engine.log" >&2
38+
exit 2
39+
}
40+
41+
# Verify dumps were produced
42+
if ! ls "$OUT_DIR"/h0.bin >/dev/null 2>&1; then
43+
echo "error: no h0.bin in $OUT_DIR — dump did not fire" >&2
44+
exit 2
45+
fi
46+
47+
echo "[refparity/engine] dumped to $OUT_DIR/ ($(ls "$OUT_DIR"/*.bin | wc -l | tr -d ' ') files)" >&2

0 commit comments

Comments
 (0)