quantumaikr
diff --git a/‎.claude/state.md‎
Lines changed: 44 additions & 2 deletions b/‎.claude/state.md‎
Lines changed: 44 additions & 2 deletions
diff --git a/‎tools/refparity/.gitignore‎
Lines changed: 3 additions & 0 deletions b/‎tools/refparity/.gitignore‎
Lines changed: 3 additions & 0 deletions
diff --git a/‎tools/refparity/README.md‎
Lines changed: 105 additions & 0 deletions b/‎tools/refparity/README.md‎
Lines changed: 105 additions & 0 deletions
diff --git a/‎tools/refparity/diff_layers.py‎
Lines changed: 154 additions & 0 deletions b/‎tools/refparity/diff_layers.py‎
Lines changed: 154 additions & 0 deletions
diff --git a/‎tools/refparity/engine_reference.sh‎
Lines changed: 47 additions & 0 deletions b/‎tools/refparity/engine_reference.sh‎
Lines changed: 47 additions & 0 deletions
@@ -1,7 +1,49 @@
 # quant.cpp — Session State
 
-**Last updated**: 2026-04-20 (Pillar 1.5 R3 ★★)
-**Session HEAD**: NEOX RoPE root-cause FIXED — Qwen3 family long-prompt coherence restored.
+**Last updated**: 2026-04-21 (Phase 1 refparity ★)
+**Session HEAD**: Reference-parity framework (tools/refparity/) LANDED — HF vs engine per-layer diff, pos-aligned, post_norm-aware.
+
+## ★ Phase 1 — Reference-parity framework (2026-04-21) ★
+
+### Delivered
+
+`tools/refparity/`: HF transformers FP32 ground truth vs our engine, per-layer
+cosine + L2_rel diff. Replaces ad-hoc "compare one layer by hand" debugging
+that consumed R26-R50 of Mission C.
+
+- `hf_reference.py` — HF dump → `.npz` (emb, h0..h_{N-2}, post_norm, logits)
+- `engine_reference.sh` — `TQ_DUMP_HIDDEN` wrapper → `.bin` per slot
+- `diff_layers.py` — per-slot cosine+L2_rel, pos=0 default, PASS/FAIL+first-diverge
+- `run_matrix.sh` — (model × prompt) sweep, `FILTER=` env, reports/per-slot .diff
+- `matrix.json` — 3 models × 2-3 prompts (Qwen3-0.6B, Qwen3.5-4B, Llama-3.2-1B)
+- `README.md` — methodology notes + known baseline findings
+
+### Two subtle mapping bugs caught while building
+
+1. **Position alignment**: engine dumps `TQ_DUMP_POS=0` (first token). Original
+   `diff_layers.py` defaulted to HF's last position → compared different tokens
+   on multi-token prompts, producing fake ~125% divergence.
+2. **HF `post_norm` aliasing**: transformers 5.x exposes 29 hidden_states for
+   28 layers — last entry is already post-RMSNorm. Original `hf_reference.py`
+   labeled it `h27` → compared it vs our engine's pre-norm last layer output.
+
+Both fixed. Baseline now: emb PASS (1.8%), mid-layers PASS (3-4% Q4 noise),
+post_norm FAIL (100% — real engine bug, separate investigation).
+
+### Follow-up findings (for later rounds)
+
+- Qwen3-0.6B Q4_K_M post_norm L2_rel ≈ 100%, logits cosine 0.51, top-1
+  mismatch (HF 21806 vs engine 11). Cannot be Q4 noise (mid-layers stay ≈4%).
+  Needs investigation — likely output_norm tensor load or final-layer output.
+- `h0`/`h1` sit at 15-20% L2_rel on Qwen3-0.6B. Small 0.6B models are
+  known to amplify Q4 quant noise in early layers; above 5% threshold but
+  cosine still 0.98. Tier test for ≥1B models expected to be cleaner.
+
+### Why this matters (strategic)
+
+Fixed 8 paraphrase bugs in v0.19.0→v0.26.0 one by one. This framework
+catches the class, not individual instances. One time investment now
+prevents Mission C style 30-round hunts forever.
 
 ## ★★★ Pillar 1.5 R3 — NEOX-ordering RoPE for Qwen3 family (2026-04-20) ★★★
 
 
@@ -0,0 +1,3 @@
+reports/
+*.npz
+venv/
@@ -0,0 +1,105 @@
+# Reference-Parity Framework
+
+## Why
+
+Across 8 releases (v0.19.0 → v0.26.0) we repeatedly found the same class of bug:
+**paraphrased reference implementation**. Each fix was correct individually but
+cumulatively revealed that our engine drifted from the reference (llama.cpp /
+HF transformers) in several subtle ways — different eps formulation, missing
+NEOX ordering in batched path, over-broad QK-norm disable, silent prompt
+truncation, ...
+
+**The meta-bug: we read reference code and wrote "something similar" rather
+than copying it exactly.** Over 30+ rounds of Mission C this cost us real time.
+
+This framework prevents that class of bug by automating the comparison.
+
+## What it does
+
+For each (model, prompt) pair in the test matrix:
+1. Run the prompt through HF transformers (FP32 ground truth)
+2. Run the same prompt through our engine with `TQ_DUMP_HIDDEN` enabled
+3. Compare per-layer hidden states: cosine similarity + L2 relative error
+4. Report first layer / position where divergence exceeds threshold
+5. CI PASS if no layer exceeds 5% L2_rel; FAIL otherwise with clear diagnostic
+
+## Scope (tier 1 coverage first)
+
+| Model | HF name | Purpose |
+|---|---|---|
+| Qwen3-0.6B | Qwen/Qwen3-0.6B | Small fast repro; catches Qwen3 family bugs |
+| Qwen3.5-4B | Qwen/Qwen3.5-4B | Hybrid (DeltaNet + self-attn) reference |
+| Llama-3.2-1B | meta-llama/Llama-3.2-1B | Standard transformer baseline |
+
+Tier 2 architectures (Qwen3.6-35B MoE, DeepSeek) are NOT in the matrix — too
+large for 16 GB Mac FP32 HF run. These get llama.cpp diff instead (future).
+
+## Files
+
+- `hf_reference.py` — runs HF model, dumps per-layer hidden states + logits
+- `engine_reference.sh` — runs our engine with matching instrumentation
+- `diff_layers.py` — layer-by-layer comparison with thresholds
+- `run_matrix.sh` — executes full (model × prompt) matrix, reports PASS/FAIL
+- `matrix.json` — test matrix definition
+
+## Usage
+
+```bash
+# First-time setup (reuses tools/pillar1/venv by default — override with VENV_DIR=...)
+python3.12 -m venv tools/pillar1/venv
+source tools/pillar1/venv/bin/activate
+pip install torch transformers accelerate
+
+# Run full matrix — invoke from project root
+bash tools/refparity/run_matrix.sh
+FILTER=qwen3 bash tools/refparity/run_matrix.sh     # only entries whose name contains "qwen3"
+
+# Focused single comparison (from project root)
+source tools/pillar1/venv/bin/activate
+python tools/refparity/hf_reference.py --model Qwen/Qwen3-0.6B --prompt "Hello" --out /tmp/ref.npz
+bash tools/refparity/engine_reference.sh models/Qwen3-0.6B-Q4_K_M.gguf "Hello" /tmp/eng
+python tools/refparity/diff_layers.py /tmp/ref.npz /tmp/eng
+```
+
+Reports land in `tools/refparity/reports/` as `<name>__p<idx>.diff` (one per
+prompt) on failure.
+
+## Exit codes
+
+- `0` — all layers within threshold for all (model, prompt) pairs
+- `1` — divergence detected; diff report identifies the offending layer
+- `2` — environment or configuration error
+
+## Methodology notes
+
+- **Quantization noise baseline**: expect ~1-3% L2_rel per layer due to Q4/Q5
+  quantization vs FP32 reference. Threshold set at 5% accordingly.
+- **Accumulation compounding**: quantization errors compound across 28-40 layers.
+  By layer N-1, total divergence can be 20-30% even with correct engine.
+  Per-layer threshold + cosine > 0.9 at logits is the PASS condition.
+- **First-divergence bisect**: if failing, the FIRST layer where L2_rel spikes
+  above prior-layer baseline is the bug localization point.
+- **Position alignment**: the engine dumps `TQ_DUMP_POS=0` (first token) so
+  `diff_layers.py` defaults to pos=0 too. Override with `--pos N` to inspect
+  later positions (engine-side, set `TQ_DUMP_POS=N` in engine_reference.sh).
+- **HF hidden_states layout** (transformers ≥5.x, Qwen3/Llama): 29 entries for
+  28-layer model — `(emb, layer0_out, layer1_out, …, layer_{N-2}_out, post_norm)`.
+  The LAST element is already post-final-RMSNorm (hf_reference.py maps it to
+  `post_norm`). The final layer's pre-norm output is not exposed by HF.
+
+## Known baseline findings (Qwen3-0.6B Q4_K_M)
+
+First end-to-end run on `Qwen3-0.6B-Q4_K_M.gguf`, "Hello" prompt:
+
+| slot | L2_rel | cosine | notes |
+|---|---:|---:|---|
+| emb | 1.8% | 0.9998 | clean |
+| h0–h1 | 15-20% | 0.98 | marginal — Q4 noise amplified in early layers on a 0.6B model |
+| h2–h26 | ~3.9% | 0.9997 | steady Q4 quantization baseline |
+| post_norm | ~100% | 0.24 | **real divergence — needs investigation** |
+| logits | — | 0.51 | top-1 mismatch (HF 21806 vs engine 11) |
+
+Framework correctly identifies the post_norm + logits divergence as a genuine
+engine bug (cannot be explained by Q4 quantization alone — mid-layer stays
+at 3.9%). This is tracked as a separate investigation; Phase 1's goal is only
+to ship the detection infrastructure.
@@ -0,0 +1,154 @@
+#!/usr/bin/env python3
+"""Layer-by-layer diff: HF reference npz vs our engine's raw bin dumps.
+
+Generalized from tools/pillar1/diff_layers.py. Produces a tabular report
+and exits 0 (PASS) / 1 (FAIL) based on thresholds.
+
+Usage:
+    python diff_layers.py ref.npz engine_dump/ \
+        --threshold-l2-rel 0.05 \
+        --threshold-cosine 0.90
+
+Output (stdout):
+    slot   dim   us_norm   hf_norm   max_abs   L2_rel   cosine  [PASS|FAIL]
+    emb    ...
+    h0     ...
+    ...
+    → PASS / FAIL — first divergence at layer X
+
+Exit codes:
+    0 — all layers within threshold
+    1 — divergence detected; diff report identifies layer
+    2 — environment / config error
+"""
+import argparse
+import os
+import sys
+
+import numpy as np
+
+
+def read_bin(path: str) -> np.ndarray:
+    return np.fromfile(path, dtype=np.float32)
+
+
+def compare(hf_vec: np.ndarray, us_vec: np.ndarray):
+    diff = us_vec - hf_vec
+    max_abs = float(np.max(np.abs(diff))) if diff.size else 0.0
+    l2 = float(np.linalg.norm(diff))
+    hf_norm = float(np.linalg.norm(hf_vec))
+    us_norm = float(np.linalg.norm(us_vec))
+    l2_rel = l2 / max(hf_norm, 1e-9)
+    denom = max(us_norm * hf_norm, 1e-9)
+    cosine = float(np.dot(us_vec, hf_vec) / denom)
+    return us_norm, hf_norm, max_abs, l2_rel, cosine
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("ref_npz", help="HF reference .npz")
+    ap.add_argument("engine_dir", help="engine dump directory")
+    ap.add_argument("--pos", type=int, default=None,
+                    help="position to compare in HF (default: 0 — matches "
+                         "engine's TQ_DUMP_POS=0 default)")
+    ap.add_argument("--threshold-l2-rel", type=float, default=0.05,
+                    help="max L2_rel per hidden layer (default 0.05 = 5%%)")
+    ap.add_argument("--threshold-cosine", type=float, default=0.90,
+                    help="min cosine similarity at logits (default 0.90)")
+    args = ap.parse_args()
+
+    try:
+        hf = np.load(args.ref_npz)
+    except Exception as e:
+        print(f"error: cannot load {args.ref_npz}: {e}", file=sys.stderr)
+        return 2
+
+    # Engine's TQ_DUMP_HIDDEN default is pos=0 (first token); align by default
+    seq_len = hf["h0"].shape[0] if hf["h0"].ndim == 2 else 1
+    pos = 0 if args.pos is None else args.pos
+    if pos >= seq_len:
+        print(f"error: pos {pos} >= seq_len {seq_len}", file=sys.stderr)
+        return 2
+
+    # Determine layer count from engine dumps
+    engine_files = os.listdir(args.engine_dir)
+    max_h = -1
+    for f in engine_files:
+        if f.startswith("h") and f.endswith(".bin"):
+            try:
+                max_h = max(max_h, int(f[1:-4]))
+            except ValueError:
+                pass
+    slots = ["emb"] + [f"h{i}" for i in range(max_h + 1)]
+    has_post_norm = os.path.exists(os.path.join(args.engine_dir, "post_norm.bin"))
+    if has_post_norm:
+        slots.append("post_norm")
+
+    print(f"{'slot':<12} {'dim':>6} {'us_norm':>10} {'hf_norm':>10} "
+          f"{'max_abs':>10} {'L2_rel':>10} {'cosine':>8}  status")
+    print("-" * 85)
+
+    first_fail = None
+    all_rows = []
+    for slot in slots:
+        bin_path = os.path.join(args.engine_dir, f"{slot}.bin")
+        if not os.path.exists(bin_path):
+            continue
+        us = read_bin(bin_path)
+        if slot == "post_norm":
+            # HF npz doesn't usually have post_norm; skip if absent
+            if "post_norm" not in hf.files:
+                continue
+            hf_arr = hf["post_norm"]
+            hf_vec = hf_arr[pos] if hf_arr.ndim == 2 else hf_arr
+        else:
+            if slot not in hf.files:
+                continue
+            hf_arr = hf[slot]
+            hf_vec = hf_arr[pos] if hf_arr.ndim == 2 else hf_arr
+
+        if us.shape != hf_vec.shape:
+            print(f"{slot:<12} shape mismatch us={us.shape} hf={hf_vec.shape}")
+            continue
+
+        us_norm, hf_norm, max_abs, l2_rel, cosine = compare(hf_vec, us)
+        status = "PASS"
+        if l2_rel > args.threshold_l2_rel:
+            status = "FAIL"
+            if first_fail is None:
+                first_fail = slot
+
+        print(f"{slot:<12} {len(us):>6} {us_norm:>10.3f} {hf_norm:>10.3f} "
+              f"{max_abs:>10.4f} {l2_rel:>10.4%} {cosine:>8.4f}  {status}")
+        all_rows.append((slot, status, l2_rel, cosine))
+
+    # Compare top-5 logits
+    logits_path = os.path.join(args.engine_dir, "logits.bin")
+    logits_pass = True
+    if os.path.exists(logits_path) and "logits" in hf.files:
+        us_l = read_bin(logits_path)
+        hf_l = hf["logits"][pos] if hf["logits"].ndim == 2 else hf["logits"]
+        if us_l.shape == hf_l.shape:
+            top1_us = int(us_l.argmax())
+            top1_hf = int(hf_l.argmax())
+            cos_l = float(np.dot(us_l, hf_l) /
+                          max(np.linalg.norm(us_l) * np.linalg.norm(hf_l), 1e-9))
+            print()
+            print(f"logits cosine={cos_l:.4f}  top1 hf={top1_hf} us={top1_us}  "
+                  f"{'PASS' if (cos_l >= args.threshold_cosine and top1_us == top1_hf) else 'FAIL'}")
+            if cos_l < args.threshold_cosine or top1_us != top1_hf:
+                logits_pass = False
+                if first_fail is None:
+                    first_fail = "logits"
+
+    print()
+    if first_fail is None and logits_pass:
+        print("→ PASS — all layers within threshold")
+        return 0
+    else:
+        print(f"→ FAIL — first divergence at {first_fail}")
+        return 1
+
+
+if __name__ == "__main__":
+    sys.exit(main())
@@ -0,0 +1,47 @@
+#!/usr/bin/env bash
+# Run quant.cpp engine with TQ_DUMP_HIDDEN to capture per-layer hidden states
+# for reference-parity comparison.
+#
+# Usage:
+#   ./engine_reference.sh <model.gguf> "<prompt>" <out_dir>
+#
+# Output: <out_dir>/emb.bin, h0.bin, h1.bin, ..., post_norm.bin, logits.bin
+# Each file is raw FP32 little-endian; shape derived from model config.
+
+set -eu
+
+GGUF="$1"
+PROMPT="$2"
+OUT_DIR="$3"
+
+BIN="${BIN:-$(cd "$(dirname "$0")/../.." && pwd)/build/quant}"
+
+if [[ ! -x "$BIN" ]]; then
+    echo "error: $BIN not executable" >&2
+    exit 2
+fi
+if [[ ! -f "$GGUF" ]]; then
+    echo "error: model file not found: $GGUF" >&2
+    exit 2
+fi
+
+rm -rf "$OUT_DIR"
+mkdir -p "$OUT_DIR"
+
+# Force per-token prefill so pos=0 dump captures the first token.
+# Use TQ_DUMP_POS=0 (default in tq_dump_hidden).
+TQ_NO_METAL=1 TQ_NO_MLOCK=1 TQ_NO_BATCH_PREFILL=1 \
+TQ_NO_AUTO_SERIAL=1 \
+TQ_DUMP_HIDDEN="$OUT_DIR" \
+"$BIN" "$GGUF" -p "$PROMPT" -n 1 -T 0 >"$OUT_DIR/engine.log" 2>&1 || {
+    echo "error: engine run failed; see $OUT_DIR/engine.log" >&2
+    exit 2
+}
+
+# Verify dumps were produced
+if ! ls "$OUT_DIR"/h0.bin >/dev/null 2>&1; then
+    echo "error: no h0.bin in $OUT_DIR — dump did not fire" >&2
+    exit 2
+fi
+
+echo "[refparity/engine] dumped to $OUT_DIR/ ($(ls "$OUT_DIR"/*.bin | wc -l | tr -d ' ') files)" >&2