R26 had added eps=1e-6 to l2_normalize but used the formulation
`1/sqrt(ss + eps)`. llama.cpp's ggml_compute_forward_l2_norm_f32
uses `1/max(sqrt(ss), eps)`. The two agree for typical inputs
(scale ~1) but differ by three orders of magnitude for near-zero K/Q:
ours gives 1e3, the reference 1e6. Over 30 DeltaNet layers × decode
positions, this systematic under-scaling compounds into the
decode-length degradation chased across Pillars 1, 1.5, and 30+
Mission C rounds. Fixed in src/engine/tq_transformer.c:l2_normalize
(both NEON and scalar paths); now bit-equivalent to ggml.
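For illustration, a minimal scalar sketch of the two formulations
(function names are hypothetical, not the actual tq_transformer.c
symbols, and the real l2_normalize also has a NEON path):

```c
#include <math.h>

/* R26 behaviour: eps added inside the sqrt.
 * Near-zero x with eps=1e-6 gives scale ~ 1/sqrt(1e-6) = 1e3. */
static void l2_normalize_r26(float *x, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss + eps);
    for (int i = 0; i < n; i++) x[i] *= scale;
}

/* ggml-matching behaviour: eps is a floor on the norm itself.
 * Near-zero x with eps=1e-6 gives scale ~ 1/1e-6 = 1e6. */
static void l2_normalize_ggml(float *x, int n, float eps) {
    float ss = 0.0f;
    for (int i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / fmaxf(sqrtf(ss), eps);
    for (int i = 0; i < n; i++) x[i] *= scale;
}
```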
A/B on Qwen3.6-35B IQ4_XS auto-serial, prompt "Write a 300-word essay
about AI." with -n 300:
v0.25.0: 117 tokens, ~45 coherent, then a "the new normal" loop
v0.26.0: 160 tokens, ~110 coherent before mild drift
→ +36% total length, +144% coherent content, more varied output
Discovered via direct diff against:
refs/llama.cpp/ggml/src/ggml-cpu/ops.cpp::ggml_compute_forward_l2_norm_f32
Honest status: not yet "1000+ char coherent generation"; output still
drifts after ~110 tokens on some prompts. But this fix compounds with
the prior ones (v0.19-0.25) as another layer of root-cause closure.
Regression: 15/15 test_models + 4/4 test_tokenizer PASS.
Methodology note: R26's "needs eps" diagnosis was correct but the
formulation was paraphrased, not copied. Always ship the EXACT
reference implementation first, optimize later.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.ko.md: +2 lines changed (2 additions, 0 deletions)
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model, rather than saying **"I don't know"**,
> **v2 follow-up — Working Memory Cliff (2026-04-11)**: The v1 results were re-measured on a larger grid (1B/3B models, ctx 256-2048, 204 NIAH trials plus an FP32-weights control). Both models show a sharp cliff at **less than 1%** of their nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, as a **step function**). The 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells, so the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit inside *effective* working memory, which is one hundredth to one thousandth of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.19 ★ DeltaNet L2-norm formulation matches ggml — Qwen3.6 coherence +36% (2026-04-21)**: R26's "eps fix" was the right diagnosis but the **wrong formulation**. We used `1/sqrt(ss + eps)`, while llama.cpp's `ggml_l2_norm` uses `1/max(sqrt(ss), eps)`: a **3-orders-of-magnitude difference** for near-zero inputs (1e3 vs 1e6). Over 30 DeltaNet layers × positions, the K/Q under-scaling accumulates into decode-length degradation. Fix: match ggml exactly. Measured on Qwen3.6-35B IQ4_XS auto-serial "Write a 300-word essay": **117 → 160 tokens** (+36%), coherent content 45 → 110 tokens. Found by directly diffing `refs/llama.cpp/ggml/src/ggml-cpu/ops.cpp::ggml_compute_forward_l2_norm_f32` against our `l2_normalize`. 15/15 regression PASS. v0.26.0.
> **v3.18 Qwen3.6 auto-serial quality mode — determinism + longer coherence (2026-04-20)**: Discovery: Qwen3.6-35B multi-threaded matmul is **non-deterministic at T=0** (running the same prompt twice gives different output). Variations in parallel FP reduction order accumulate over 30 MoE layers × position feedback → top-1 argmax flips. Fix: auto-detect the qwen35moe+DeltaNet hybrid and force `-j 1`. **Before**: results differ run to run, degrading after 60-70 tokens. **After**: deterministic, coherent range extended to ~95 tokens. Cost: decode ~2-3× slower (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. **Honest limitation**: this is **not yet** a full fix for 1000+ character generation; quantization error accumulated over 40 layers × 8-expert weighted sums × IQ4_XS still ends in a repetition loop. Session summary for the day: 7 releases resolving 7 Qwen3.6 bug classes. Removing non-determinism alone is a worthwhile practical improvement. v0.25.0.
> **v3.17 MoE SwiGLU exact expf — Qwen3.6 coherence margin (2026-04-20)**: The MoE `swiglu_fused` now defaults to exact `expf` instead of the Schraudolph approximation (~2% error). R27-29 applied this to DeltaNet but MoE still used fast_exp. Over 30 MoE layers × 500+ tokens the error accumulates. After the fix, 400-word Qwen3.6 prompts produce longer, more varied continuations. Speed cost: unmeasurable (SwiGLU is not the bottleneck; TTFT identical at 28-29s on 280w). Opt-out: `TQ_MOE_FAST_EXP=1`. The 500+ word degradation remains (this resolves one contributor to a multi-source bug). 15/15 regression PASS. v0.24.0.
README.md: +2 lines changed (2 additions, 0 deletions)
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
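For intuition, a rough sketch of the batched idea using Accelerate's `cblas_sgemm` (illustrative only; the actual `tq_forward_batch` path is described above as `cblas_sgemm`-inspired rather than a direct BLAS call): project all T prompt positions through a weight matrix in one matrix-matrix call instead of T separate mat-vecs.

```c
/* Illustrative sketch, not the project's tq_forward_batch: batch the
 * prompt's T activation rows into a single GEMM. */
#include <Accelerate/Accelerate.h>  /* cblas_sgemm, AMX-backed on Apple silicon */

/* x: [T x d_in] prompt activations (row-major)
 * w: [d_out x d_in] weights, y: [T x d_out] outputs */
static void prefill_project(const float *x, const float *w, float *y,
                            int T, int d_in, int d_out) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                T, d_out, d_in,
                1.0f, x, d_in,
                w, d_in,
                0.0f, y, d_out);
}
```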
> **v3.19 ★ DeltaNet L2-norm formulation matches ggml — Qwen3.6 +36% coherence (2026-04-21):** R26's "eps fix" had the right diagnosis but wrong formulation. We used `1/sqrt(ss + eps)` but llama.cpp's `ggml_l2_norm` uses `1/max(sqrt(ss), eps)` — for near-zero inputs these differ by **3 orders of magnitude** (1e3 vs 1e6). Over 30 DeltaNet layers × position, systematic K/Q under-scaling compounds into decode-length degradation. Fix: match ggml exactly. Measured on Qwen3.6-35B IQ4_XS auto-serial "Write a 300-word essay": **117 → 160 tokens** (+36%), coherent content 45 → 110 tokens before drift. Discovered via direct diff of `refs/llama.cpp/ggml/src/ggml-cpu/ops.cpp::ggml_compute_forward_l2_norm_f32` against our `l2_normalize`. 15/15 regression PASS. v0.26.0.
> **v3.18 Qwen3.6 auto-serial quality mode — determinism + longer coherence (2026-04-20):** Discovery: Qwen3.6-35B multi-threaded matmul is **non-deterministic at T=0** (the same prompt run twice produces different output). Parallel FP reduction order variance compounds over 30 MoE layers × position feedback → top-1 argmax flips. Fix: auto-detect the qwen35moe+DeltaNet hybrid and force `-j 1`. **Before**: repeats differ run-to-run, degrading after 60-70 tokens. **After**: deterministic, coherent window extended to ~95 tokens. Cost: ~2-3× slower decode (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. Honest limit: **still not a full fix for 1000+ char generation**; numerical precision accumulation over 40 layers × 8-expert weighted sums × IQ4_XS quantization drifts into repetition eventually. Session arc for the day: 7 releases closing 7 distinct Qwen3.6 bug classes. Still worth shipping because deterministic output is usable and non-deterministic output was not. v0.25.0.
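A toy, standalone illustration (not project code) of the underlying numerical fact: float addition is not associative, so a different parallel reduction order can shift a logit enough to flip a T=0 argmax between near-tied tokens.

```c
/* Toy demo: the same three floats summed in two orders give different
 * results, because float addition is not associative. */
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    float left  = (a + b) + c;  /* exact cancellation first -> 1.0 */
    float right = a + (b + c);  /* c is absorbed into b first -> 0.0 */
    printf("left=%g right=%g\n", left, right);
    return 0;
}
```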
> **v3.17 MoE SwiGLU exact expf — Qwen3.6 coherence margin (2026-04-20):** MoE `swiglu_fused` now uses exact `expf` by default instead of Schraudolph (~2% per-call error). R27-29 had fixed this for DeltaNet but MoE kept fast_exp. With 30 MoE layers × 500+ tokens, the error compounds. After fix, 400-word Qwen3.6 prompts produce longer, more varied continuation. Speed cost: unmeasurable (SwiGLU not bottleneck; 28-29s TTFT identical before/after on 280w). Opt-out: `TQ_MOE_FAST_EXP=1`. 500+ word degradation still exists (multi-source bug, this is one contributor). 15/15 regression PASS. v0.24.0.
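A minimal scalar sketch of the SwiGLU step with exact `expf` (names and layout are illustrative, not the engine's actual `swiglu_fused` signature); a Schraudolph-style fast exp would replace `expf` with a bit-trick approximation carrying roughly 2% per-call error, which is what compounds over 30 MoE layers.

```c
/* Illustrative scalar SwiGLU: silu(gate) * up, with exact expf. */
#include <math.h>

static inline float swiglu_scalar(float gate, float up) {
    const float silu = gate / (1.0f + expf(-gate));  /* exact sigmoid */
    return silu * up;
}

/* Applied element-wise over gate/up buffers (assumed layout). */
static void swiglu_rows(const float *gate, const float *up, float *out, int n) {
    for (int i = 0; i < n; i++) out[i] = swiglu_scalar(gate[i], up[i]);
}
```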