Discovery: Qwen3.6-35B multi-thread matmul is non-deterministic at
T=0. Same prompt run twice produces different outputs. Root cause:
parallel FP reduction order variance compounds over 30 MoE layers ×
position feedback, amplifying to top-1 argmax flips after ~25-30
tokens of decode.
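To make the root cause concrete, here is a minimal standalone C sketch (illustration only, not engine code) showing that FP32 addition is not associative: the same values summed serially and summed in per-thread chunks usually differ in the low bits, which is exactly the variance that a T=0 argmax can flip on.

```c
/* fp_reduce_demo.c - illustration only, not part of the engine.
 * A serial sum and a chunked "thread-pool style" reduction of the
 * same FP32 values usually differ in the last bits; over many layers
 * and positions that is enough to flip a top-1 argmax at T=0. */
#include <stdio.h>
#include <stdlib.h>

static float serial_sum(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* Same values, reduced per-chunk first (the order a worker thread
 * would use), then the partial sums are combined. */
static float chunked_sum(const float *x, int n, int nchunk) {
    float s = 0.0f;
    int step = n / nchunk;
    for (int c = 0; c < nchunk; c++) {
        float partial = 0.0f;
        for (int i = c * step; i < (c + 1) * step; i++) partial += x[i];
        s += partial;
    }
    return s;
}

int main(void) {
    enum { N = 4096 };
    float x[N];
    srand(42);
    for (int i = 0; i < N; i++)
        x[i] = (float)rand() / RAND_MAX - 0.5f;   /* mixed-sign values */

    float a = serial_sum(x, N);
    float b = chunked_sum(x, N, 8);               /* 8 "threads" */
    printf("serial  = %.9g\nchunked = %.9g\ndiff    = %.3g\n", a, b, a - b);
    return 0;
}
```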
Fix: detect qwen35moe+DeltaNet hybrid at startup and auto-force
-j 1 unless user explicitly passed -j or set TQ_NO_AUTO_SERIAL=1.
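A sketch of what that startup gate can look like. The struct and function names below are illustrative, not the engine's real API; only the `-j` flag, the qwen35moe+DeltaNet architecture, and `TQ_NO_AUTO_SERIAL` come from this note.

```c
/* Sketch of the auto-serial policy (names are illustrative): if the
 * loaded model is the qwen35moe+DeltaNet hybrid, force single-threaded
 * decode unless the user passed -j explicitly or opted out via
 * TQ_NO_AUTO_SERIAL=1. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct run_opts {
    int n_threads;        /* value of -j */
    int j_was_explicit;   /* user passed -j on the command line */
};

static void maybe_force_serial(const char *arch, struct run_opts *o) {
    const char *optout = getenv("TQ_NO_AUTO_SERIAL");
    if (optout && strcmp(optout, "1") == 0) return;
    if (o->j_was_explicit) return;
    if (strcmp(arch, "qwen35moe") == 0) {   /* DeltaNet hybrid */
        o->n_threads = 1;
        fprintf(stderr,
                "note: %s detected, forcing -j 1 for deterministic decode "
                "(set TQ_NO_AUTO_SERIAL=1 to disable)\n", arch);
    }
}
```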
Speed cost: 2-3x slower decode (8 → 3 t/s). Benefit: deterministic
output + extends coherent generation from 60-70 → 90-100 tokens.
Added TQ_MOE_SERIAL=1 env for targeted MoE serialization testing
(in tq_moe.c). Already determined this alone is insufficient —
parallel matmul outside MoE also contributes non-determinism.
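A minimal sketch of the targeted switch, assuming the MoE forward pass picks its thread count in one place; only the `TQ_MOE_SERIAL` name and the tq_moe.c location are from this note.

```c
/* Hypothetical gate inside the MoE forward pass (tq_moe.c): honor
 * TQ_MOE_SERIAL=1 by running the per-expert matmuls on one thread
 * while the rest of the engine stays parallel. */
#include <stdlib.h>
#include <string.h>

static int moe_thread_count(int requested_threads) {
    const char *s = getenv("TQ_MOE_SERIAL");
    if (s && strcmp(s, "1") == 0) return 1;   /* serialize MoE only */
    return requested_threads;
}
```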
A/B on "Write a 300-word essay about AI." × 2 runs:
-j 8 (multi): DIVERGED ("rapidity" vs "impact")
-j 2/-j 4: DIVERGED
-j 1 (serial): IDENTICAL, 117 coherent tokens
Honest limits acknowledged in release notes:
- 1000+ char generation still fails on some prompts (other numerical
precision sources remain: FP32 accumulator × IQ4_XS × 40 layers)
- Not a full fix — extends the coherence window, doesn't close it
- Shipping because determinism alone is a major usability improvement
(was: different answer every run; now: reproducible)
Regression: 15/15 test_models + 4/4 test_tokenizer PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.ko.md: 2 additions & 0 deletions
@@ -76,6 +76,8 @@ When Chunk-RAG retrieves the wrong section, the model does **not say "I don't know"
> **v2 follow-up — Working Memory Cliff (2026-04-11)**: We re-measured the v1 results on a larger grid (1B/3B models, ctx 256-2048, 204 NIAH trials plus an FP32-weights control). Both models hit a sharp cliff below **1%** of the nominal 128K context window (1B Q8 cliff at 512-1024, 3B Q4 cliff at 1024-1280, both as a **step function**). 6.4× KV compression matches the fp32 baseline bit-for-bit in 18 of 20 cells — the cliff is a model property, not a KV/weight quantization artifact. Honest reinterpretation: Beyond RAG only works for documents that fit inside *effective* working memory, which is 1/100th to 1/1000th of the nominal context window. Full tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
> **v3.18 Qwen3.6 auto-serial quality mode — determinism + longer coherence (2026-04-20)**: Discovery: Qwen3.6-35B multi-threaded matmul is **non-deterministic at T=0** (same prompt run twice = different output). Variance in parallel FP reduction order compounds over 30 MoE layers × position feedback → top-1 argmax flips. Fix: auto-detect the qwen35moe+DeltaNet hybrid and force `-j 1`. **Before**: different result every run, degrades after 60-70 tokens. **After**: deterministic, coherent range extended to ~95 tokens. Cost: ~2-3× slower decode (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. **Honest limit**: still **not** a full fix for 1000+ character generation — accumulated quantization error over 40 layers × 8-expert weighted sum × IQ4_XS eventually drifts into a repetition loop. Today's session summary: 7 releases closing 7 Qwen3.6 bug classes. Removing non-determinism alone is a big enough practical improvement to ship. v0.25.0.
> **v3.17 MoE SwiGLU exact expf — Qwen3.6 coherence margin (2026-04-20)**: MoE `swiglu_fused` now defaults to exact `expf` instead of the Schraudolph approximation (~2% error). R27-29 had applied this to DeltaNet, but MoE still used fast_exp. Over 30 MoE layers × 500+ tokens the error compounds. After the fix, 400-word Qwen3.6 prompts produce longer, more varied continuations. Speed cost: unmeasurable (SwiGLU is not the bottleneck; identical 28-29s TTFT at 280w). Opt-out: `TQ_MOE_FAST_EXP=1`. 500+ word degradation remains (this fixes one cause of a multi-source bug). 15/15 regression PASS. v0.24.0.
> **v3.16 ★★ Long-prompt silent-truncation FIXED (2026-04-20)**: Prompts longer than 4096 chars (~700 English words) were being silently cut off by the 4096-token caller buffer in `tq_generate.c`. Because our BPE tokenizes char-by-char first, the 4096 cap applied before merges, and text past char 4096 disappeared. Fix: bumped to 32768 plus dynamic sizeof. Diagnosis via OpenMythos reference-diff: HF Qwen3-0.6B tokenized a 561-word document to 698 tokens, our engine to 684 — and our last token decoded to `". The abacus"` (from the **beginning** of the text!), proving truncation. After the fix, Qwen3.5-4B (dense hybrid) handles the 561-word document coherently: *"the future of AI is not just about what we can do with it - it's about how we think about what matters most to us."* ✓. Qwen3.6-35B MoE hybrid still hits a repetition loop at 561w — the bug is now isolated to **MoE feedback accumulation at long positions** (DeltaNet and the tokenizer are proven correct by Qwen3.5-4B succeeding). 15/15 regression PASS. v0.23.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
README.md: 2 additions & 0 deletions
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
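The core idea of the batch path is to push the whole prompt through each weight matrix as one matrix-matrix product instead of one matrix-vector product per token. A minimal sketch using Accelerate's `cblas_sgemm` (the row-major layout and the function below are illustrative, not the actual `tq_forward_batch` signature):

```c
/* Sketch of batched prefill: instead of n_tokens separate
 * matrix-vector products (one per prompt token), do a single
 * [n_tokens x d_in] * [d_out x d_in]^T GEMM so the whole prompt
 * passes through the weight matrix in one call.
 * Compile on macOS with: -framework Accelerate */
#include <Accelerate/Accelerate.h>

void dense_forward_batch(const float *x,   /* [n_tokens][d_in], row-major  */
                         const float *w,   /* [d_out][d_in], row-major     */
                         float *y,         /* [n_tokens][d_out], row-major */
                         int n_tokens, int d_in, int d_out) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                n_tokens, d_out, d_in,
                1.0f, x, d_in,
                      w, d_in,
                0.0f, y, d_out);
}
```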
> **v3.18 Qwen3.6 auto-serial quality mode — determinism + longer coherence (2026-04-20):** Discovery: Qwen3.6-35B multi-thread matmul is **non-deterministic at T=0** (same prompt, two runs = different output). Variance in parallel FP reduction order compounds over 30 MoE layers × position feedback → top-1 argmax flips. Fix: auto-detect the qwen35moe+DeltaNet hybrid and force `-j 1`. **Before**: results differ run-to-run, degrade after 60-70 tokens. **After**: deterministic, coherent window extended to ~95 tokens. Cost: ~2-3× slower decode (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. Honest limit: **still not a full fix for 1000+ char generation** — numerical precision accumulation over 40 layers × 8-expert weighted sum × IQ4_XS quantization eventually drifts into repetition. Today's session arc: 7 releases closing 7 distinct Qwen3.6 bug classes. Still worth shipping because deterministic output is usable and non-deterministic output was not. v0.25.0.
> **v3.17 MoE SwiGLU exact expf — Qwen3.6 coherence margin (2026-04-20):** MoE `swiglu_fused` now uses exact `expf` by default instead of Schraudolph (~2% per-call error). R27-29 had fixed this for DeltaNet but MoE kept fast_exp. With 30 MoE layers × 500+ tokens, the error compounds. After fix, 400-word Qwen3.6 prompts produce longer, more varied continuation. Speed cost: unmeasurable (SwiGLU not bottleneck; 28-29s TTFT identical before/after on 280w). Opt-out: `TQ_MOE_FAST_EXP=1`. 500+ word degradation still exists (multi-source bug, this is one contributor). 15/15 regression PASS. v0.24.0.
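For context, the Schraudolph trick approximates exp() by constructing a float's bit pattern directly from a scaled input, trading a few percent of relative error for speed. Below is a sketch of the two paths and the opt-out switch; `swiglu_fused` and `TQ_MOE_FAST_EXP` are from the note above, while the helper names and constants are illustrative.

```c
/* Sketch of the two exp() paths in the SwiGLU gate. fast_expf is a
 * Schraudolph-style bit trick (a few percent relative error per call,
 * valid only for moderate |x|); defaulting to libm expf removes that
 * error, which otherwise compounds over 30 MoE layers and hundreds of
 * tokens. Helper names and constants are illustrative. */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static inline float fast_expf(float x) {               /* Schraudolph-style */
    union { float f; int32_t i; } v;
    v.i = (int32_t)(12102203.0f * x + 1064866805.0f);  /* 2^23/ln2, biased  */
    return v.f;
}

static inline float silu(float x, int use_fast_exp) {  /* x * sigmoid(x) */
    float e = use_fast_exp ? fast_expf(-x) : expf(-x);
    return x / (1.0f + e);
}

/* gate/up halves of the expert FFN: out[i] = silu(gate[i]) * up[i] */
static void swiglu_fused(const float *gate, const float *up,
                         float *out, int n) {
    const char *s = getenv("TQ_MOE_FAST_EXP");
    int fast = (s && strcmp(s, "1") == 0);              /* opt back in */
    for (int i = 0; i < n; i++)
        out[i] = silu(gate[i], fast) * up[i];
}
```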
> **v3.16 ★★ Prompt buffer silent-truncation FIXED (2026-04-20):** Prompts longer than ~4096 chars (~700 English words) were being silently cut off by a 4096-token caller buffer in `tq_generate.c`. Our BPE tokenizes char-level first and then merges, so the 4096 cap was hit BEFORE merges reduced the count, and text past char 4096 was gone. Fix: bumped to 32768 with dynamic sizeof. Diagnostic via OpenMythos reference-diff: HF Qwen3-0.6B tokenized the 561-word doc to 698 tokens, our engine to 684 — and our last tokens decoded to `". The abacus"` (from the BEGINNING of the text!), proving truncation. After the fix, Qwen3.5-4B (dense hybrid) handles the 561-word document coherently: *"the future of AI is not just about what we can do with it - it's about how we think about what matters most to us."* ✓. Qwen3.6-35B MoE hybrid STILL fails at 561w with a repetition loop — the bug is now isolated to **MoE feedback accumulation at long positions** (DeltaNet and tokenization proven correct by Qwen3.5-4B working). 15/15 regression PASS. v0.23.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).
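The shape of the fix, sketched with a stand-in tokenizer (the 32768 size and the dynamic sizeof are from the note; everything else below is illustrative, not the actual tq_generate.c code):

```c
/* Sketch of the truncation fix: the caller-side token buffer is now
 * 32768 entries and its capacity is taken from sizeof(), so the limit
 * cannot silently drift from the array size again. The tokenizer here
 * is a stand-in for char-level BPE pre-tokenization (one token per
 * char BEFORE merges), not the engine's real tokenizer. */
#include <stdio.h>
#include <string.h>

#define TQ_MAX_PROMPT_TOKENS 32768                    /* was 4096 */

/* stand-in tokenizer: returns -1 instead of silently truncating */
static int tokenize(const char *prompt, int *out, size_t cap) {
    size_t n = strlen(prompt);                        /* char-level count */
    if (n > cap) return -1;
    for (size_t i = 0; i < n; i++) out[i] = (unsigned char)prompt[i];
    return (int)n;
}

int main(void) {
    static int tokens[TQ_MAX_PROMPT_TOKENS];
    /* dynamic sizeof: capacity is derived from the buffer itself */
    int n = tokenize("example prompt", tokens,
                     sizeof tokens / sizeof tokens[0]);
    printf("%d prompt tokens (capacity %zu)\n",
           n, sizeof tokens / sizeof tokens[0]);
    return 0;
}
```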