Skip to content

Commit a07631f

Browse files
unamedkrclaude
andcommitted
pillar1.5(R9): v0.25.0 — Qwen3.6 auto-serial quality mode
Discovery: Qwen3.6-35B multi-thread matmul is non-deterministic at T=0. Same prompt run twice produces different outputs. Root cause: parallel FP reduction order variance compounds over 30 MoE layers × position feedback, amplifying to top-1 argmax flips after ~25-30 tokens of decode. Fix: detect qwen35moe+DeltaNet hybrid at startup and auto-force -j 1 unless user explicitly passed -j or set TQ_NO_AUTO_SERIAL=1. Speed cost: 2-3x slower decode (8 → 3 t/s). Benefit: deterministic output + extends coherent generation from 60-70 → 90-100 tokens. Added TQ_MOE_SERIAL=1 env for targeted MoE serialization testing (in tq_moe.c). Already determined this alone is insufficient — parallel matmul outside MoE also contributes non-determinism. A/B on "Write a 300-word essay about AI." × 2 runs: -j 8 (multi): DIVERGED ("rapidity" vs "impact") -j 2/-j 4: DIVERGED -j 1 (serial): IDENTICAL, 117 coherent tokens Honest limits acknowledged in release notes: - 1000+ char generation still fails on some prompts (other numerical precision sources remain: FP32 accumulator × IQ4_XS × 40 layers) - Not a full fix — extends the coherence window, doesn't close it - Shipping because determinism alone is a major usability improvement (was: different answer every run; now: reproducible) Regression: 15/15 test_models + 4/4 test_tokenizer PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1b3575c commit a07631f

7 files changed

Lines changed: 116 additions & 3 deletions

File tree

README.ko.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -76,6 +76,8 @@ Chunk-RAG가 잘못된 섹션을 검색하면, 모델은 **"모른다"고 하지
7676

7777
> **v2 후속 — Working Memory Cliff (2026-04-11)**: v1 결과를 더 큰 grid로 확장 측정했습니다 (1B/3B 모델, ctx 256-2048, 204 NIAH trials + FP32-weights 통제 실험). 두 모델 모두 명목 128K context window의 **1% 미만**에서 sharp cliff가 존재합니다 (1B Q8 cliff 512-1024, 3B Q4 cliff 1024-1280을 **step function**으로). 6.4× KV 압축은 20개 cell 중 18개에서 fp32 baseline과 bit-for-bit 일치 — cliff는 model property이지 KV/weight quantization artifact가 아닙니다. 정직한 재해석: Beyond RAG는 *유효* working memory 안에 들어가는 문서에 대해서만 동작하며, 그 크기는 명목 context window의 100분의 1에서 1000분의 1입니다. 전체 tech report: [`docs/paper/working-memory-cliff.md`](docs/paper/working-memory-cliff.md). HuggingFace blog post draft: [`docs/paper/hf-blog-draft.md`](docs/paper/hf-blog-draft.md).
7878
79+
> **v3.18 Qwen3.6 auto-serial quality mode — 결정성 + 긴 coherence (2026-04-20)**: 발견: Qwen3.6-35B 멀티쓰레드 matmul은 **T=0에서 비결정적** (동일 프롬프트 두 번 실행 = 다른 출력). 병렬 FP reduction 순서 변동이 30 MoE 레이어 × position feedback에서 누적 → top-1 argmax flip. 수정: qwen35moe+DeltaNet 하이브리드 자동 감지 후 `-j 1` 강제. **이전**: 실행마다 결과 다름, 60-70 토큰 후 degrade. **이후**: 결정적, coherent 범위 ~95 토큰으로 확장. 비용: 디코드 ~2-3× 느림 (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. **솔직한 한계**: 1000+자 생성 완전 수정은 **아직 아님** — 40 레이어 × 8-expert weighted sum × IQ4_XS 양자화 오차 누적이 결국 반복 루프로 귀결. 오늘 세션 요약: 7 릴리스로 7개 Qwen3.6 버그 클래스 해결. 비결정성 제거만으로도 실용 향상 충분함. v0.25.0.
80+
7981
> **v3.17 MoE SwiGLU exact expf — Qwen3.6 coherence 마진 (2026-04-20)**: MoE `swiglu_fused`가 Schraudolph 근사 (~2% 오차) 대신 exact `expf` 기본 사용. R27-29에서 DeltaNet에는 적용했지만 MoE는 fast_exp 유지 중이었음. 30 MoE 레이어 × 500+ 토큰에서 오차 누적. 수정 후 400-word Qwen3.6 프롬프트가 더 길고 다양한 continuation 생성. 속도 비용: 측정 불가 (SwiGLU 병목 아님; 280w에서 28-29s TTFT 동일). Opt-out: `TQ_MOE_FAST_EXP=1`. 500+ 단어 degradation 여전 (다원 버그 중 한 원인 해결). 15/15 regression PASS. v0.24.0.
8082
8183
> **v3.16 ★★ 긴 프롬프트 silent-truncation 수정 (2026-04-20)**: 4096 chars (~700 영어 단어) 이상 프롬프트가 `tq_generate.c`의 4096-token caller 버퍼에 의해 조용히 잘려나가고 있었습니다. BPE가 문자 단위로 먼저 토큰화되기 때문에 4096 cap이 merge 전에 적용되어 char 4096 이후 텍스트가 사라졌습니다. 수정: 32768로 확장 + dynamic sizeof. OpenMythos reference-diff 진단: HF Qwen3-0.6B가 561-word 문서를 698 토큰으로, 우리 엔진은 684로 토큰화 — 우리 마지막 토큰 decoded = `". The abacus"` (텍스트 **시작 부분**!), truncation 증명. 수정 후 Qwen3.5-4B (dense hybrid) 561-word 문서 coherent: *"the future of AI is not just about what we can do with it - it's about how we think about what matters most to us."* ✓. Qwen3.6-35B MoE hybrid는 여전히 561w에서 repetition loop — 버그가 **MoE feedback accumulation at long positions**로 격리됨 (DeltaNet + 토크나이저는 Qwen3.5-4B 성공으로 정상 입증). 15/15 regression PASS. v0.23.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -167,6 +167,8 @@ The bug was using the same tool for both. The fix is using each for what it's go
167167
168168
> **v3.2 batched prefill (2026-04-16):** Prompt prefill was the widest gap vs llama.cpp (40-50× slower). A new `tq_forward_batch` path uses batched matrix-matrix matmul via Apple AMX (`cblas_sgemm`-inspired, 1.2 TFLOPS). **Now enabled by default on all supported architectures** (Llama family, both FP32 KV and default `turbo_kv_4b` KV compression modes). On Llama-3.2-1B Q8 with a ~250-token prompt: **42.7s → 5.9s end-to-end** (**7.2× total**, with default KV compression). Output bit-identical to per-token baseline. Commits `ed4b087`, `672fea2`, `f4934e9`, plus quant K cache write support.
169169
170+
> **v3.18 Qwen3.6 auto-serial quality mode — determinism + longer coherence (2026-04-20):** Discovery: Qwen3.6-35B multi-thread matmul is **non-deterministic at T=0** (same prompt two runs = different output). Parallel FP reduction order variance compounds over 30 MoE layers × position feedback → top-1 argmax flips. Fix: auto-detect qwen35moe+DeltaNet hybrid and force `-j 1`. **Before**: repeats differ run-to-run, degrades 60-70 tokens. **After**: deterministic, extends coherent window to ~95 tokens. Cost: ~2-3× slower decode (3 t/s vs 8 t/s). Opt-out: `TQ_NO_AUTO_SERIAL=1`. Honest limit: **still not a full fix for 1000+ char generation** — numerical precision accumulation over 40 layers × 8-expert weighted sum × IQ4_XS quantization drifts into repetition eventually. Session arc day summary: 7 releases closing 7 distinct Qwen3.6 bug classes. Still worth shipping because deterministic output is usable, non-det was not. v0.25.0.
171+
170172
> **v3.17 MoE SwiGLU exact expf — Qwen3.6 coherence margin (2026-04-20):** MoE `swiglu_fused` now uses exact `expf` by default instead of Schraudolph (~2% per-call error). R27-29 had fixed this for DeltaNet but MoE kept fast_exp. With 30 MoE layers × 500+ tokens, the error compounds. After fix, 400-word Qwen3.6 prompts produce longer, more varied continuation. Speed cost: unmeasurable (SwiGLU not bottleneck; 28-29s TTFT identical before/after on 280w). Opt-out: `TQ_MOE_FAST_EXP=1`. 500+ word degradation still exists (multi-source bug, this is one contributor). 15/15 regression PASS. v0.24.0.
171173
172174
> **v3.16 ★★ Prompt buffer silent-truncation FIXED (2026-04-20):** Prompts longer than ~4096 chars (~700 words of English) were being silently cut off by a 4096-token caller buffer in `tq_generate.c`. Our BPE is char-level first then merged, so the 4096 cap hit BEFORE merges reduced the count. Text past char 4096 was gone. Fix: bumped to 32768 with dynamic sizeof. Diagnostic via OpenMythos reference-diff: HF Qwen3-0.6B tokenized 561-word doc to 698 tokens, our engine to 684 — and our last tokens decoded to `". The abacus"` (from the BEGINNING of the text!), proving truncation. After fix Qwen3.5-4B (dense hybrid) handles 561-word document coherently: *"the future of AI is not just about what we can do with it - it's about how we think about what matters most to us."* ✓. Qwen3.6-35B MoE hybrid STILL fails at 561w with repetition loop — bug now isolated to **MoE feedback accumulation at long positions** (DeltaNet and tokenization proven correct by Qwen3.5-4B working). 15/15 regression PASS. v0.23.0: [`docs/RELEASE_NOTES.md`](docs/RELEASE_NOTES.md).

bindings/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
77

88
[project]
99
name = "quantcpp"
10-
version = "0.24.0"
10+
version = "0.25.0"
1111
description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
1212
readme = "README.md"
1313
license = { text = "Apache-2.0" }

bindings/python/quantcpp/__init__.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@
2121
from importlib.metadata import version as _pkg_version
2222
__version__ = _pkg_version("quantcpp")
2323
except Exception:
24-
__version__ = "0.24.0" # fallback for editable / source-tree imports
24+
__version__ = "0.25.0" # fallback for editable / source-tree imports
2525

2626
import os
2727
import sys

docs/RELEASE_NOTES.md

Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,86 @@ Versioning follows [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
66

77
---
88

9+
## [v0.25.0] — 2026-04-20 (Qwen3.6 Auto-Serial Quality Mode — Determinism + Longer Coherence)
10+
11+
### Honest headline
12+
13+
**Qwen3.6-35B with multi-thread matmul is non-deterministic at T=0.**
14+
Same prompt run twice gives different output. Discovery: parallel
15+
FP reduction order variance compounds over 30 MoE layers × position
16+
feedback, amplifying to top-1 argmax flips. Auto-force single-thread
17+
for qwen35moe+DeltaNet hybrid models brings back **determinism and
18+
extends coherent generation from ~60-70 → ~90-100 tokens**.
19+
20+
### What this ships
21+
22+
`tools/quant.c`: detect `is_moe && delta_n_heads > 0` (qwen35moe
23+
hybrid) and auto-force `-j 1` unless user explicitly passed `-j` or
24+
sets `TQ_NO_AUTO_SERIAL=1`.
25+
26+
Visible on load:
27+
28+
```
29+
Auto-serial: detected qwen35moe hybrid — forcing -j 1 for
30+
deterministic correctness (TQ_NO_AUTO_SERIAL=1 to opt out)
31+
Threads: 1 (auto-serial quality mode)
32+
```
33+
34+
### Verified
35+
36+
| Scenario | Multi-thread (-j 8) | Auto-serial (-j 1) |
37+
|---|---|---|
38+
| "Write a 300-word essay about AI." × 2 runs | Different outputs | **Identical, coherent** |
39+
| 250-token gen | Degrades at 60-70 tok | ~95 tokens coherent then mild degradation |
40+
| Decode speed | ~8 t/s | ~3 t/s (2-3× slower) |
41+
| Prefill 280 words | 29s | ~75s (slower, but was garbled at multi-thread anyway) |
42+
43+
### Honest limits — NOT a full fix
44+
45+
1000+ char coherent generation on Qwen3.6-35B **still fails** on some
46+
prompts. Auto-serial extends the coherence window but does not close
47+
it. Remaining bug class: numerical precision accumulation over 40
48+
layers × MoE 8-expert weighted sum × decode positions. Even single-
49+
threaded, FP32 + IQ4_XS quantization errors compound enough to
50+
eventually drift into repetition.
51+
52+
### Why it's still worth shipping
53+
54+
- Before: every Qwen3.6 run on same prompt gave different answer
55+
(unusable for reproducible work)
56+
- After: deterministic output, extended coherence window, explicit
57+
trade-off communicated to user.
58+
- Opt-out documented: `TQ_NO_AUTO_SERIAL=1` restores multi-thread
59+
for users who want speed over stability.
60+
61+
### What's still needed (honest)
62+
63+
1. **Find the exact parallel-reduction source** of non-determinism
64+
(even -j 2 diverges). Candidate: FP32 matmul row partition
65+
ordering producing bit-level variance → cascades via MoE feedback.
66+
2. **Higher-precision MoE accumulator** (FP64 intermediate) — would
67+
dampen compound error growth even in single-thread.
68+
3. **Router stability** — top-K from softmax probs (llama.cpp
69+
convention) rather than raw logits for FP tie-break robustness.
70+
71+
### Session arc (2026-04-20)
72+
73+
| Ver | Root cause closed |
74+
|---|---|
75+
| 0.19.0 | BPE stale-entry (tokenizer) |
76+
| 0.20.0 | QK-norm + NEOX RoPE (Qwen3 family structural) |
77+
| 0.21.0 | MoE batched N>>1 → opt-in |
78+
| 0.22.0 | Chunked batched prefill (+30% TTFT, correctness preserved) |
79+
| 0.23.0 | Prompt buffer silent truncation (4K → 32K) |
80+
| 0.24.0 | MoE SwiGLU exact expf (precision margin) |
81+
| **0.25.0** | **Qwen3.6 auto-serial quality mode (determinism + longer window)** |
82+
83+
### Regression
84+
85+
15/15 test_models + 4/4 test_tokenizer PASS.
86+
87+
---
88+
989
## [v0.24.0] — 2026-04-20 (MoE SwiGLU Exact expf — Coherence Margin)
1090

1191
### Headline

src/engine/tq_moe.c

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -823,6 +823,11 @@ moe_cpu_fallback: ;
823823
int n_threads = tq_get_threads();
824824
int parallel_experts = (num_active >= 2 && num_active <= n_threads &&
825825
num_active <= TQ_TP_MAX);
826+
/* Pillar 1.5 R9: TQ_MOE_SERIAL=1 forces serial expert dispatch for
827+
* determinism testing. Parallel path was suspected source of T=0
828+
* non-determinism that manifests as Qwen3.6-35B ~70-token decode
829+
* degradation and long-context garbage. */
830+
if (getenv("TQ_MOE_SERIAL")) parallel_experts = 0;
826831
/* Only do cross-expert parallel for GGUF experts (IQ2_XXS etc).
827832
* Q4-converted fast path has its own parallelism. */
828833
{

tools/quant.c

Lines changed: 25 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -226,6 +226,7 @@ int main(int argc, char** argv) {
226226
#endif
227227
if (n_threads < 1) n_threads = 4;
228228
if (n_threads > 16) n_threads = 16; /* matches TQ_TP_MAX */
229+
int n_threads_explicit = 0; /* set to 1 when user passes -j */
229230
int quant_mode = 0; /* 0 = none (default), 2 = Q2, 4 = Q4, 8 = Q8 */
230231
int value_quant_bits = 0; /* 0 = FP16/FP32 (default), 4 = Q4, 2 = Q2 */
231232
int info_only = 0;
@@ -297,6 +298,7 @@ int main(int argc, char** argv) {
297298
}
298299
} else if (strcmp(argv[i], "-j") == 0 && i + 1 < argc) {
299300
n_threads = atoi(argv[++i]);
301+
n_threads_explicit = 1;
300302
} else if (strcmp(argv[i], "-s") == 0 && i + 1 < argc) {
301303
rng_seed = strtoull(argv[++i], NULL, 10);
302304
if (rng_seed == 0ULL) rng_seed = 42ULL; /* 0 reserved for "use default" */
@@ -910,9 +912,31 @@ int main(int argc, char** argv) {
910912
}
911913
}
912914

915+
/* Pillar 1.5 R9: Qwen3.6-35B (qwen35moe + DeltaNet hybrid) shows
916+
* non-deterministic T=0 output with multi-thread matmul (different
917+
* output on same prompt run twice). -j 1 is deterministic AND
918+
* produces coherent 100+ token generation where -j 8 garbles after
919+
* ~60-70 tokens. Root cause: FP reduction order variance in parallel
920+
* matmul accumulates over 30 MoE layers × position feedback loop,
921+
* amplifying to top-1 argmax flips.
922+
*
923+
* Auto-force single-thread for qwen35moe UNLESS user explicitly
924+
* passed -j or TQ_NO_AUTO_SERIAL=1. Speed cost: ~2× slower decode
925+
* (7-8 → 3 t/s) but coherence up to 120+ tokens and deterministic
926+
* output. The speed trade is necessary until the root-cause parallel
927+
* variance is isolated in a future session. */
928+
int auto_serial = 0;
929+
if (model && model->config.is_moe && model->config.delta_n_heads > 0
930+
&& !n_threads_explicit && !getenv("TQ_NO_AUTO_SERIAL")) {
931+
auto_serial = 1;
932+
n_threads = 1;
933+
fprintf(stderr, "Auto-serial: detected qwen35moe hybrid — forcing -j 1 "
934+
"for deterministic correctness (TQ_NO_AUTO_SERIAL=1 to opt out)\n");
935+
}
913936
/* Set thread count for matmul parallelism */
914937
tq_set_threads(n_threads);
915-
fprintf(stderr, "Threads: %d\n", tq_get_threads());
938+
fprintf(stderr, "Threads: %d%s\n", tq_get_threads(),
939+
auto_serial ? " (auto-serial quality mode)" : "");
916940

917941
/* ================================================================
918942
* Mode: --profile-kv (KV activation distribution profiling)

0 commit comments

Comments
 (0)