
fix(chat-cache): comprehensive audit — 7 hidden bugs eliminated#52

Merged
unamedkr merged 1 commit into main from fix/chat-cache-audit
Apr 12, 2026

Conversation

@unamedkr
Collaborator

After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs.

Bugs found (7 total)

| # | Bug | Severity |
|---|-----|----------|
| 1 | Slow-path fallback corrupted KV state — old KV at [n_new..prefix_pos) was never cleared | 🔴 P0 |
| 2 | wasm_reset_chat partial cleanup (g_output_pos / g_stream_count not reset) | 🟡 P1 |
| 3 | wasm_generate (sync path) missed g_stream_count reset | 🟡 P1 |
| 4 | _quant.h wheel header stale — next pip build would miss PR #51 | 🔴 P0 |
| 5 | No overflow surface in WASM — silent failure on long histories | 🟡 P1 |
| 6 | No overflow surface in server — HTTP 200 with garbage instead of 413 | 🟡 P1 |
| 7 | tq_generate_continue cached_text drift not documented | 🟢 P2 |

Fix strategy

Bug 1 (the worst): instead of falling back to tq_generate_continue with corrupted state, return -2 from tq_generate_chat_text. Callers handle it:

  • WASM: auto-reset chat + status message
  • Server: HTTP 413 + OpenAI-compatible error JSON + session reset

Bug 2/3: align reset/init paths for g_output_pos / g_stream_count.

Bug 4: copy quant.h → bindings/python/quantcpp/_quant.h (verified 3 occurrences of tq_generate_chat_text).

Bug 5/6: see Bug 1.

Bug 7: documented in tq_generate_continue header comment.

Audited but safe (no fix needed)

  • Server session concurrency: get_or_create_session is called inside inference_mutex
  • json_extract_string: respects buf_size - 1 bound
  • WASM g_output overflow: tokens dropped from local buffer but js_on_token still fires (JS side gets all output)
  • Stale KV at positions > pos: attention only reads [0..pos], so unread positions are harmless

Verified end-to-end (no regressions)

```
alice/bob interleaved 5 turns each (real assistant replay):
alice: 339 → 514 ms (~50 ms/turn from O(n) attention, expected)
bob: 310 → 518 ms
All turns hit the FAST text-prefix path after turn 1.
```

🤖 Generated with Claude Code

After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer
system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0]
   tq_generate_chat_text's overflow fallback called tq_generate_continue
   on the SAME state that already had old KV at positions [0..prefix_pos).
   New prefill would write [0..n_new) leaving stale [n_new..prefix_pos)
   that subsequent generation might read. Replaced with -2 return code:
   the caller decides (server returns HTTP 413, WASM auto-resets the
   chat and shows a status message).

2. **WASM reset_chat partial cleanup** [P1]
   wasm_reset_chat called quant_chat(NULL) but did not reset
   g_output_pos / g_output[0] / g_stream_count, so the next generation
   would append to stale text from the previous chat. Now resets all.

3. **wasm_generate (sync path) missed g_stream_count reset** [P1]
   The async path zeroed it, the sync path did not. Aligned both.

4. **Wheel header _quant.h stale** [P0]
   bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip
   build would have used quant.h from before PR #51 (no
   tq_generate_chat_text). Synced to current quant.h.

5. **Overflow surface — WASM** [P1]
   Added n == -2 detection in wasm_generate / wasm_generate_async.
   Auto-reset chat and call js_on_status with a clear error message
   so the JS side can show "Context full — chat reset".

6. **Overflow surface — server** [P1]
   Added gen_rc == -2 detection in both streaming and non-streaming
   handlers. Server resets the session's KV state + cached_text + tokens
   and returns HTTP 413 with an OpenAI-compatible error JSON.

7. **tq_generate_continue cached_text drift documentation** [P2]
   Added a header comment explaining that tq_generate_continue is the
   lower-level API and doesn't track cached_text. Higher-level callers
   must use tq_generate_chat_text for cached_text safety.

## Audited but safe

- Server session concurrency: get_or_create_session is called inside
  inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects buf_size - 1 bound.
- WASM g_output overflow: tokens dropped from local buffer but
  js_on_token still fires, so JS side gets all output. Acceptable.

## Verified end-to-end

  alice/bob interleaved 5 turns each (real assistant replay):
    alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
    bob:   310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr force-pushed the fix/chat-cache-audit branch from 7b7c316 to da825bf, April 12, 2026 00:50
unamedkr merged commit 05b7e11 into main, Apr 12, 2026
3 checks passed
unamedkr deleted the fix/chat-cache-audit branch, April 12, 2026 00:50
unamedkr added a commit that referenced this pull request Apr 12, 2026
Follow-up to PR #52. A fresh code-reading audit found another batch of
hidden bugs in the chat KV cache path. None had visible symptoms in the
happy path; all were latent failure modes that would surface under load,
on long histories, or on memory pressure.

Bugs fixed:

CRITICAL
- B1: tq_generate_continue's sliding-window truncation silently desynced
  cached_text. cached_text claimed the FULL prompt was committed, but
  cached_tokens only held the truncated tail — next turn's text-prefix
  match mapped text bytes to the wrong KV positions. Fix: continue now
  returns -2 on overflow instead of truncating.
- B2: cached_text was updated even when generation returned an error,
  leaving the cache claiming committed bytes that weren't.
- B3: chat_accum_callback realloc failure silently dropped tokens AND
  skipped the user's stream callback — broken UX + corrupted cached_text.
  Fix: always pass tokens to user_cb; mark accumulator tainted on
  realloc failure; skip cached_text update if tainted.
- B4: server's get_or_create_session didn't NULL-check tq_create_state_ex.
  An OOM made the next call dereference a NULL kv_state.
- B5: CLI cmd_run interactive loop ignored quant_chat return code, so
  context overflow produced an infinite stream of empty replies. Fix:
  catch ChatContextOverflow, drop oldest turn, retry.

HIGH
- B6: server streaming path only branched on rc == -2; rc == -1 produced
  HTTP 200 with finish_reason "stop" and no error info. Now sends an
  error delta + finish_reason "error".
- B7: server reused an existing session even when the request changed
  kv_type / value_quant_bits — old quantized blocks would be
  misinterpreted. Now detects the change and rebuilds state.
- B8: WASM wasm_load_model didn't reset g_generating. After a
  page-reload mid-stream, every subsequent generate call early-returned
  -1 forever.
- B9: rep_penalty was silently ignored in tq_generate_chat_text's FAST
  path (slow path applied it). Now mirrors the slow path.
- B10: Python Model.chat() ignored the C return value; -2 / -1 surfaced
  as empty token streams. Now raises ChatContextOverflow / RuntimeError.

MEDIUM
- Removed dead `update_cache:` label.
- Synced bindings/python/quant.h (sdist staging snapshot).

Verification:
- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean
- wasm/build.sh → quant.wasm rebuilt clean (320K)
- Python imports + cli --help work

quant.h and src/engine/tq_generate.c are kept in lockstep (every
chat-cache change applied to both).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>