fix(chat-cache): comprehensive audit — 7 hidden bugs eliminated (#52)
Merged
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0]
   tq_generate_chat_text's overflow fallback called tq_generate_continue on the SAME state that already had old KV at positions [0..prefix_pos). New prefill would write [0..n_new), leaving stale entries at [n_new..prefix_pos) that subsequent generation might read. Replaced with a -2 return code: the caller decides (the server returns HTTP 413; WASM auto-resets the chat and shows a status message).
2. **WASM reset_chat partial cleanup** [P1]
   wasm_reset_chat called quant_chat(NULL) but did not reset g_output_pos / g_output[0] / g_stream_count, so the next generation would append to stale text from the previous chat. Now resets all of them.
3. **wasm_generate (sync path) missed g_stream_count reset** [P1]
   The async path zeroed it; the sync path did not. Aligned both.
4. **Stale wheel header _quant.h** [P0]
   bindings/python/quantcpp/_quant.h is .gitignore'd, and the next pip build would have used a quant.h from before PR #51 (no tq_generate_chat_text). Synced to the current quant.h.
5. **Overflow surface: WASM** [P1]
   Added n == -2 detection in wasm_generate / wasm_generate_async. Auto-resets the chat and calls js_on_status with a clear error message so the JS side can show "Context full — chat reset".
6. **Overflow surface: server** [P1]
   Added gen_rc == -2 detection in both the streaming and non-streaming handlers. The server resets the session's KV state + cached_text + tokens and returns HTTP 413 with an OpenAI-compatible error JSON.
7. **tq_generate_continue cached_text drift undocumented** [P2]
   Added a header comment explaining that tq_generate_continue is the lower-level API and does not track cached_text. Higher-level callers must use tq_generate_chat_text for cached_text safety.
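Bug 1's failure mode can be illustrated with a toy KV buffer. This is a hypothetical sketch, not the engine's actual data structures: prefilling a shorter new prompt in place overwrites only the front of the committed cache and leaves a stale tail behind.

```python
# Toy illustration of Bug 1 (hypothetical structures, not the real engine).
# The cache holds entries for positions [0..prefix_pos) from the previous turn.
kv = [f"old_k{i}" for i in range(8)]
prefix_pos = len(kv)  # 8 positions committed

# The slow-path fallback prefilled a NEW, shorter prompt in place:
n_new = 5
for i in range(n_new):
    kv[i] = f"new_k{i}"

# Positions [n_new..prefix_pos) still hold last turn's entries; generation
# that attends over [0..prefix_pos) would mix new and stale keys.
stale = kv[n_new:prefix_pos]
print(stale)  # → ['old_k5', 'old_k6', 'old_k7']
```

This is why the fix returns -2 instead of reusing the state: there is no safe way to partially overwrite a longer committed prefix.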
## Audited but safe

- Server session concurrency: get_or_create_session is called inside inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects the buf_size - 1 bound.
- WASM g_output overflow: tokens are dropped from the local buffer but js_on_token still fires, so the JS side gets all output. Acceptable.

## Verified end-to-end

alice/bob interleaved, 5 turns each (real assistant replay):

- alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
- bob: 310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr added a commit that referenced this pull request on Apr 12, 2026:
Follow-up to PR #52. A fresh code-reading audit found another batch of hidden bugs in the chat KV cache path. None had visible symptoms in the happy path; all were latent failure modes that would surface under load, on long histories, or under memory pressure.

Bugs fixed:

CRITICAL
- B1: tq_generate_continue's sliding-window truncation silently desynced cached_text. cached_text claimed the FULL prompt was committed, but cached_tokens only held the truncated tail — the next turn's text-prefix match mapped text bytes to the wrong KV positions. Fix: continue now returns -2 on overflow instead of truncating.
- B2: cached_text was updated even when generation returned an error, leaving the cache claiming committed bytes that weren't.
- B3: chat_accum_callback realloc failure silently dropped tokens AND skipped the user's stream callback — broken UX plus corrupted cached_text. Fix: always pass tokens to user_cb; mark the accumulator tainted on realloc failure; skip the cached_text update if tainted.
- B4: the server's get_or_create_session didn't NULL-check tq_create_state_ex. An OOM made the next call dereference a NULL kv_state.
- B5: the CLI cmd_run interactive loop ignored quant_chat's return code, so context overflow produced an infinite stream of empty replies. Fix: catch ChatContextOverflow, drop the oldest turn, retry.

HIGH
- B6: the server streaming path only branched on rc == -2; rc == -1 produced HTTP 200 with finish_reason "stop" and no error info. Now sends an error delta + finish_reason "error".
- B7: the server reused an existing session even when the request changed kv_type / value_quant_bits — old quantized blocks would be misinterpreted. Now detects the change and rebuilds the state.
- B8: WASM wasm_load_model didn't reset g_generating. After a page reload mid-stream, every subsequent generate call early-returned -1 forever.
- B9: rep_penalty was silently ignored in tq_generate_chat_text's FAST path (the slow path applied it). Now mirrors the slow path.
- B10: Python Model.chat() ignored the C return value; -2 / -1 surfaced as empty token streams. Now raises ChatContextOverflow / RuntimeError.

MEDIUM
- Removed the dead `update_cache:` label.
- Synced bindings/python/quant.h (sdist staging snapshot).

Verification:
- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean
- wasm/build.sh → quant.wasm rebuilt clean (320K)
- Python imports + cli --help work

quant.h and src/engine/tq_generate.c are kept in lockstep (every chat-cache change applied to both).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
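The B5 fix (catch ChatContextOverflow, drop the oldest turn, retry) can be sketched as follows. `ChatContextOverflow` is the exception named above; the `model.chat(history)` signature, the flat turn-list history format, and `chat_with_retry` itself are illustrative assumptions:

```python
class ChatContextOverflow(RuntimeError):
    """Raised when the chat history no longer fits the KV window."""


def chat_with_retry(model, history, max_drops=8):
    """Retry model.chat(), dropping the oldest user/assistant pair on overflow.

    `model.chat(history)` returning the reply text is an assumed interface,
    as is the history being a flat list alternating user/assistant turns.
    """
    for _ in range(max_drops + 1):
        try:
            return model.chat(history)
        except ChatContextOverflow:
            if len(history) < 2:
                raise  # nothing left to drop
            del history[:2]  # drop the oldest user+assistant pair in place
    raise ChatContextOverflow("history still too long after dropping turns")
```

The in-place `del history[:2]` mirrors the CLI behavior described in B5: the caller's history shrinks so later turns start from the trimmed context.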
After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs.

## Bugs found (7 total)

## Fix strategy
- Bug 1 (the worst): instead of falling back to tq_generate_continue with corrupted state, return -2 from tq_generate_chat_text. Callers handle it (the server returns HTTP 413; WASM auto-resets the chat and shows a status message).
- Bug 2/3: align the reset/init paths for g_output_pos / g_stream_count.
- Bug 4: copy quant.h → bindings/python/quantcpp/_quant.h (verified 3 occurrences of tq_generate_chat_text).
- Bug 5/6: see Bug 1.
- Bug 7: documented in the tq_generate_continue header comment.
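The server-side half of the -2 handling might look like the sketch below. The payload follows the OpenAI-style error object the server returns, but the field values, `handle_generate_rc`, and the `reset_session` callback are illustrative assumptions:

```python
import json


def handle_generate_rc(rc, reset_session):
    """Map tq_generate_chat_text's return code to an (HTTP status, body) pair.

    rc >= 0 means tokens were generated; rc == -2 means context overflow.
    reset_session() clearing the KV state, cached_text, and tokens is assumed.
    """
    if rc >= 0:
        return 200, None
    if rc == -2:
        reset_session()  # session is unusable until reset
        body = {
            "error": {
                "message": "Context window exceeded; session was reset.",
                "type": "invalid_request_error",
                "code": "context_length_exceeded",
            }
        }
        return 413, json.dumps(body)
    # Any other negative rc is a generic generation failure.
    return 500, json.dumps(
        {"error": {"message": "generation failed", "type": "server_error"}}
    )
```

Returning a structured error body (rather than a bare status) is what keeps OpenAI-compatible clients able to surface the failure reason.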
## Audited but safe (no fix needed)

## Verified end-to-end (no regressions)
```
alice/bob interleaved 5 turns each (real assistant replay):
alice: 339 → 514 ms (~50 ms/turn from O(n) attention, expected)
bob: 310 → 518 ms
All turns hit the FAST text-prefix path after turn 1.
```
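The FAST text-prefix path hit in every turn above boils down to reusing KV entries for the longest prefix shared between the committed cached_text and the newly rendered prompt. A conceptual sketch (character-level; the engine's actual logic maps matched text bytes to token positions, and the template markup here is made up):

```python
import os


def reusable_prefix_len(cached_text: str, new_prompt: str) -> int:
    """Length of the new prompt already committed to the KV cache."""
    return len(os.path.commonprefix([cached_text, new_prompt]))


cached = "<user>hi</user><asst>hello</asst>"
prompt = cached + "<user>how are you?</user>"
n = reusable_prefix_len(cached, prompt)
# Only prompt[n:] needs prefill; positions covering [0..n) come from the cache.
```

When the rendered history is an exact prefix of the new prompt (the normal chat case), `n == len(cached)` and only the new turn is prefilled, which is why turn latency grows with O(n) attention rather than O(n) prefill.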
🤖 Generated with Claude Code