
fix(chat-cache): comprehensive audit — 7 hidden bugs eliminated#52

Merged
unamedkr merged 1 commit into main from fix/chat-cache-audit
Apr 12, 2026

Conversation

@unamedkr
Collaborator

After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer system. Audited every code path for hidden bugs.

Bugs found (7 total)

| # | Bug | Severity |
|---|-----|----------|
| 1 | Slow-path fallback corrupted KV state — old KV at [n_new..prefix_pos) was never cleared | 🔴 P0 |
| 2 | wasm_reset_chat partial cleanup (g_output_pos / g_stream_count not reset) | 🟡 P1 |
| 3 | wasm_generate (sync path) missed g_stream_count reset | 🟡 P1 |
| 4 | _quant.h wheel header stale — next pip build would miss PR #51 | 🔴 P0 |
| 5 | No overflow surface in WASM — silent failure on long histories | 🟡 P1 |
| 6 | No overflow surface in server — HTTP 200 with garbage instead of 413 | 🟡 P1 |
| 7 | tq_generate_continue cached_text drift not documented | 🟢 P2 |

Fix strategy

Bug 1 (the worst): instead of falling back to tq_generate_continue with corrupted state, return -2 from tq_generate_chat_text. Callers handle it:

  • WASM: auto-reset chat + status message
  • Server: HTTP 413 + OpenAI-compatible error JSON + session reset

Bug 2/3: align reset/init paths for g_output_pos / g_stream_count.

Bug 4: copy quant.h → bindings/python/quantcpp/_quant.h (verified 3 occurrences of tq_generate_chat_text).

Bug 5/6: see Bug 1.

Bug 7: documented in tq_generate_continue header comment.

Audited but safe (no fix needed)

  • Server session concurrency: get_or_create_session is called inside inference_mutex
  • json_extract_string: respects buf_size - 1 bound
  • WASM g_output overflow: tokens dropped from local buffer but js_on_token still fires (JS side gets all output)
  • Stale KV at positions > pos: attention only reads [0..pos], so unread positions are harmless

Verified end-to-end (no regressions)

```
alice/bob interleaved 5 turns each (real assistant replay):
alice: 339 → 514 ms (~50 ms/turn from O(n) attention, expected)
bob: 310 → 518 ms
All turns hit the FAST text-prefix path after turn 1.
```

🤖 Generated with Claude Code

After PRs #48-#51 the chat KV cache reuse path was a complex multi-layer
system. Audited every code path for hidden bugs and fixed all of them.

## Bugs found and fixed

1. **Slow-path fallback corrupted KV state** [P0]
   tq_generate_chat_text's overflow fallback called tq_generate_continue
   on the SAME state that already had old KV at positions [0..prefix_pos).
   New prefill would write [0..n_new) leaving stale [n_new..prefix_pos)
   that subsequent generation might read. Replaced with -2 return code:
   the caller decides (server returns HTTP 413, WASM auto-resets the
   chat and shows a status message).

2. **WASM reset_chat partial cleanup** [P1]
   wasm_reset_chat called quant_chat(NULL) but did not reset
   g_output_pos / g_output[0] / g_stream_count, so the next generation
   would append to stale text from the previous chat. Now resets all.

3. **wasm_generate (sync path) missed g_stream_count reset** [P1]
   The async path zeroed it, the sync path did not. Aligned both.

4. **Wheel header _quant.h stale** [P0]
   bindings/python/quantcpp/_quant.h is .gitignore'd and the next pip
   build would have used quant.h from before PR #51 (no
   tq_generate_chat_text). Synced to current quant.h.

5. **Overflow surface — WASM** [P1]
   Added n == -2 detection in wasm_generate / wasm_generate_async.
   Auto-reset chat and call js_on_status with a clear error message
   so the JS side can show "Context full — chat reset".

6. **Overflow surface — server** [P1]
   Added gen_rc == -2 detection in both streaming and non-streaming
   handlers. Server resets the session's KV state + cached_text + tokens
   and returns HTTP 413 with an OpenAI-compatible error JSON.

7. **tq_generate_continue cached_text drift documentation** [P2]
   Added a header comment explaining that tq_generate_continue is the
   lower-level API and doesn't track cached_text. Higher-level callers
   must use tq_generate_chat_text for cached_text safety.

## Audited but safe

- Server session concurrency: get_or_create_session is called inside
  inference_mutex, so LRU bookkeeping is serialized.
- json_extract_string buffer safety: respects buf_size - 1 bound.
- WASM g_output overflow: tokens dropped from local buffer but
  js_on_token still fires, so JS side gets all output. Acceptable.

## Verified end-to-end

  alice/bob interleaved 5 turns each (real assistant replay):
    alice: 339 → 514 ms (~50 ms/turn growth from O(n) attention)
    bob:   310 → 518 ms (similar)

No regressions; all turns hit the FAST text-prefix path after turn 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr force-pushed the fix/chat-cache-audit branch from 7b7c316 to da825bf, April 12, 2026 00:50
unamedkr merged commit 05b7e11 into main, Apr 12, 2026
3 checks passed
unamedkr deleted the fix/chat-cache-audit branch, April 12, 2026 00:50
unamedkr added a commit that referenced this pull request Apr 12, 2026
Follow-up to PR #52. A fresh code-reading audit found another batch of
hidden bugs in the chat KV cache path. None had visible symptoms in the
happy path; all were latent failure modes that would surface under load,
on long histories, or on memory pressure.

Bugs fixed:

CRITICAL
- B1: tq_generate_continue's sliding-window truncation silently desynced
  cached_text. cached_text claimed the FULL prompt was committed, but
  cached_tokens only held the truncated tail — next turn's text-prefix
  match mapped text bytes to the wrong KV positions. Fix: continue now
  returns -2 on overflow instead of truncating.
- B2: cached_text was updated even when generation returned an error,
  leaving the cache claiming committed bytes that weren't.
- B3: chat_accum_callback realloc failure silently dropped tokens AND
  skipped the user's stream callback — broken UX + corrupted cached_text.
  Fix: always pass tokens to user_cb; mark accumulator tainted on
  realloc failure; skip cached_text update if tainted.
- B4: server's get_or_create_session didn't NULL-check tq_create_state_ex.
  An OOM made the next call dereference a NULL kv_state.
- B5: CLI cmd_run interactive loop ignored quant_chat return code, so
  context overflow produced an infinite stream of empty replies. Fix:
  catch ChatContextOverflow, drop oldest turn, retry.

HIGH
- B6: server streaming path only branched on rc == -2; rc == -1 produced
  HTTP 200 with finish_reason "stop" and no error info. Now sends an
  error delta + finish_reason "error".
- B7: server reused an existing session even when the request changed
  kv_type / value_quant_bits — old quantized blocks would be
  misinterpreted. Now detects the change and rebuilds state.
- B8: WASM wasm_load_model didn't reset g_generating. After a
  page-reload mid-stream, every subsequent generate call early-returned
  -1 forever.
- B9: rep_penalty was silently ignored in tq_generate_chat_text's FAST
  path (slow path applied it). Now mirrors the slow path.
- B10: Python Model.chat() ignored the C return value; -2 / -1 surfaced
  as empty token streams. Now raises ChatContextOverflow / RuntimeError.

MEDIUM
- Removed dead `update_cache:` label.
- Synced bindings/python/quant.h (sdist staging snapshot).

Verification:
- ctest --test-dir build → 35/35 passed
- cmake --build build → all targets clean
- wasm/build.sh → quant.wasm rebuilt clean (320K)
- Python imports + cli --help work

quant.h and src/engine/tq_generate.c are kept in lockstep (every
chat-cache change applied to both).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>