
Commit 92b2f6c

unamedkr and claude committed
paper(working-memory-cliff): §5.6 — CLI seed bug now fixed (a8f6d8a)
Updates the limitations section to reflect that the seed-sweep CLI bug was fixed in the same Karpathy round that discovered it. The seed-controlled sampling sweep itself is still pending (out of time budget for v1) but is now unblocked: bench/niah_seed_sweep.sh works correctly post-fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a8f6d8a commit 92b2f6c

2 files changed

Lines changed: 27 additions & 6 deletions


docs/paper/working-memory-cliff.md

Lines changed: 11 additions & 3 deletions
@@ -204,11 +204,19 @@ All measurements use English wikitext (biographical and encyclopedic prose). Cro
204204

205205
We iterated five formats but finalized on one. A systematic prompt-sensitivity study (template-robustness-as-ceiling-measurement) would strengthen the contribution substantially.
206206

207-
### 5.6 Sampling-noise estimation requires CLI fix
207+
### 5.6 Sampling-noise estimation: CLI bug discovered and fixed mid-round
208208

209-
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The sweep was blocked by a CLI bug we surfaced during this Karpathy round: `tools/quant.c` documents a `-s <seed>` flag in its `--help` output but **does not implement it** — there is no parser case for `-s`, no `rng_seed` field in the generate config, and the underlying `tq_sample_topp` call hardcodes `rng_state = 42` per CLI invocation. As a result, all 60 sampled trials degenerated to "model path = 42", and we could not vary the sampling seed.
209+
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The first attempt failed outright: all 60 trials returned `Loading model from <seed>... cannot open '<seed>'`. The cause was a two-part CLI bug we surfaced during this round. First, `tools/quant.c` advertised a `-s <seed>` flag in its `--help` output, but the parser had no case for it: the `-s` flag was silently dropped, and the *next* argument (the seed value, e.g. `42`) was bound to the model-path slot. Second, the downstream sampler in `tq_generate` hardcoded `rng_state = 42` per CLI invocation, so even if the parser had accepted the flag, sampled outputs would have been identical across "different" seeds.
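
The parser half of the failure can be illustrated with a minimal sketch. This is not the code in `tools/quant.c`: the struct `cli_opts_t`, the function `parse_args`, the `handle_seed` switch, and the argument order `quant -s <seed> <model>` are all hypothetical, chosen only to reproduce the symptom described above (an unrecognised flag is skipped and its value lands in the model-path slot).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical option struct; the real quant.c layout is not shown in this report. */
typedef struct {
    const char *model_path;   /* first positional argument */
    unsigned long long seed;  /* sampling seed, defaults to 42 */
} cli_opts_t;

/* Parse argv into out.
 * handle_seed == 0 mimics the pre-fix parser: there is no "-s" case, so the
 * flag is skipped like any unknown option and its value is then treated as a
 * positional argument.
 * handle_seed == 1 mimics the post-fix parser: "-s" consumes the next argument. */
void parse_args(int argc, char **argv, int handle_seed, cli_opts_t *out)
{
    out->model_path = NULL;
    out->seed = 42ULL;
    for (int i = 1; i < argc; i++) {
        if (handle_seed && strcmp(argv[i], "-s") == 0 && i + 1 < argc) {
            out->seed = strtoull(argv[++i], NULL, 10);
        } else if (argv[i][0] == '-') {
            continue;                      /* unknown flag: silently dropped */
        } else if (out->model_path == NULL) {
            out->model_path = argv[i];     /* first positional wins */
        }
    }
}
```

Under these assumptions, parsing `quant -s 1337 model.gguf` with `handle_seed == 0` leaves `model_path` pointing at `"1337"` and the seed at its default, which matches the `cannot open '<seed>'` message; with `handle_seed == 1` the seed is 1337 and the model path is `"model.gguf"`.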
210210

211-
This is independent of the cliff finding (the FP32-weights control in §4.5 is the much stronger result anyway), but it does mean the 1B cliff cell's apparent 22 pp gap between baseline and compressed remains unverified by per-cell sampling noise. With T=0 greedy decoding, all trials are deterministic and the gap reflects only per-(needle, depth) prompt-level variation, not stochastic sampling noise. We file the CLI fix as a separate quant.cpp issue and leave the seed-controlled sampling sweep to v2 of this report.
211+
We fixed both halves of the bug in a separate commit (`a8f6d8a`):
212+
- Added `unsigned long long rng_seed` to `tq_gen_config_t` in `include/turboquant/tq_engine.h` and the single-header `quant.h`.
213+
- Initialised `rng_seed = 42ULL` in `tq_default_gen_config` (back-compat preserving).
214+
- Wired `rng_state = config->rng_seed ? config->rng_seed : 42ULL` in both `src/engine/tq_generate.c` and `quant.h`.
215+
- Added the `-s` parser case in `tools/quant.c`.
216+
217+
After the fix, `-s 42` and `-s 1337` produce demonstrably different outputs at `-T 0.7` (verified manually), and the no-`-s` default is bit-for-bit identical to `-s 42` (verified for backwards compatibility). All 35 build_metal/ tests still pass.
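
The seed wiring listed above can be sketched in isolation. Only the field name `rng_seed`, the default `42ULL`, and the fallback expression come from the commit description; the struct `gen_config_t`, the helpers `default_gen_config` and `resolve_seed`, and the xorshift64 step `next_u64` are hypothetical stand-ins, since the sampler's actual generator is not described here.

```c
/* Hypothetical stand-in for tq_gen_config_t; only rng_seed is modelled. */
typedef struct {
    unsigned long long rng_seed;
} gen_config_t;

/* Mirrors the described default: rng_seed starts at 42 for back-compat. */
gen_config_t default_gen_config(void)
{
    gen_config_t c;
    c.rng_seed = 42ULL;
    return c;
}

/* The fallback expression quoted in the commit: a zero seed maps to 42. */
unsigned long long resolve_seed(const gen_config_t *config)
{
    return config->rng_seed ? config->rng_seed : 42ULL;
}

/* Assumed PRNG for illustration only (Marsaglia xorshift64, shifts 13/7/17).
 * It requires a non-zero state, which resolve_seed always provides. */
unsigned long long next_u64(unsigned long long *state)
{
    unsigned long long x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}
```

One consequence of the fallback as written is that an explicit seed of 0 cannot be distinguished from the default: both resolve to 42. A seed sweep that enumerates seeds from 0 would therefore duplicate the seed-42 trial, so starting the enumeration at 1 avoids a silently repeated sample.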
218+
219+
The seed-controlled sampling sweep itself is still pending: running it post-fix is straightforward but was outside the time budget for this round. The 1B cliff cell's apparent 22 pp gap between baseline and compressed therefore remains unverified against per-seed sampling noise. With the CLI fix in place, a v2 of this report can simply re-run `bench/niah_seed_sweep.sh`; the script already existed at the time of v1 submission but was blocked by the bug.
212220

213221
## 6. Discussion: What "Long-Context Replaces RAG" Actually Means at the Edge
214222

docs/paper/working-memory-cliff.tex

Lines changed: 16 additions & 3 deletions
@@ -388,13 +388,26 @@ \subsection*{5.5 One prompt format}
388388
We iterated five formats but finalized on one. A systematic prompt-sensitivity study (template-robustness-as-ceiling-measurement) would strengthen the contribution substantially.
389389

390390

391-
\subsection*{5.6 Sampling-noise estimation requires CLI fix}
391+
\subsection*{5.6 Sampling-noise estimation: CLI bug discovered and fixed mid-round}
392392

393393

394-
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The sweep was blocked by a CLI bug we surfaced during this Karpathy round: \texttt{tools/quant.c} documents a \texttt{-s <seed>} flag in its \texttt{--help} output but \textbf{does not implement it} — there is no parser case for \texttt{-s}, no \texttt{rng\_seed} field in the generate config, and the underlying \texttt{tq\_sample\_topp} call hardcodes \texttt{rng\_state = 42} per CLI invocation. As a result, all 60 sampled trials degenerated to "model path = 42", and we could not vary the sampling seed.
394+
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The first attempt failed outright: all 60 trials returned \texttt{Loading model from <seed>... cannot open '<seed>'}. The cause was a two-part CLI bug we surfaced during this round. First, \texttt{tools/quant.c} advertised a \texttt{-s <seed>} flag in its \texttt{--help} output, but the parser had no case for it: the \texttt{-s} flag was silently dropped, and the \textit{next} argument (the seed value, e.g. \texttt{42}) was bound to the model-path slot. Second, the downstream sampler in \texttt{tq\_generate} hardcoded \texttt{rng\_state = 42} per CLI invocation, so even if the parser had accepted the flag, sampled outputs would have been identical across "different" seeds.
395395

396396

397-
This is independent of the cliff finding (the FP32-weights control in §4.5 is the much stronger result anyway), but it does mean the 1B cliff cell's apparent 22 pp gap between baseline and compressed remains unverified by per-cell sampling noise. With T=0 greedy decoding, all trials are deterministic and the gap reflects only per-(needle, depth) prompt-level variation, not stochastic sampling noise. We file the CLI fix as a separate quant.cpp issue and leave the seed-controlled sampling sweep to v2 of this report.
397+
We fixed both halves of the bug in a separate commit (\texttt{a8f6d8a}):
398+
399+
\begin{itemize}
400+
\item Added \texttt{unsigned long long rng\_seed} to \texttt{tq\_gen\_config\_t} in \texttt{include/turboquant/tq\_engine.h} and the single-header \texttt{quant.h}.
401+
\item Initialised \texttt{rng\_seed = 42ULL} in \texttt{tq\_default\_gen\_config} (back-compat preserving).
402+
\item Wired \texttt{rng\_state = config->rng\_seed ? config->rng\_seed : 42ULL} in both \texttt{src/engine/tq\_generate.c} and \texttt{quant.h}.
403+
\item Added the \texttt{-s} parser case in \texttt{tools/quant.c}.
404+
\end{itemize}
405+
406+
407+
After the fix, \texttt{-s 42} and \texttt{-s 1337} produce demonstrably different outputs at \texttt{-T 0.7} (verified manually), and the no-\texttt{-s} default is bit-for-bit identical to \texttt{-s 42} (verified for backwards compatibility). All 35 build\_metal/ tests still pass.
408+
409+
410+
The seed-controlled sampling sweep itself is still pending: running it post-fix is straightforward but was outside the time budget for this round. The 1B cliff cell's apparent 22 pp gap between baseline and compressed therefore remains unverified against per-seed sampling noise. With the CLI fix in place, a v2 of this report can simply re-run \texttt{bench/niah\_seed\_sweep.sh}; the script already existed at the time of v1 submission but was blocked by the bug.
398411

399412

400413
\section*{6. Discussion: What "Long-Context Replaces RAG" Actually Means at the Edge}
