paper(working-memory-cliff): §5.6 — CLI seed bug now fixed (a8f6d8a)
Updates the limitations section to reflect that the seed-sweep CLI
bug was fixed in the same Karpathy round that discovered it. The
seed-controlled sampling sweep itself is still pending (out of time
budget for v1) but is now unblocked — bench/niah_seed_sweep.sh works
correctly post-fix.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs/paper/working-memory-cliff.md (11 additions & 3 deletions)
@@ -204,11 +204,19 @@ All measurements use English wikitext (biographical and encyclopedic prose). Cro
We iterated five formats but finalized on one. A systematic prompt-sensitivity study (template-robustness-as-ceiling-measurement) would strengthen the contribution substantially.
### 5.6 Sampling-noise estimation: CLI bug discovered and fixed mid-round
-
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The sweep was blocked by a CLI bug we surfaced during this Karpathy round: `tools/quant.c` documents a `-s <seed>` flag in its `--help` output but **does not implement it**. There is no parser case for `-s`, no `rng_seed` field in the generate config, and the underlying `tq_sample_topp` call hardcodes `rng_state = 42` per CLI invocation. As a result, all 60 sampled trials degenerated to "model path = 42", and we could not vary the sampling seed.
+
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The first attempt produced a striking failure: all 60 trials returned `Loading model from <seed>... cannot open '<seed>'`. The cause was a CLI bug we surfaced during this round: `tools/quant.c` advertised a `-s <seed>` flag in its `--help` output, but the parser had no case for it. The `-s` flag itself was silently dropped, so its value (e.g. `42`) was parsed as the *next* positional argument and bound to the model-path slot. Separately, the downstream sampler in `tq_generate` was hardcoded to `rng_state = 42` per CLI invocation, so even if the parser had worked, sampled outputs would have been identical across "different" seeds.
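The parser half of the failure can be reproduced in a minimal sketch. Everything here is hypothetical: `cli_args_t`, `parse_args`, and the `handle_seed` switch are illustrative names rather than code from `tools/quant.c`, and the argument order (`-s <seed>` before the model path) is an assumption chosen to match the error message quoted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical CLI state; field names are illustrative, not quant.c's. */
typedef struct {
    const char *model_path;       /* first positional argument seen */
    unsigned long long rng_seed;  /* sampling seed, 42 when -s is absent */
} cli_args_t;

/* Parse argv. With handle_seed == 0 the "-s" case is missing, mimicking the
 * pre-fix behaviour described above: the flag is ignored and its value is
 * then read as an ordinary positional argument. */
static cli_args_t parse_args(int argc, char **argv, int handle_seed) {
    cli_args_t out = { NULL, 42ULL };
    assert(argv != NULL);
    for (int i = 1; i < argc; i++) {
        if (handle_seed && strcmp(argv[i], "-s") == 0 && i + 1 < argc) {
            out.rng_seed = strtoull(argv[++i], NULL, 10);  /* consume value */
        } else if (argv[i][0] == '-') {
            continue;                   /* unrecognised flag: silently dropped */
        } else if (out.model_path == NULL) {
            out.model_path = argv[i];   /* first positional becomes the model */
        }
    }
    return out;
}
```

Under these assumptions, parsing `quant -s 1337 model.gguf` with `handle_seed = 0` leaves `model_path` pointing at `"1337"`, which is the shape of the `cannot open '<seed>'` failure. With `handle_seed = 1` the seed is consumed and the model path is read correctly.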
-
This is independent of the cliff finding (the FP32-weights control in §4.5 is the much stronger result anyway), but it does mean the 1B cliff cell's apparent 22 pp gap between baseline and compressed remains unverified by per-cell sampling noise. With T=0 greedy decoding, all trials are deterministic and the gap reflects only per-(needle, depth) prompt-level variation, not stochastic sampling noise. We file the CLI fix as a separate quant.cpp issue and leave the seed-controlled sampling sweep to v2 of this report.
+
We fixed both halves of the bug in a separate commit (`a8f6d8a`):
+
- Added `unsigned long long rng_seed` to `tq_gen_config_t` in `include/turboquant/tq_engine.h` and the single-header `quant.h`.
+
- Initialised `rng_seed = 42ULL` in `tq_default_gen_config` (back-compat preserving).
+
- Wired `rng_state = config->rng_seed ? config->rng_seed : 42ULL` in both `src/engine/tq_generate.c` and `quant.h`.
+
- Added the `-s` parser case in `tools/quant.c`.
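The seed wiring in the bullets above can be sketched as follows. This is a simplified stand-in, not the actual diff: `gen_config_sketch_t`, `default_gen_config_sketch`, and `initial_rng_state` are hypothetical names, and only the `rng_seed` field and the fallback expression are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for tq_gen_config_t; only the seed field is shown. */
typedef struct {
    unsigned long long rng_seed;
} gen_config_sketch_t;

/* Mirrors the described default: seed 42 keeps pre-fix outputs reproducible. */
static gen_config_sketch_t default_gen_config_sketch(void) {
    gen_config_sketch_t c;
    c.rng_seed = 42ULL;
    return c;
}

/* The fallback from the third bullet: a zero seed is treated as "unset"
 * and mapped back to 42. */
static unsigned long long initial_rng_state(const gen_config_sketch_t *config) {
    assert(config != NULL);
    return config->rng_seed ? config->rng_seed : 42ULL;
}
```

One consequence of the fallback expression as quoted is that a seed of `0` yields the same initial state as `42`, so a seed sweep that wants distinct streams would presumably skip `0`.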
+
+
After the fix, `-s 42` and `-s 1337` produce demonstrably different outputs at `-T 0.7` (verified manually), and the no-`-s` default is bit-for-bit identical to `-s 42` (verified for backwards compatibility). All 35 build_metal/ tests still pass.
+
+
The seed-controlled sampling sweep itself is still pending: running it post-fix is straightforward but was outside the time budget for this round. The 1B cliff cell's apparent 22 pp gap between baseline and compressed therefore remains unverified by per-seed sampling noise. With the CLI fix in place, v2 of this report can re-run `bench/niah_seed_sweep.sh`; the script already existed but was blocked by the bug at the time of v1 submission.
## 6. Discussion: What "Long-Context Replaces RAG" Actually Means at the Edge
docs/paper/working-memory-cliff.tex (16 additions & 3 deletions)
@@ -388,13 +388,26 @@ \subsection*{5.5 One prompt format}
We iterated five formats but finalized on one. A systematic prompt-sensitivity study (template-robustness-as-ceiling-measurement) would strengthen the contribution substantially.
\subsection*{5.6 Sampling-noise estimation: CLI bug discovered and fixed mid-round}
-
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The sweep was blocked by a CLI bug we surfaced during this Karpathy round: \texttt{tools/quant.c} documents a \texttt{-s <seed>} flag in its \texttt{--help} output but \textbf{does not implement it}. There is no parser case for \texttt{-s}, no \texttt{rng\_seed} field in the generate config, and the underlying \texttt{tq\_sample\_topp} call hardcodes \texttt{rng\_state = 42} per CLI invocation. As a result, all 60 sampled trials degenerated to ``model path = 42'', and we could not vary the sampling seed.
+
We initially planned a 60-trial random-seed sweep at the cliff cells (1B Q8 ctx=1024 and 3B Q4 ctx=1280) at temperature 0.7 to estimate sampling noise around the apparent 22 pp delta between FP32 and 6.4× compressed at the 1B cliff. The first attempt produced a striking failure: all 60 trials returned \texttt{Loading model from <seed>... cannot open '<seed>'}. The cause was a CLI bug we surfaced during this round: \texttt{tools/quant.c} advertised a \texttt{-s <seed>} flag in its \texttt{--help} output, but the parser had no case for it. The \texttt{-s} flag itself was silently dropped, so its value (e.g. \texttt{42}) was parsed as the \textit{next} positional argument and bound to the model-path slot. Separately, the downstream sampler in \texttt{tq\_generate} was hardcoded to \texttt{rng\_state = 42} per CLI invocation, so even if the parser had worked, sampled outputs would have been identical across ``different'' seeds.
-
This is independent of the cliff finding (the FP32-weights control in §4.5 is the much stronger result anyway), but it does mean the 1B cliff cell's apparent 22 pp gap between baseline and compressed remains unverified by per-cell sampling noise. With T=0 greedy decoding, all trials are deterministic and the gap reflects only per-(needle, depth) prompt-level variation, not stochastic sampling noise. We file the CLI fix as a separate quant.cpp issue and leave the seed-controlled sampling sweep to v2 of this report.
+
We fixed both halves of the bug in a separate commit (\texttt{a8f6d8a}):
+
+
\begin{itemize}
+
\item Added \texttt{unsigned long long rng\_seed} to \texttt{tq\_gen\_config\_t} in \texttt{include/turboquant/tq\_engine.h} and the single-header \texttt{quant.h}.
+
\item Initialised \texttt{rng\_seed = 42ULL} in \texttt{tq\_default\_gen\_config} (back-compat preserving).
+
\item Wired \texttt{rng\_state = config->rng\_seed ? config->rng\_seed : 42ULL} in both \texttt{src/engine/tq\_generate.c} and \texttt{quant.h}.
+
\item Added the \texttt{-s} parser case in \texttt{tools/quant.c}.
+
\end{itemize}
+
+
+
After the fix, \texttt{-s 42} and \texttt{-s 1337} produce demonstrably different outputs at \texttt{-T 0.7} (verified manually), and the no-\texttt{-s} default is bit-for-bit identical to \texttt{-s 42} (verified for backwards compatibility). All 35 build\_metal/ tests still pass.
+
+
+
The seed-controlled sampling sweep itself is still pending: running it post-fix is straightforward but was outside the time budget for this round. The 1B cliff cell's apparent 22 pp gap between baseline and compressed therefore remains unverified by per-seed sampling noise. With the CLI fix in place, v2 of this report can re-run \texttt{bench/niah\_seed\_sweep.sh}; the script already existed but was blocked by the bug at the time of v1 submission.
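
As a rough guide to what such a sweep could resolve, consider a back-of-envelope bound. It assumes, beyond anything stated in this report, that each trial is an independent pass/fail draw with success rate $p$ and that $n = 60$ trials are run per configuration. The standard error of the measured accuracy $\hat{p}$ is then
\[
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} \;\le\; \sqrt{\frac{0.25}{60}} \approx 0.065,
\]
or about 6.5 pp in the worst case ($p = 0.5$). For the difference between two independently measured configurations, the worst-case standard error is $\sqrt{2} \times 0.065 \approx 0.091$, about 9.1 pp. Under these assumptions a 22 pp gap would sit roughly 2.4 standard errors from zero, so a 60-trial sweep could plausibly separate it from sampling noise, though not with a wide margin.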
\section*{6. Discussion: What "Long-Context Replaces RAG" Actually Means at the Edge}