
Commit 28af5f6

unamedkr and claude committed
phase 3 day 1: RLV harness skeleton + project doc + integration gate
Project documentation:
- docs/phase3_rlv_challenge.md (canonical source of truth, 350+ lines)
  - Problem framing (RAG silent hallucination + long-context cliff)
  - 5-stage architecture mapped from human cognitive retrieval pattern
  - Why this is feasible *now* with quant.cpp specifically
  - Day 1-7 plan with Karpathy gates per day
  - Files-and-directory layout
  - Reading order if a future Claude Code session loads this project
  - Day 1 Karpathy log section
- ~/.claude/projects/.../memory/project_phase3_rlv.md (memory entry)
- MEMORY.md updated with Phase 3 RLV as the top item

Harness skeleton (bench/rlv/):
- README.md pointing back to the project doc
- rlv_orchestrator.py: 5-stage flow controller, end-to-end answer_question
- stages/_llm.py: shared HTTP client for quant-server
  - start_server() / stop_server() lifecycle
  - check_cliff_budget() enforcing the cliff invariant
  - llm_call() via /v1/chat/completions with tolerant system prompt
- stages/gist.py: Stage 1 — chunked summarisation (~500 char chunks)
- stages/locator.py: Stage 2 — outline + question -> chunk pointer
- stages/lookup.py: Stage 3 — region + question -> answer (minimal prompt)
- stages/verifier.py: Stage 4 — gist + answer -> {confident, unsure, contradicted}
- stages/researcher.py: Stage 5 — retry with different region (max 3)
- tests/smoke_test.py: D1 gate — orchestrator on a 4-section synthetic doc

D1 gate result:
- Integration ✅: pipeline runs end-to-end without crashing
- Accuracy ❌: the smoke test picks the wrong entity because the gist summaries are too vague for the locator to discriminate. The model picks the first entity mentioned in the read region (Maria Santos / CEO) instead of the question target (John Williams / CFO). This is exactly the Phase 2B primacy-bias failure mode — and it's what RLV is supposed to fix by isolating chunks to single-section reads.

Lessons surfaced and embedded in code comments:
- Subprocess stdout/stderr need stderr=STDOUT so the streams merge in bash 2>&1 order and the model-output parser can find the --- delimiters (see the sketch below).
- Llama-3.2-3B-Q4 in chat mode emits "## Step 1: ..." reasoning chains unless given a short, direct system prompt. Fighting structured formats (TOPICS:/CHUNK:/VERDICT:) is counterproductive — use direct natural-language questions and tolerant parsers instead.
- Server-based architecture (quant-server) is mandatory: per-call subprocess start = ~50s model reload overhead = 5 min per question for a 5-stage pipeline. With the server it's ~10s per call.
- Primacy bias kicks in at sub-cliff sizes too. Chunking even small docs to ~500 chars is necessary for RLV's locator to have anything to choose between.

D2 plan (next): pivot the locator to use first-100-char chunk snippets as the index instead of model-written summaries. Direct text extraction beats LLM summarization for the indexing signal the locator needs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7646711 commit 28af5f6
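A minimal sketch of the stderr-merging pattern from the first lesson above; the helper name and the `---` delimiter convention are illustrative, not the actual harness code:

```python
import subprocess

def run_and_capture(cmd: list[str]) -> str:
    """Run a command with stderr merged into stdout, preserving write order."""
    # stderr=subprocess.STDOUT interleaves the two streams in the order they
    # were written, like `cmd 2>&1` in bash. Capturing them separately would
    # reorder the "---" delimiters relative to the model output, breaking the
    # downstream parser.
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout
```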

11 files changed

Lines changed: 1295 additions & 0 deletions

File tree

bench/rlv/README.md

Lines changed: 38 additions & 0 deletions
# RLV — Read-Locate-Verify document QA

A 5-stage human-cognition-inspired document QA architecture built on top of `quant.cpp`. The challenge, motivation, architecture, and project plan are in **[`docs/phase3_rlv_challenge.md`](../../docs/phase3_rlv_challenge.md)** at the repo root — read that first if you've never seen this work.

## Quickstart

```bash
# From the repo root
python3 bench/rlv/rlv_orchestrator.py \
    --doc bench/data/wikitext2_test.txt \
    --question "Who is Robert Boulter?" \
    --model models/Llama-3.2-3B-Instruct-Q8_0.gguf
```

## Layout

```
bench/rlv/
├── README.md               # this file
├── rlv_orchestrator.py     # main entry point
├── stages/
│   ├── __init__.py
│   ├── gist.py             # Stage 1: chunked summarisation → outline
│   ├── locator.py          # Stage 2: outline + question → region pointer
│   ├── lookup.py           # Stage 3: region.kv + question → answer
│   ├── verifier.py         # Stage 4: gist + answer → verdict
│   └── researcher.py       # Stage 5: retry with a different region
├── prompts/                # template prompts (gist/locator/lookup/verify)
├── eval/
│   ├── eval_acme.py        # D3: v0.12 Acme reproduction
│   └── eval_stress.py      # D5: 8000-token stress test
└── tests/
    └── smoke_test.py
```

## Cliff invariant

Every stage's prompt MUST be ≤ **1024 tokens** for Llama-3.2-3B-Q4 (the cliff measured in Phase 1B). The orchestrator enforces this in `_check_cliff_budget()`.
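A minimal sketch of what this check might look like, assuming a conservative ~4-characters-per-token heuristic; the real enforcement lives in `stages/_llm.py` and may count tokens differently:

```python
CLIFF_TOKENS = 1024      # Phase 1B measured cliff for Llama-3.2-3B-Q4
CHARS_PER_TOKEN = 4      # assumed conservative heuristic, not a real tokenizer

def check_cliff_budget(prompt: str, stage: str) -> None:
    """Raise if a stage's prompt risks crossing the context cliff."""
    est_tokens = len(prompt) / CHARS_PER_TOKEN
    if est_tokens > CLIFF_TOKENS:
        raise ValueError(
            f"[{stage}] prompt ~{est_tokens:.0f} tokens exceeds the "
            f"{CLIFF_TOKENS}-token cliff budget; shrink the chunk or outline"
        )
```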

bench/rlv/rlv_orchestrator.py

Lines changed: 169 additions & 0 deletions
#!/usr/bin/env python3
"""RLV (Read-Locate-Verify) document QA orchestrator.

Implements the 5-stage architecture from docs/phase3_rlv_challenge.md,
plus a final non-LLM output step:

Stage 1 GIST     — chunked summarisation pass → structured outline
Stage 2 LOCATOR  — outline + question → region pointer
Stage 3 LOOKUP   — region + question → tentative answer
Stage 4 VERIFY   — gist + answer → {confident, unsure, contradicted}
Stage 5 RESEARCH — retry with different region if verify fails
Output           — calibrated final answer (confident or explicit uncertainty)

Cliff invariant (see docs/phase3_rlv_challenge.md §3.2): every stage's
prompt must be ≤ 1024 tokens for Llama-3.2-3B-Q4. The harness enforces
this through stages._llm.check_cliff_budget().

Usage:
    python3 bench/rlv/rlv_orchestrator.py \\
        --doc bench/data/wikitext2_test.txt \\
        --question "Who is Robert Boulter?"

For evals see bench/rlv/eval/.
"""
import argparse
import json
import sys
import time
from pathlib import Path
from typing import Optional

# Make 'stages' importable when running from anywhere
sys.path.insert(0, str(Path(__file__).resolve().parent))

from stages import gist as gist_stage
from stages import locator as locator_stage
from stages import lookup as lookup_stage
from stages import verifier as verifier_stage
from stages import researcher as researcher_stage


def answer_question(
    doc_text: str,
    question: str,
    *,
    doc_id: str = "doc",
    cached_gist: Optional[gist_stage.Gist] = None,
    verbose: bool = True,
) -> dict:
    """Run the full RLV pipeline. Returns a dict with the final answer
    and per-stage diagnostic info."""
    t_start = time.time()
    timings = {}

    # Stage 1: GIST (or use cached one)
    t0 = time.time()
    if cached_gist is not None:
        gist = cached_gist
        if verbose:
            print(f"[stage 1] using cached gist ({len(gist.chunks)} chunks)")
    else:
        if verbose:
            print(f"[stage 1] building gist for doc_id={doc_id}, len={len(doc_text)} chars")
        gist = gist_stage.build_gist(doc_text, doc_id=doc_id, verbose=verbose)
    timings["stage1_gist"] = time.time() - t0

    # Stage 2: LOCATOR
    t0 = time.time()
    if verbose:
        print(f"[stage 2] locating question: {question!r}")
    region = locator_stage.locate(question, gist, verbose=verbose)
    timings["stage2_locator"] = time.time() - t0
    if verbose:
        print(f"[stage 2] -> chunk {region.chunk_id} (confidence={region.confidence})")

    # Stage 3: LOOKUP
    t0 = time.time()
    if verbose:
        print(f"[stage 3] reading chunk {region.chunk_id}")
    look = lookup_stage.lookup(question, region, doc_text, verbose=verbose)
    timings["stage3_lookup"] = time.time() - t0
    if verbose:
        print(f"[stage 3] -> answer: {look.answer[:80]!r}")

    # Stage 4: VERIFY
    t0 = time.time()
    if verbose:
        print("[stage 4] verifying against gist")
    ver = verifier_stage.verify(question, look.answer, gist, verbose=verbose)
    timings["stage4_verifier"] = time.time() - t0
    if verbose:
        print(f"[stage 4] -> verdict: {ver.verdict} ({ver.reason})")

    # Stage 5: RESEARCH (retries internally only if verify failed)
    t0 = time.time()
    research = researcher_stage.research(
        question, look, ver, gist, doc_text, verbose=verbose,
    )
    timings["stage5_research"] = time.time() - t0

    # Output — format the final answer based on the verdict
    if research.final_verdict == "CONFIDENT":
        final_text = research.final_answer
        confidence = "high"
    elif research.final_verdict == "EXHAUSTED":
        final_text = (
            f"I'm not fully confident in any answer to your question. The closest "
            f"information I found is: {research.final_answer}"
        )
        confidence = "low"
    else:
        final_text = research.final_answer
        confidence = "medium"

    timings["total"] = time.time() - t_start

    return {
        "question": question,
        "final_answer": final_text,
        "confidence": confidence,
        "research": {
            "verdict": research.final_verdict,
            "n_retries": research.n_retries,
            "attempts": research.attempts,
        },
        "timings": timings,
        "gist_n_chunks": len(gist.chunks),
    }


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--doc", required=True, type=Path,
                        help="Path to the document text file")
    parser.add_argument("--question", required=True, type=str,
                        help="The question to answer")
    parser.add_argument("--doc-id", default=None, type=str)
    parser.add_argument("--quiet", action="store_true",
                        help="Suppress per-stage diagnostics")
    parser.add_argument("--json", action="store_true",
                        help="Output JSON instead of human text")
    args = parser.parse_args()

    doc_text = args.doc.read_text(encoding="utf-8", errors="replace")
    doc_id = args.doc_id or args.doc.stem

    result = answer_question(
        doc_text, args.question,
        doc_id=doc_id, verbose=not args.quiet,
    )

    if args.json:
        print(json.dumps(result, indent=2, default=str))
    else:
        print("\n" + "=" * 70)
        print(f"QUESTION: {result['question']}")
        print(f"ANSWER: {result['final_answer']}")
        print(f"CONFIDENCE: {result['confidence']}")
        print(f"VERDICT: {result['research']['verdict']}")
        print(f"RETRIES: {result['research']['n_retries']}")
        print(f"GIST CHUNKS: {result['gist_n_chunks']}")
        print(f"TOTAL TIME: {result['timings']['total']:.1f}s")
        print("  " + " | ".join(
            f"{k}={v:.1f}s" for k, v in result["timings"].items() if k != "total"
        ))
        print("=" * 70)


if __name__ == "__main__":
    main()
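A sketch of programmatic use, amortising the slow Stage 1 gist across several questions via the `cached_gist` parameter; this hypothetical driver assumes it runs from `bench/rlv/` so `stages` and `rlv_orchestrator` are importable:

```python
# Hypothetical driver: build the gist once, answer many questions against it.
from pathlib import Path

from stages import gist as gist_stage
from rlv_orchestrator import answer_question

doc_text = Path("../data/wikitext2_test.txt").read_text(
    encoding="utf-8", errors="replace")

# Stage 1 (the chunked-summarisation pass) runs once up front...
gist = gist_stage.build_gist(doc_text, doc_id="wikitext2_test", verbose=False)

# ...then each question skips straight to the locator.
for q in ["Who is Robert Boulter?", "What series did he appear in?"]:
    result = answer_question(doc_text, q, doc_id="wikitext2_test",
                             cached_gist=gist, verbose=False)
    print(q, "->", result["final_answer"], f"({result['confidence']})")
```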

bench/rlv/stages/__init__.py

Lines changed: 16 additions & 0 deletions
"""RLV stages — see docs/phase3_rlv_challenge.md §3.1 for the full architecture.

Stage layering (each stage depends on the previous):
    gist       → produces a structured outline (Stage 1)
    locator    → outline + question → region pointer (Stage 2)
    lookup     → region + question → tentative answer (Stage 3)
    verifier   → gist + answer → verdict (Stage 4)
    researcher → retry locator with a different region (Stage 5)
"""
from . import gist
from . import locator
from . import lookup
from . import verifier
from . import researcher

__all__ = ["gist", "locator", "lookup", "verifier", "researcher"]
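The orchestrator only touches a few attributes on each stage's return value. A hypothetical sketch of those interfaces, inferred from the attribute accesses in rlv_orchestrator.py; the real dataclasses live in the stage modules and may carry more fields:

```python
from dataclasses import dataclass, field

@dataclass
class Gist:             # stages/gist.py, Stage 1 output
    chunks: list        # outline entries; len(gist.chunks) is reported

@dataclass
class Region:           # stages/locator.py, Stage 2 output
    chunk_id: int
    confidence: str

@dataclass
class Lookup:           # stages/lookup.py, Stage 3 output
    answer: str

@dataclass
class Verdict:          # stages/verifier.py, Stage 4 output
    verdict: str        # "confident" | "unsure" | "contradicted"
    reason: str

@dataclass
class Research:         # stages/researcher.py, Stage 5 output
    final_verdict: str  # e.g. "CONFIDENT" or "EXHAUSTED"
    final_answer: str
    n_retries: int
    attempts: list = field(default_factory=list)
```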
