
Commit 361751d

unamedkr and claude committed
phase 3 day 2: D2 gate PASSED — clean first-pass on Acme smoke test
The smoke test now returns a CONFIDENT answer of "John Williamlims" (= John Williams + Q4 visual jitter) on the first attempt with zero retries. Total time: 77 seconds end-to-end, including 47s of server start.

Stage 1 GIST     -> 3 chunks of 500 chars, head_text + regex entities
Stage 2 LOCATOR  -> chunk 0 (parser fallback; works for the smoke test)
Stage 3 LOOKUP   -> "The cchieef finnaancial ofoffficcer (CCF) of Acme Robbottic is John Williamlims."
Stage 4 VERIFY   -> CONFIDENT (literal: 5/6 key terms found in region)
Stage 5 RESEARCH -> not invoked (no retry needed)
Stage 6 OUTPUT   -> high confidence, returned directly

Four redesigns made the gate pass:

1. Gist no-LLM path (stages/gist.py)
   - Added a head_text field (first 200 chars of chunk text)
   - Regex entity extraction (capitalized words + numbers + acronyms)
   - LLM summary path made optional (use_llm=False default)
   - to_outline_text() now uses head_text as the primary locator signal
   - Day 1's LLM-generated summaries were too generic ("This section is about Acme Robotics, a company that...") to discriminate; raw text is a much better locator signal.

2. Citation-grounded verifier (stages/verifier.py)
   - Pivoted from "verify against the gist summary" to "verify against the actual lookup region text"
   - _literal_verify(): extracts key terms (multi-cap names, single-cap words, numbers) from the answer and checks each against the region using fuzzy substring matching with a per-word prefix fallback
   - _fuzzy_word_in_region(): handles Q4 visual jitter — "Williams" matches "williamlims" via the 5-char prefix "willi"
   - _fuzzy_in_region(): multi-word terms require ≥50% of words to match
   - LLM verifier kept as a fallback when the literal check is ambiguous

3. Acronym expansion in the orchestrator (rlv_orchestrator.py)
   - Diagnosed by direct test: "CFO" under Q4 jitter renders as "ccf", which the model can't distinguish from "ceo" → it returns the CEO
   - Added an _expand_acronyms() table for CFO/CEO/CTO/COO/CIO/CMO/CDO/HR/R&D/IPO; expands to "chief financial officer (CFO)" etc.
   - Verified side by side: the same model on the same chunk now returns "John Williamlims" instead of "Maria Santos"

4. Extractive lookup prompt (stages/lookup.py)
   - Reframed from "answer the question" to "Quote the single sentence from the text above that answers this question"
   - Extractive framing forces span selection over summarisation, which sidesteps the Phase 2B primacy bias (the model picking the first-mentioned entity rather than the question target)

Day 2 lessons embedded in code comments:
- Q4 visual jitter on ALL-CAPS acronyms is a real failure mode — preprocessing acronym expansion is required for business-domain RAG.
- LLM-generated gist summaries are too generic to discriminate; raw chunk head_text is a better locator signal.
- Citation-grounded verification (reading the actual region) is much more reliable than gist-summary verification.
- Extractive lookup framing sidesteps the Phase 2B primacy bias.
- Per-word fuzzy matching with a prefix fallback handles Q4 jitter on identifiers reliably.

Open issue for Day 3: the locator parser still fails on most calls (the model emits "## Step 1: ..." reasoning chains even with the system prompt). It currently falls back to chunk 0, which happened to contain the answer for the smoke test but won't generalise. D3 plan: redesign the locator as a non-LLM hybrid (keyword overlap with chunk head_text) as the primary signal, with the LLM as a fallback. Then run the v0.12 Acme 7-question benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
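The verifier.py diff is not included in the view below, so here is a minimal sketch of the two fuzzy-matching helpers described above. The helper names, the 5-char prefix rule, and the ≥50% word threshold come from the commit message; the tokenisation, the constant name PREFIX_CHARS, and all other details are assumptions, not the committed code.

import re

PREFIX_CHARS = 5  # assumed name; the message cites the 5-char prefix "willi"

def _fuzzy_word_in_region(word: str, region: str) -> bool:
    # Exact case-insensitive substring first, then the prefix fallback:
    # "Williams" matches "williamlims" because "willi" is a substring.
    w, r = word.lower(), region.lower()
    if w in r:
        return True
    return len(w) >= PREFIX_CHARS and w[:PREFIX_CHARS] in r

def _fuzzy_in_region(term: str, region: str) -> bool:
    # Multi-word terms pass when at least 50% of their words are found.
    words = [w for w in re.findall(r"[A-Za-z0-9]+", term) if len(w) > 1]
    if not words:
        return False
    hits = sum(_fuzzy_word_in_region(w, region) for w in words)
    return hits / len(words) >= 0.5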
1 parent 28af5f6 commit 361751d

6 files changed

Lines changed: 387 additions & 139 deletions


bench/rlv/rlv_orchestrator.py

Lines changed: 43 additions & 3 deletions
@@ -23,11 +23,40 @@
 """
 import argparse
 import json
+import re
 import sys
 import time
 from dataclasses import asdict
 from pathlib import Path
 
+# Day 2 finding: Llama-3.2-3B-Q4 in chat mode confuses ALL-CAPS acronyms
+# under Q4 visual jitter. The model renders "CFO" as "ccf" and can't
+# distinguish it from "CEO" → "ceoce". Result: asking "Who is the CFO?"
+# returns the CEO. Asking "Who is the chief financial officer?" returns
+# the right person. We pre-expand common acronyms before sending to any
+# stage so the model has the full term to anchor on.
+ACRONYM_EXPANSIONS = {
+    r"\bCFO\b": "chief financial officer (CFO)",
+    r"\bCEO\b": "chief executive officer (CEO)",
+    r"\bCTO\b": "chief technology officer (CTO)",
+    r"\bCOO\b": "chief operating officer (COO)",
+    r"\bCIO\b": "chief information officer (CIO)",
+    r"\bCMO\b": "chief marketing officer (CMO)",
+    r"\bCDO\b": "chief data officer (CDO)",
+    r"\bHR\b": "human resources (HR)",
+    r"\bR&D\b": "research and development (R&D)",
+    r"\bIPO\b": "initial public offering (IPO)",
+}
+
+
+def _expand_acronyms(text: str) -> str:
+    """Expand ALL-CAPS acronyms to full term + parenthesised acronym so
+    the model has both forms to match against (resilient to Q4 jitter)."""
+    out = text
+    for pattern, replacement in ACRONYM_EXPANSIONS.items():
+        out = re.sub(pattern, replacement, out)
+    return out
+
 # Make 'stages' importable when running from anywhere
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
@@ -51,6 +80,13 @@ def answer_question(
     t_start = time.time()
     timings = {}
 
+    # Pre-process the question: expand acronyms (CFO -> "chief financial
+    # officer (CFO)") so Q4 visual jitter doesn't confuse the model.
+    original_question = question
+    question = _expand_acronyms(question)
+    if verbose and question != original_question:
+        print(f"[preprocess] expanded acronyms: {original_question!r} -> {question!r}")
+
     # Stage 1: GIST (or use cached one)
     t0 = time.time()
     if cached_gist is not None:
@@ -81,11 +117,15 @@ def answer_question(
     if verbose:
         print(f"[stage 3] -> answer: {look.answer[:80]!r}")
 
-    # Stage 4: VERIFY
+    # Stage 4: VERIFY (citation-grounded against the lookup region)
     t0 = time.time()
     if verbose:
-        print(f"[stage 4] verifying against gist")
-    ver = verifier_stage.verify(question, look.answer, gist, verbose=verbose)
+        print(f"[stage 4] verifying answer against region (citation-grounded)")
+    ver = verifier_stage.verify(
+        question, look.answer, gist,
+        region_text=look.region_text,
+        verbose=verbose,
+    )
     timings["stage4_verifier"] = time.time() - t0
     if verbose:
         print(f"[stage 4] -> verdict: {ver.verdict} ({ver.reason})")

bench/rlv/stages/gist.py

Lines changed: 94 additions & 79 deletions
@@ -1,64 +1,59 @@
 """Stage 1: GIST.
 
-Read the document chunk by chunk (each chunk sized below the cliff budget)
-and produce a structured outline. The outline is small (~500-2000 tokens
-for any-size document) and serves as the *index* for stages 2 and 4.
-
-Output schema (one entry per chunk):
-[
-  {
-    "chunk_id": 0,
-    "char_start": 0,
-    "char_end": 3000,
-    "topics": ["intro", "motivation"],
-    "key_facts": ["released July 2023", "three sizes 7B/13B/70B"],
-    "summary": "Introduction to Llama 2 and its motivation."
-  },
-  ...
-]
+Build a lightweight index of a document. Each chunk gets:
+  - char_start, char_end    : where in the doc
+  - head_text               : first ~200 chars of the actual chunk text
+  - entities                : regex-extracted capitalized words + numbers
+  - summary (optional, LLM) : one-sentence model-written summary
+
+Day 1 lesson: model-written gist summaries are too generic to discriminate
+("This section is about Acme Robotics, a company that..." for every chunk).
+The locator's primary signal is the actual chunk *content*, not a model
+summary. We extract the first ~200 chars of each chunk and let the locator
+match the question against real text.
+
+The LLM-generated summary path is still available via use_llm=True for cases
+where the chunk's head text isn't representative of the section content
+(e.g., a chunk that starts mid-sentence). For the prototype, the no-LLM
+path (use_llm=False) is the default — much faster, much more discriminating.
 """
+import re
 from dataclasses import dataclass, field, asdict
 from typing import List
 
 from . import _llm
 
 # Chunk size in characters. Two constraints:
-# 1. Cliff-safe: each chunk + gist prompt template must fit below ~1024 tokens
+# 1. Cliff-safe: each chunk + lookup prompt template must fit below ~1024 tokens
 # 2. Primacy-bias-safe: each chunk should be small enough that when stage 3
 #    LOOKUP reads ONE chunk, the model doesn't pick the first-mentioned
 #    entity over the question-relevant one. Phase 2B showed this bias kicks
 #    in even well below the cliff. Empirically ~500 chars works.
 # 500 chars ≈ 165 tokens — both constraints satisfied.
 CHUNK_CHARS = 500
 
+# How many leading characters of each chunk to include in the locator's
+# index. Long enough to capture the section's topic, short enough that
+# 10 chunks of head_text fit in one locator prompt below the cliff.
+HEAD_TEXT_CHARS = 200
 
-# Use direct natural-language questions instead of structured format
-# requests — Llama-3.2-3B-Q4 in chat mode emits reasoning chains when
-# asked for structured output but answers cleanly to direct questions.
-# We make TWO small calls per chunk (summary + entities) and parse the
-# free-text responses with the tolerant extractor below.
+
+# Optional LLM-summary path (off by default in prototype)
 GIST_SUMMARY_PROMPT = """Below is one section of a longer document.
 
 {chunk}
 
 In one short sentence, what is this section about? What are the main people, places, or facts mentioned?"""
 
-GIST_ENTITIES_PROMPT = """Below is one section of a longer document.
-
-{chunk}
-
-List the named people, organizations, places, dates, and numbers mentioned in this section. Comma-separated, no other text."""
-
 
 @dataclass
 class GistChunk:
     chunk_id: int
     char_start: int
     char_end: int
-    topics: List[str] = field(default_factory=list)
+    head_text: str = ""  # first ~200 chars of the chunk's actual text — locator's primary signal
     entities: List[str] = field(default_factory=list)
-    facts: List[str] = field(default_factory=list)
-    summary: str = ""
+    summary: str = ""    # optional LLM-generated summary
 
     def to_dict(self):
         return asdict(self)
@@ -71,55 +66,53 @@ class Gist:
     chunks: List[GistChunk]
 
     def to_outline_text(self) -> str:
-        """Render the gist as a compact text outline that fits in another
-        LLM prompt (used by Stage 2 locator and Stage 4 verifier)."""
+        """Render the gist as a compact text outline that the locator
+        will use to pick a chunk. Day 2 design: use head_text as the
+        primary discriminator, not the model-written summary."""
         lines = []
         for c in self.chunks:
-            lines.append(f"[chunk {c.chunk_id}, chars {c.char_start}-{c.char_end}]")
-            if c.topics: lines.append(f"  topics: {', '.join(c.topics)}")
-            if c.entities: lines.append(f"  entities: {', '.join(c.entities)}")
-            if c.facts: lines.append(f"  facts: {', '.join(c.facts)}")
-            if c.summary: lines.append(f"  summary: {c.summary}")
+            # Compact one-line representation: chunk_id followed by the
+            # head text (which contains real terms the locator can match
+            # against the question).
+            head = c.head_text.replace("\n", " ").strip()
+            if len(head) > HEAD_TEXT_CHARS:
+                head = head[:HEAD_TEXT_CHARS] + "…"
+            lines.append(f"[{c.chunk_id}] {head}")
         return "\n".join(lines)
 
 
-def _parse_summary_response(text: str) -> str:
-    """Take the first non-empty sentence as the summary."""
-    text = text.strip()
-    # If model still emitted "## Step 1:" reasoning, take everything after the
-    # last "##" line and treat as summary.
-    if "## Step" in text:
-        parts = text.split("\n")
-        non_step = [l for l in parts if not l.strip().startswith("##")]
-        text = " ".join(non_step).strip()
-    # Take the first sentence (period-terminated)
-    first_period = text.find(". ")
-    if first_period > 0 and first_period < 200:
-        return text[:first_period + 1]
-    return text[:200]
-
-
-def _parse_entities_response(text: str) -> list[str]:
-    """Extract a comma-separated entity list from a free-text response."""
-    # Strip any preamble like "Here are the entities:" etc.
-    text = text.strip()
-    if ":" in text and len(text.split(":", 1)[0]) < 60:
-        text = text.split(":", 1)[1]
-    # Take only the first line; some models wrap with extra explanation
-    text = text.split("\n", 1)[0]
-    items = [t.strip().rstrip(".,;") for t in text.split(",")]
-    return [i for i in items if 1 < len(i) < 60][:12]
+def _extract_entities(text: str) -> List[str]:
+    """Regex-based entity extraction. No LLM call. Captures capitalized
+    multi-word names + standalone numbers + dates. Tolerant; won't get
+    everything but produces useful index terms cheaply."""
+    # Capitalized 1-3 word sequences (names, places, orgs)
+    cap_seq = re.findall(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+){0,2}\b", text)
+    # Standalone numbers (years, amounts, counts)
+    nums = re.findall(r"\b\d{2,5}\b", text)
+    # ALL-CAPS acronyms (CEO, CFO, CTO, etc.)
+    acronyms = re.findall(r"\b[A-Z]{2,5}\b", text)
+    # Combine, dedupe, cap at 12
+    seen = set()
+    out = []
+    for item in cap_seq + acronyms + nums:
+        item = item.strip()
+        if item and item.lower() not in seen and len(item) > 1:
+            seen.add(item.lower())
+            out.append(item)
+        if len(out) >= 12:
+            break
+    return out
 
 
 def chunk_document(doc_text: str, chunk_chars: int = CHUNK_CHARS) -> List[tuple]:
-    """Split a document into cliff-safe chunks at sentence boundaries.
+    """Split a document into chunks at sentence boundaries.
     Returns a list of (start, end, text) tuples."""
    chunks = []
    pos = 0
    n = len(doc_text)
    while pos < n:
        end = min(pos + chunk_chars, n)
-        # Snap to next sentence boundary
+        # Snap to next sentence boundary so chunks aren't cut mid-sentence
         if end < n:
             sb_next = doc_text.find(". ", end)
             if sb_next > 0 and sb_next - end < 300:
@@ -134,37 +127,59 @@ def build_gist(
     doc_id: str = "doc",
     *,
     chunk_chars: int = CHUNK_CHARS,
+    use_llm: bool = False,
     verbose: bool = False,
 ) -> Gist:
-    """Build a gist of a document by running Stage 1 over each chunk."""
+    """Build a gist of a document.
+
+    Default (use_llm=False): no LLM calls. Just chunk the text, store
+    head_text and regex-extracted entities. Fast and discriminating.
+
+    With use_llm=True: also generate a one-sentence summary per chunk
+    via an LLM call. Costs N extra LLM calls per document but produces
+    a richer index for cases where the chunk head text isn't
+    representative of the section.
+    """
     chunks_raw = chunk_document(doc_text, chunk_chars=chunk_chars)
     if verbose:
-        print(f"[gist] doc_id={doc_id} len={len(doc_text)} chars, {len(chunks_raw)} chunks")
+        print(f"[gist] doc_id={doc_id} len={len(doc_text)} chars, {len(chunks_raw)} chunks "
+              f"({'with LLM summary' if use_llm else 'no-LLM'})")
 
     out_chunks = []
     for i, (start, end, chunk_text) in enumerate(chunks_raw):
-        # Stage 1a: free-text summary
-        s_prompt = GIST_SUMMARY_PROMPT.format(chunk=chunk_text)
-        s_result = _llm.llm_call(s_prompt, max_tokens=80)
-        summary = _parse_summary_response(s_result.text)
+        head_text = chunk_text[:HEAD_TEXT_CHARS].strip()
+        entities = _extract_entities(chunk_text)
 
-        # Stage 1b: entity list
-        e_prompt = GIST_ENTITIES_PROMPT.format(chunk=chunk_text)
-        e_result = _llm.llm_call(e_prompt, max_tokens=80)
-        entities = _parse_entities_response(e_result.text)
+        summary = ""
+        if use_llm:
+            s_prompt = GIST_SUMMARY_PROMPT.format(chunk=chunk_text)
+            s_result = _llm.llm_call(s_prompt, max_tokens=80)
+            summary = _parse_summary_response(s_result.text)
 
         gc = GistChunk(
             chunk_id=i,
             char_start=start,
             char_end=end,
-            topics=[],  # not used in current design — kept for schema stability
+            head_text=head_text,
             entities=entities,
-            facts=[],  # subsumed by summary + entities
             summary=summary,
         )
         out_chunks.append(gc)
         if verbose:
             print(f"[gist] chunk {i+1}/{len(chunks_raw)}: "
-                  f"entities={entities[:3]}..., summary={summary[:60]!r}")
+                  f"head={head_text[:60]!r}..., entities={entities[:4]}")
 
     return Gist(doc_id=doc_id, n_chars=len(doc_text), chunks=out_chunks)
+
+
+def _parse_summary_response(text: str) -> str:
+    """Take the first non-empty sentence as the summary."""
+    text = text.strip()
+    if "## Step" in text:
+        parts = text.split("\n")
+        non_step = [l for l in parts if not l.strip().startswith("##")]
+        text = " ".join(non_step).strip()
+    first_period = text.find(". ")
+    if first_period > 0 and first_period < 200:
+        return text[:first_period + 1]
+    return text[:200]
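The no-LLM extractor is deterministic, so its output can be sanity-checked directly; on an invented sentence (not from the Acme corpus):

>>> _extract_entities("Acme Robotics appointed John Williams as CFO in 2023.")
['Acme Robotics', 'John Williams', 'CFO', '2023']

The capitalized-sequence pattern picks up "Acme Robotics", "John Williams", and "CFO"; the acronym pattern would re-add "CFO" but it is deduplicated case-insensitively; "2023" comes from the standalone-number pattern.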

bench/rlv/stages/lookup.py

Lines changed: 9 additions & 6 deletions
@@ -12,14 +12,17 @@
 from .locator import RegionPointer
 
 
-# EMPIRICAL: this exact format (doc + blank line + question) is what
-# worked in the Phase 3 Day 1 isolation test against Llama-3.2-3B-Q4.
-# Adding any wrap like "Based ONLY on the text above..." breaks the
-# model and causes it to fall back to primacy-bias entity selection.
-# Keep this prompt minimal — every word matters.
+# Day 2 redesign: reframe lookup as EXTRACTIVE ("find and quote the
+# sentence that contains the answer") rather than GENERATIVE ("answer
+# the question"). The extractive framing forces the model to do span
+# selection, which sidesteps primacy bias — instead of summarising the
+# region (which picks the first-mentioned entity) the model has to
+# identify the specific sentence that matches the question's keywords.
 LOOKUP_PROMPT_TEMPLATE = """{region_text}
 
-{question}"""
+Quote the single sentence from the text above that answers this question. Reply with only that sentence, no explanation.
+
+Question: {question}"""
 
 
 @dataclass
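With the new template, the rendered prompt for the smoke test looks roughly like this (region text elided except for the jittered sentence quoted in the commit message; the question is shown post acronym expansion, and its exact smoke-test wording is presumed, not taken from this commit):

[...] The cchieef finnaancial ofoffficcer (CCF) of Acme Robbottic is John Williamlims. [...]

Quote the single sentence from the text above that answers this question. Reply with only that sentence, no explanation.

Question: Who is the chief financial officer (CFO) of Acme Robotics?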

bench/rlv/stages/researcher.py

Lines changed: 5 additions & 1 deletion
@@ -65,7 +65,11 @@ def research(
             break
 
         new_lookup = lookup.lookup(question, new_region, doc_text, verbose=verbose)
-        new_verify = verifier.verify(question, new_lookup.answer, gist, verbose=verbose)
+        new_verify = verifier.verify(
+            question, new_lookup.answer, gist,
+            region_text=new_lookup.region_text,
+            verbose=verbose,
+        )
 
         attempts.append({
             "chunk": new_lookup.chunk_id,
