
Commit 361751d

unamedkr and claude committed
phase 3 day 2: D2 gate PASSED — clean first-pass on Acme smoke test
The smoke test now returns a CONFIDENT answer of "John Williamlims" (= John Williams + Q4 visual jitter) on the first attempt with zero retries. Total time: 77 seconds end-to-end, including 47s of server start.

Stage 1 GIST     -> 3 chunks of 500 chars, head_text + regex entities
Stage 2 LOCATOR  -> chunk 0 (parser fallback; works for the smoke test)
Stage 3 LOOKUP   -> "The cchieef finnaancial ofoffficcer (CCF) of Acme Robbottic is John Williamlims."
Stage 4 VERIFY   -> CONFIDENT (literal: 5/6 key terms found in region)
Stage 5 RESEARCH -> not invoked (no retry needed)
Stage 6 OUTPUT   -> high confidence, returned directly

Four redesigns made the gate pass:

1. Gist no-LLM path (stages/gist.py)
   - Added a head_text field (first 200 chars of chunk text)
   - Regex entity extraction (capitalized words + numbers + acronyms)
   - LLM summary path made optional (use_llm=False default)
   - to_outline_text() now uses head_text as the primary locator signal
   - Day 1's LLM-generated summaries were too generic ("This section is about Acme Robotics, a company that...") to discriminate; raw text is a much better locator signal.

2. Citation-grounded verifier (stages/verifier.py)
   - Pivoted from "verify against the gist summary" to "verify against the actual lookup region text"
   - _literal_verify(): extracts key terms (multi-cap names, single-cap words, numbers) from the answer and checks each against the region using fuzzy substring matching with a per-word prefix fallback
   - _fuzzy_word_in_region(): handles Q4 visual jitter — "Williams" matches "williamlims" via the 5-char prefix "willi"
   - _fuzzy_in_region(): multi-word terms require ≥50% of words to match
   - LLM verifier kept as a fallback when the literal check is ambiguous

3. Acronym expansion in the orchestrator (rlv_orchestrator.py)
   - Diagnosed by direct test: "CFO" under Q4 jitter renders as "ccf", which the model can't distinguish from "ceo" → it returns the CEO
   - Added an _expand_acronyms() table for CFO/CEO/CTO/COO/CIO/CMO/CDO/HR/R&D/IPO; expands to "chief financial officer (CFO)" etc.
   - Verified side by side: the same model on the same chunk now returns "John Williamlims" instead of "Maria Santos"

4. Extractive lookup prompt (stages/lookup.py)
   - Reframed from "answer the question" to "Quote the single sentence from the text above that answers this question"
   - Extractive framing forces span selection over summarisation, which sidesteps the Phase 2B primacy bias (the model picking the first-mentioned entity rather than the question target)

Day 2 lessons embedded in code comments:
- Q4 visual jitter on ALL-CAPS acronyms is a real failure mode — preprocessing acronym expansion is required for business-domain RAG.
- LLM-generated gist summaries are too generic to discriminate; raw chunk head_text is a better locator signal.
- Citation-grounded verification (reading the actual region) is much more reliable than gist-summary verification.
- Extractive lookup framing sidesteps the Phase 2B primacy bias.
- Per-word fuzzy matching with a prefix fallback handles Q4 jitter on identifiers reliably.

Open issue for Day 3: the locator parser still fails on most calls (the model emits "## Step 1: ..." reasoning chains even with the system prompt). It currently falls back to chunk 0, which happened to contain the answer for the smoke test but won't generalise. D3 plan: redesign the locator as a non-LLM hybrid (keyword overlap with chunk head_text) as the primary signal, with the LLM as a fallback. Then run the v0.12 Acme 7-question benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
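The verifier.py diff is not included in the view below, so here is a minimal sketch of the two fuzzy-matching helpers described above. The helper names, the 5-char prefix rule, and the ≥50% word threshold come from the commit message; the tokenisation, the constant name PREFIX_CHARS, and all other details are assumptions, not the committed code.

import re

PREFIX_CHARS = 5  # assumed name; the message cites the 5-char prefix "willi"

def _fuzzy_word_in_region(word: str, region: str) -> bool:
    # Exact case-insensitive substring first, then the prefix fallback:
    # "Williams" matches "williamlims" because "willi" is a substring.
    w, r = word.lower(), region.lower()
    if w in r:
        return True
    return len(w) >= PREFIX_CHARS and w[:PREFIX_CHARS] in r

def _fuzzy_in_region(term: str, region: str) -> bool:
    # Multi-word terms pass when at least 50% of their words are found.
    words = [w for w in re.findall(r"[A-Za-z0-9]+", term) if len(w) > 1]
    if not words:
        return False
    hits = sum(_fuzzy_word_in_region(w, region) for w in words)
    return hits / len(words) >= 0.5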
1 parent 28af5f6 commit 361751d

6 files changed

Lines changed: 387 additions & 139 deletions


bench/rlv/rlv_orchestrator.py

Lines changed: 43 additions & 3 deletions
@@ -23,11 +23,40 @@
 """
 import argparse
 import json
+import re
 import sys
 import time
 from dataclasses import asdict
 from pathlib import Path
 
+# Day 2 finding: Llama-3.2-3B-Q4 in chat mode confuses ALL-CAPS acronyms
+# under Q4 visual jitter. The model renders "CFO" as "ccf" and can't
+# distinguish it from "CEO" → "ceoce". Result: asking "Who is the CFO?"
+# returns the CEO. Asking "Who is the chief financial officer?" returns
+# the right person. We pre-expand common acronyms before sending to any
+# stage so the model has the full term to anchor on.
+ACRONYM_EXPANSIONS = {
+    r"\bCFO\b": "chief financial officer (CFO)",
+    r"\bCEO\b": "chief executive officer (CEO)",
+    r"\bCTO\b": "chief technology officer (CTO)",
+    r"\bCOO\b": "chief operating officer (COO)",
+    r"\bCIO\b": "chief information officer (CIO)",
+    r"\bCMO\b": "chief marketing officer (CMO)",
+    r"\bCDO\b": "chief data officer (CDO)",
+    r"\bHR\b": "human resources (HR)",
+    r"\bR&D\b": "research and development (R&D)",
+    r"\bIPO\b": "initial public offering (IPO)",
+}
+
+
+def _expand_acronyms(text: str) -> str:
+    """Expand ALL-CAPS acronyms to full term + parenthesised acronym so
+    the model has both forms to match against (resilient to Q4 jitter)."""
+    out = text
+    for pattern, replacement in ACRONYM_EXPANSIONS.items():
+        out = re.sub(pattern, replacement, out)
+    return out
+
 # Make 'stages' importable when running from anywhere
 sys.path.insert(0, str(Path(__file__).resolve().parent))
 
@@ -51,6 +80,13 @@ def answer_question(
     t_start = time.time()
     timings = {}
 
+    # Pre-process the question: expand acronyms (CFO -> "chief financial
+    # officer (CFO)") so Q4 visual jitter doesn't confuse the model.
+    original_question = question
+    question = _expand_acronyms(question)
+    if verbose and question != original_question:
+        print(f"[preprocess] expanded acronyms: {original_question!r} -> {question!r}")
+
     # Stage 1: GIST (or use cached one)
     t0 = time.time()
     if cached_gist is not None:
@@ -81,11 +117,15 @@ def answer_question(
     if verbose:
         print(f"[stage 3] -> answer: {look.answer[:80]!r}")
 
-    # Stage 4: VERIFY
+    # Stage 4: VERIFY (citation-grounded against the lookup region)
     t0 = time.time()
     if verbose:
-        print(f"[stage 4] verifying against gist")
-    ver = verifier_stage.verify(question, look.answer, gist, verbose=verbose)
+        print(f"[stage 4] verifying answer against region (citation-grounded)")
+    ver = verifier_stage.verify(
+        question, look.answer, gist,
+        region_text=look.region_text,
+        verbose=verbose,
+    )
     timings["stage4_verifier"] = time.time() - t0
     if verbose:
         print(f"[stage 4] -> verdict: {ver.verdict} ({ver.reason})")

bench/rlv/stages/gist.py

Lines changed: 94 additions & 79 deletions
@@ -1,64 +1,59 @@
 """Stage 1: GIST.
 
-Read the document chunk by chunk (each chunk sized below the cliff budget)
-and produce a structured outline. The outline is small (~500-2000 tokens
-for any-size document) and serves as the *index* for stages 2 and 4.
-
-Output schema (one entry per chunk):
-[
-  {
-    "chunk_id": 0,
-    "char_start": 0,
-    "char_end": 3000,
-    "topics": ["intro", "motivation"],
-    "key_facts": ["released July 2023", "three sizes 7B/13B/70B"],
-    "summary": "Introduction to Llama 2 and its motivation."
-  },
-  ...
-]
+Build a lightweight index of a document. Each chunk gets:
+  - char_start, char_end    : where in the doc
+  - head_text               : first ~200 chars of the actual chunk text
+  - entities                : regex-extracted capitalized words + numbers
+  - summary (optional, LLM) : one-sentence model-written summary
+
+Day 1 lesson: model-written gist summaries are too generic to discriminate
+("This section is about Acme Robotics, a company that..." for every chunk).
+The locator's primary signal is the actual chunk *content*, not a model
+summary. We extract the first ~200 chars of each chunk and let the locator
+match the question against real text.
+
+The LLM-generated summary path is still available via use_llm=True for cases
+where the chunk's head text isn't representative of the section content
+(e.g., a chunk that starts mid-sentence). For the prototype, the no-LLM
+path (use_llm=False) is the default — much faster, much more discriminating.
 """
+import re
 from dataclasses import dataclass, field, asdict
 from typing import List
 
 from . import _llm
 
 # Chunk size in characters. Two constraints:
-# 1. Cliff-safe: each chunk + gist prompt template must fit below ~1024 tokens
+# 1. Cliff-safe: each chunk + lookup prompt template must fit below ~1024 tokens
 # 2. Primacy-bias-safe: each chunk should be small enough that when stage 3
 #    LOOKUP reads ONE chunk, the model doesn't pick the first-mentioned
 #    entity over the question-relevant one. Phase 2B showed this bias kicks
 #    in even well below the cliff. Empirically ~500 chars works.
 # 500 chars ≈ 165 tokens — both constraints satisfied.
 CHUNK_CHARS = 500
 
+# How many leading characters of each chunk to include in the locator's
+# index. Long enough to capture the section's topic, short enough that
+# 10 chunks of head_text fit in one locator prompt below the cliff.
+HEAD_TEXT_CHARS = 200
 
-# Use direct natural-language questions instead of structured format
-# requests — Llama-3.2-3B-Q4 in chat mode emits reasoning chains when
-# asked for structured output but answers cleanly to direct questions.
-# We make TWO small calls per chunk (summary + entities) and parse the
-# free-text responses with the tolerant extractor below.
+
+# Optional LLM-summary path (off by default in prototype)
 GIST_SUMMARY_PROMPT = """Below is one section of a longer document.
 
 {chunk}
 
 In one short sentence, what is this section about? What are the main people, places, or facts mentioned?"""
 
-GIST_ENTITIES_PROMPT = """Below is one section of a longer document.
-
-{chunk}
-
-List the named people, organizations, places, dates, and numbers mentioned in this section. Comma-separated, no other text."""
-
 
 @dataclass
 class GistChunk:
     chunk_id: int
     char_start: int
     char_end: int
-    topics: List[str] = field(default_factory=list)
+    head_text: str = ""  # first ~200 chars of the chunk's actual text — locator's primary signal
     entities: List[str] = field(default_factory=list)
-    facts: List[str] = field(default_factory=list)
-    summary: str = ""
+    summary: str = ""    # optional LLM-generated summary
 
     def to_dict(self):
         return asdict(self)
@@ -71,55 +66,53 @@ class Gist:
     chunks: List[GistChunk]
 
     def to_outline_text(self) -> str:
-        """Render the gist as a compact text outline that fits in another
-        LLM prompt (used by Stage 2 locator and Stage 4 verifier)."""
+        """Render the gist as a compact text outline that the locator
+        will use to pick a chunk. Day 2 design: use head_text as the
+        primary discriminator, not the model-written summary."""
         lines = []
         for c in self.chunks:
-            lines.append(f"[chunk {c.chunk_id}, chars {c.char_start}-{c.char_end}]")
-            if c.topics: lines.append(f"  topics: {', '.join(c.topics)}")
-            if c.entities: lines.append(f"  entities: {', '.join(c.entities)}")
-            if c.facts: lines.append(f"  facts: {', '.join(c.facts)}")
-            if c.summary: lines.append(f"  summary: {c.summary}")
+            # Compact one-line representation: chunk_id followed by the
+            # head text (which contains real terms the locator can match
+            # against the question).
+            head = c.head_text.replace("\n", " ").strip()
+            if len(head) > HEAD_TEXT_CHARS:
+                head = head[:HEAD_TEXT_CHARS] + "…"
+            lines.append(f"[{c.chunk_id}] {head}")
         return "\n".join(lines)
 
 
-def _parse_summary_response(text: str) -> str:
-    """Take the first non-empty sentence as the summary."""
-    text = text.strip()
-    # If model still emitted "## Step 1:" reasoning, take everything after the
-    # last "##" line and treat as summary.
-    if "## Step" in text:
-        parts = text.split("\n")
-        non_step = [l for l in parts if not l.strip().startswith("##")]
-        text = " ".join(non_step).strip()
-    # Take the first sentence (period-terminated)
-    first_period = text.find(". ")
-    if first_period > 0 and first_period < 200:
-        return text[:first_period + 1]
-    return text[:200]
-
-
-def _parse_entities_response(text: str) -> list[str]:
-    """Extract a comma-separated entity list from a free-text response."""
-    # Strip any preamble like "Here are the entities:" etc.
-    text = text.strip()
-    if ":" in text and len(text.split(":", 1)[0]) < 60:
-        text = text.split(":", 1)[1]
-    # Take only the first line; some models wrap with extra explanation
-    text = text.split("\n", 1)[0]
-    items = [t.strip().rstrip(".,;") for t in text.split(",")]
-    return [i for i in items if 1 < len(i) < 60][:12]
+def _extract_entities(text: str) -> List[str]:
+    """Regex-based entity extraction. No LLM call. Captures capitalized
+    multi-word names + standalone numbers + dates. Tolerant; won't get
+    everything but produces useful index terms cheaply."""
+    # Capitalized 1-3 word sequences (names, places, orgs)
+    cap_seq = re.findall(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+){0,2}\b", text)
+    # Standalone numbers (years, amounts, counts)
+    nums = re.findall(r"\b\d{2,5}\b", text)
+    # ALL-CAPS acronyms (CEO, CFO, CTO, etc.)
+    acronyms = re.findall(r"\b[A-Z]{2,5}\b", text)
+    # Combine, dedupe, cap at 12
+    seen = set()
+    out = []
+    for item in cap_seq + acronyms + nums:
+        item = item.strip()
+        if item and item.lower() not in seen and len(item) > 1:
+            seen.add(item.lower())
+            out.append(item)
+        if len(out) >= 12:
+            break
+    return out
 
 
 def chunk_document(doc_text: str, chunk_chars: int = CHUNK_CHARS) -> List[tuple]:
-    """Split a document into cliff-safe chunks at sentence boundaries.
+    """Split a document into chunks at sentence boundaries.
     Returns a list of (start, end, text) tuples."""
    chunks = []
    pos = 0
    n = len(doc_text)
    while pos < n:
        end = min(pos + chunk_chars, n)
-        # Snap to next sentence boundary
+        # Snap to next sentence boundary so chunks aren't cut mid-sentence
         if end < n:
             sb_next = doc_text.find(". ", end)
             if sb_next > 0 and sb_next - end < 300:
@@ -134,37 +127,59 @@ def build_gist(
     doc_id: str = "doc",
     *,
     chunk_chars: int = CHUNK_CHARS,
+    use_llm: bool = False,
     verbose: bool = False,
 ) -> Gist:
-    """Build a gist of a document by running Stage 1 over each chunk."""
+    """Build a gist of a document.
+
+    Default (use_llm=False): no LLM calls. Just chunk the text, store
+    head_text and regex-extracted entities. Fast and discriminating.
+
+    With use_llm=True: also generate a one-sentence summary per chunk
+    via an LLM call. Costs N extra LLM calls per document but produces
+    a richer index for cases where the chunk head text isn't
+    representative of the section.
+    """
     chunks_raw = chunk_document(doc_text, chunk_chars=chunk_chars)
     if verbose:
-        print(f"[gist] doc_id={doc_id} len={len(doc_text)} chars, {len(chunks_raw)} chunks")
+        print(f"[gist] doc_id={doc_id} len={len(doc_text)} chars, {len(chunks_raw)} chunks "
+              f"({'with LLM summary' if use_llm else 'no-LLM'})")
 
     out_chunks = []
     for i, (start, end, chunk_text) in enumerate(chunks_raw):
-        # Stage 1a: free-text summary
-        s_prompt = GIST_SUMMARY_PROMPT.format(chunk=chunk_text)
-        s_result = _llm.llm_call(s_prompt, max_tokens=80)
-        summary = _parse_summary_response(s_result.text)
+        head_text = chunk_text[:HEAD_TEXT_CHARS].strip()
+        entities = _extract_entities(chunk_text)
 
-        # Stage 1b: entity list
-        e_prompt = GIST_ENTITIES_PROMPT.format(chunk=chunk_text)
-        e_result = _llm.llm_call(e_prompt, max_tokens=80)
-        entities = _parse_entities_response(e_result.text)
+        summary = ""
+        if use_llm:
+            s_prompt = GIST_SUMMARY_PROMPT.format(chunk=chunk_text)
+            s_result = _llm.llm_call(s_prompt, max_tokens=80)
+            summary = _parse_summary_response(s_result.text)
 
         gc = GistChunk(
             chunk_id=i,
             char_start=start,
             char_end=end,
-            topics=[],  # not used in current design — kept for schema stability
+            head_text=head_text,
             entities=entities,
-            facts=[],  # subsumed by summary + entities
             summary=summary,
         )
         out_chunks.append(gc)
         if verbose:
             print(f"[gist] chunk {i+1}/{len(chunks_raw)}: "
-                  f"entities={entities[:3]}..., summary={summary[:60]!r}")
+                  f"head={head_text[:60]!r}..., entities={entities[:4]}")
 
     return Gist(doc_id=doc_id, n_chars=len(doc_text), chunks=out_chunks)
+
+
+def _parse_summary_response(text: str) -> str:
+    """Take the first non-empty sentence as the summary."""
+    text = text.strip()
+    if "## Step" in text:
+        parts = text.split("\n")
+        non_step = [l for l in parts if not l.strip().startswith("##")]
+        text = " ".join(non_step).strip()
+    first_period = text.find(". ")
+    if first_period > 0 and first_period < 200:
+        return text[:first_period + 1]
+    return text[:200]
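The no-LLM extractor is deterministic, so its output can be sanity-checked directly; on an invented sentence (not from the Acme corpus):

>>> _extract_entities("Acme Robotics appointed John Williams as CFO in 2023.")
['Acme Robotics', 'John Williams', 'CFO', '2023']

The capitalized-sequence pattern picks up "Acme Robotics", "John Williams", and "CFO"; the acronym pattern would re-add "CFO" but it is deduplicated case-insensitively; "2023" comes from the standalone-number pattern.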

bench/rlv/stages/lookup.py

Lines changed: 9 additions & 6 deletions
@@ -12,14 +12,17 @@
 from .locator import RegionPointer
 
 
-# EMPIRICAL: this exact format (doc + blank line + question) is what
-# worked in the Phase 3 Day 1 isolation test against Llama-3.2-3B-Q4.
-# Adding any wrap like "Based ONLY on the text above..." breaks the
-# model and causes it to fall back to primacy-bias entity selection.
-# Keep this prompt minimal — every word matters.
+# Day 2 redesign: reframe lookup as EXTRACTIVE ("find and quote the
+# sentence that contains the answer") rather than GENERATIVE ("answer
+# the question"). The extractive framing forces the model to do span
+# selection, which sidesteps primacy bias — instead of summarising the
+# region (which picks the first-mentioned entity) the model has to
+# identify the specific sentence that matches the question's keywords.
 LOOKUP_PROMPT_TEMPLATE = """{region_text}
 
-{question}"""
+Quote the single sentence from the text above that answers this question. Reply with only that sentence, no explanation.
+
+Question: {question}"""
 
 
 @dataclass
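With the new template, the rendered prompt for the smoke test looks roughly like this (region text elided except for the jittered sentence quoted in the commit message; the question is shown post acronym expansion, and its exact smoke-test wording is presumed, not taken from this commit):

[...] The cchieef finnaancial ofoffficcer (CCF) of Acme Robbottic is John Williamlims. [...]

Quote the single sentence from the text above that answers this question. Reply with only that sentence, no explanation.

Question: Who is the chief financial officer (CFO) of Acme Robotics?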

bench/rlv/stages/researcher.py

Lines changed: 5 additions & 1 deletion
@@ -65,7 +65,11 @@ def research(
             break
 
         new_lookup = lookup.lookup(question, new_region, doc_text, verbose=verbose)
-        new_verify = verifier.verify(question, new_lookup.answer, gist, verbose=verbose)
+        new_verify = verifier.verify(
+            question, new_lookup.answer, gist,
+            region_text=new_lookup.region_text,
+            verbose=verbose,
+        )
 
         attempts.append({
             "chunk": new_lookup.chunk_id,
