phase 3 day 2: D2 gate PASSED — clean first-pass on Acme smoke test
The smoke test now returns a CONFIDENT answer of "John Williamlims"
(= John Williams + Q4 visual jitter) on the first attempt with zero
retries. Total time: 77 seconds end-to-end including 47s server start.
Stage 1 GIST -> 3 chunks of 500 chars, head_text + regex entities
Stage 2 LOCATOR -> chunk 0 (parser fallback; works for smoke test)
Stage 3 LOOKUP -> "The cchieef finnaancial ofoffficcer (CCF) of Acme Robbottic is John Williamlims."
Stage 4 VERIFY -> CONFIDENT (literal: 5/6 key terms found in region)
Stage 5 RESEARCH -> not invoked (no retry needed)
Stage 6 OUTPUT -> high confidence, returned directly
Four redesigns made the gate pass:
1. Gist no-LLM path (stages/gist.py)
- Added head_text field (first 200 chars of chunk text)
- Regex entity extraction (capitalized words + numbers + acronyms)
- LLM summary path made optional (use_llm=False default)
- to_outline_text() now uses head_text as primary locator signal
- Day 1's LLM-generated summaries were too generic ("This section is
about Acme Robotics, a company that...") to discriminate; raw text
is a much better locator signal.
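A minimal sketch of the no-LLM gist path described above. The dataclass fields and the exact entity regex are assumptions reconstructed from this log, not the committed stages/gist.py:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ChunkGist:
    """Sketch of the gist record; field names assumed from the log."""
    chunk_id: int
    head_text: str                 # first 200 chars of the chunk, primary locator signal
    entities: list = field(default_factory=list)

def gist_chunk(chunk_id: int, text: str) -> ChunkGist:
    # Capitalized (possibly multi-word) names, ALL-CAPS acronyms, and numbers.
    pattern = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*|[A-Z]{2,}|\d+(?:\.\d+)?"
    entities = sorted(set(re.findall(pattern, text)))
    return ChunkGist(chunk_id=chunk_id, head_text=text[:200], entities=entities)

def to_outline_text(gist: ChunkGist) -> str:
    # Raw head_text leads; regex entities are appended as secondary hints.
    return f"[{gist.chunk_id}] {gist.head_text} | entities: {', '.join(gist.entities)}"
```

The point of the sketch: nothing here calls a model, so the gist stage is deterministic and effectively free.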
2. Citation-grounded verifier (stages/verifier.py)
- Pivoted from "verify against gist summary" to "verify against the
actual lookup region text"
- _literal_verify(): extracts key terms (multi-cap names, single-cap
words, numbers) from the answer and checks each against the region
using fuzzy substring matching with per-word prefix fallback
- _fuzzy_word_in_region(): handles Q4 visual jitter — "Williams"
matches "williamlims" via the 5-char prefix "willi"
- _fuzzy_in_region(): multi-word terms require ≥50% of words to match
- LLM verifier kept as fallback when literal check is ambiguous
3. Acronym expansion in orchestrator (rlv_orchestrator.py)
- Diagnosed by direct test: "CFO" under Q4 jitter renders as "ccf"
which the model can't distinguish from "ceo" → returns the CEO
- Added _expand_acronyms() table for CFO/CEO/CTO/COO/CIO/CMO/CDO/HR/
R&D/IPO; expands to "chief financial officer (CFO)" etc.
- Verified by side-by-side: same model on same chunk now returns
"John Williamlims" instead of "Maria Santos"
4. Extractive lookup prompt (stages/lookup.py)
- Reframed from "answer the question" to "Quote the single sentence
from the text above that answers this question"
- Extractive framing forces span selection over summarisation, which
sidesteps the Phase 2B primacy bias (model picking the first
mentioned entity rather than the question target)
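A sketch of the extractive framing; the quoted instruction is taken from the log, while the surrounding prompt scaffolding is assumed rather than copied from stages/lookup.py:

```python
def build_lookup_prompt(region_text: str, question: str) -> str:
    """Extractive framing: ask for a quoted span, never a free-form answer."""
    return (
        f"{region_text}\n\n"
        "Quote the single sentence from the text above that answers this question. "
        "Do not paraphrase or summarise; copy the sentence exactly.\n"
        f"Question: {question}"
    )
```

Because the model must copy a span, a primacy-biased pick of the first-mentioned entity produces a sentence that fails the citation-grounded verifier, instead of a plausible-sounding paraphrase that slips through.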
Day 2 lessons embedded in code comments:
- Q4 visual jitter on ALL-CAPS acronyms is a real failure mode —
preprocessing acronym expansion is required for business-domain RAG.
- LLM-generated gist summaries are too generic to discriminate; raw
chunk head_text is a better locator signal.
- Citation-grounded verification (read the actual region) is much
more reliable than gist-summary verification.
- Extractive lookup framing sidesteps the Phase 2B primacy bias.
- Per-word fuzzy matching with prefix fallback handles Q4 jitter
on identifiers reliably.
Open issue for Day 3: locator parser still fails on most calls (model
emits "## Step 1: ..." reasoning chains even with the system prompt).
Currently falls back to chunk 0, which happened to contain the answer
for the smoke test but won't generalise. D3 plan: redesign locator
with non-LLM hybrid (keyword overlap with chunk head_text) as primary
signal, LLM as fallback. Then run the v0.12 Acme 7-question benchmark.
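The planned keyword-overlap primary signal could look roughly like this; the stopword list, tokenisation, and tie-breaking are assumptions for illustration, since D3 has not been implemented yet:

```python
import re

_STOP = {"the", "a", "an", "of", "is", "who", "what", "where", "when", "in", "for"}

def keyword_overlap_locator(question: str, gists: list) -> int:
    """D3 sketch: pick the chunk whose head_text shares the most question keywords.

    `gists` is a list of (chunk_id, head_text) pairs. On ties or zero overlap,
    the plan is to fall back to an LLM call (not shown here)."""
    q_words = {w for w in re.findall(r"[a-z0-9&]+", question.lower()) if w not in _STOP}
    def score(head_text: str) -> int:
        return len(q_words & set(re.findall(r"[a-z0-9&]+", head_text.lower())))
    return max(gists, key=lambda g: score(g[1]))[0]
```

Unlike the current parser-dependent locator, this path cannot be derailed by "## Step 1: ..." reasoning chains, because no model output needs parsing at all.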
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>