docs: normalize release truth before release-please cut (#102)

PatrickSys · web-flow · commit 9f6170e4e4d2 · 2026-04-14T23:00:37.000+02:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,32 +2,6 @@
 
 ## Unreleased
 
-## [2.1.0](https://github.com/PatrickSys/codebase-context/compare/v1.9.0...v2.1.0) (2026-04-13)
-
-### Features
-
-- **search:** surface chunk intelligence directly in `search_codebase` results, including symbol identity, scope, signature preview, and compact/full response budgeting
-- **map:** upgrade the conventions map with structural skeleton sections and add `map --export` so the compact map can be written to `CODEBASE_MAP.md`
-- **mcp:** rework multi-project routing so one MCP server can serve multiple projects instead of one hardcoded server entry per repo
-- **mcp:** keep explicit `project` as the fallback when the client does not provide enough project context
-- **mcp:** accept repo paths, subproject paths, and file paths as `project` selectors when routing is ambiguous
-
-### Bug Fixes
-
-- **metadata:** require real dependency evidence plus multiple framework indicators before labeling a repo as Next.js or another specialized framework
-- **reranker:** auto-heal corrupted cross-encoder cache entries and surface degraded reranker state in `searchQuality.rerankerStatus`
-- **benchmarks:** harden comparator lanes for cross-platform execution and keep setup failures explicit instead of silently turning them into claims
-- **search:** auto-heal on corrupted index now triggers a background rebuild instead of blocking the search response
-
-### Documentation
-
-- publish the v2.1.0 discovery benchmark rerun with the current gate output: `pending_evidence`, `claimAllowed: false`, `24` frozen tasks, `0.75` average usefulness, and `1822.25` average estimated tokens
-- document the current comparator truth instead of stale assumptions: the public proof still has setup failures plus near-empty comparator outputs on this host, so benchmark win claims remain blocked
-- note the new `searchQuality.tokenEstimate` advisory contract: estimates are based on the final serialized response payload and warnings only appear above the 4K-token threshold
-- simplify the setup story around a roots-first contract: roots-capable multi-project sessions, single-project fallback, and explicit `project` retries
-- clarify that issue #63 fixed the architecture and workspace-aware workflow, but issue #2 is still only partially solved when the client does not provide roots or active-project context
-- remove the repo-local `init` / marker-file story from the public setup guidance
-
 ## [1.9.0](https://github.com/PatrickSys/codebase-context/compare/v1.8.2...v1.9.0) (2026-03-19)
 
 
diff --git a/README.md b/README.md
@@ -20,7 +20,7 @@ Here's what codebase-context does:
 
 One tool call returns all of it. Local-first - your code never leaves your machine by default.
 
-See the [v2.0.0 benchmark](./docs/benchmark.md) for the discovery suite results and current gate truth.
+See the [current discovery benchmark](./docs/benchmark.md) for the checked-in proof results and current gate truth.
 
 ### What it looks like
 
@@ -224,7 +224,7 @@ These are the behaviors that make the most difference day-to-day. Copy, trim wha
 
 ## Links
 
-- [Benchmark](./docs/benchmark.md) — v2.0.0 discovery suite results and gate truth
+- [Benchmark](./docs/benchmark.md) — current discovery suite results and gate truth
 - [Demo](./docs/demo.md) — real CLI walkthrough
 - [Client Setup](./docs/client-setup.md) — per-client config, HTTP setup, local build testing
 - [Capabilities Reference](./docs/capabilities.md) — tool API, retrieval pipeline, decision card schema
diff --git a/docs/benchmark.md b/docs/benchmark.md
@@ -1,6 +1,6 @@
 # Discovery Benchmark
 
-This page documents the current public proof slice for `v2.0.0`.
+This page documents the current public discovery proof from the checked-in result artifacts on `master`.
 It is a discovery benchmark, not an implementation-quality benchmark.
 
 ## Scope
@@ -37,48 +37,44 @@ From `results/gate-evaluation.json`:
 - `claimAllowed`: `false`
 - `totalTasks`: `24`
 - `averageUsefulness`: `0.75`
-- `averageEstimatedTokens`: `903.7083333333334`
+- `averageEstimatedTokens`: `1822.25`
 - `bestExampleUsefulnessRate`: `0.125`
 
 Repo-level outputs from the same rerun:
 
 | Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness |
 | --- | ---: | ---: | ---: | ---: |
-| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 |
-| `excalidraw` | 12 | 0.6667 | 726.75 | 0 |
+| `angular-spotify` | 12 | 0.8333 | 2138.4167 | 0.25 |
+| `excalidraw` | 12 | 0.6667 | 1506.0833 | 0 |
 
 ## Gate Truth
 
 The gate is intentionally still blocked.
 
-- The combined suite now covers both public repos.
-- The release claim is still disallowed because comparator evidence remains incomplete.
-- Missing evidence currently includes:
-  - raw Claude Code baseline metrics
-  - GrepAI metrics
-  - jCodeMunch metrics
-  - codebase-memory-mcp metrics
-  - CodeGraphContext metrics
+- The combined suite covers both public repos.
+- `claimAllowed` remains `false` because comparator evidence still does not support a benchmark-win claim.
+- Two comparator lanes now return `status: "ok"`, but both are effectively near-empty on the frozen tasks and contribute `0` average usefulness.
+- Three comparator lanes still fail setup entirely.
 
 ## Comparator Reality
 
-The current comparator artifact records setup failures, not benchmark wins.
+The current comparator artifact records incomplete comparator evidence, not benchmark wins.
 
 | Comparator | Status | Current reason |
 | --- | --- | --- |
-| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer |
-| `jCodeMunch` | `setup_failed` | MCP server closes during startup |
+| `codebase-memory-mcp` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and `5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |
+| `jCodeMunch` | `setup_failed` | `MCP error -32000: Connection closed` |
 | `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present |
-| `CodeGraphContext` | `setup_failed` | MCP server closes during startup |
-| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment |
+| `CodeGraphContext` | `setup_failed` | `MCP error -32000: Connection closed` |
+| `raw Claude Code` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and only `18.5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |
 
-`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
+`CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
 
 ## Important Limitations
 
 - This benchmark measures discovery usefulness and payload cost only.
 - It does not measure implementation correctness, patch quality, or end-to-end task completion.
-- Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`.
+- Comparator setup remains environment-sensitive, and the checked-in comparator outputs are still too weak to justify a claim.
 - The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness.
 - `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set.
 
diff --git a/docs/capabilities.md b/docs/capabilities.md
@@ -75,7 +75,7 @@ Shared selector inputs:
 
 | Tool                    | Input                                                                                                   | Output                                                                                                                                                                                                                                                                                                                                                                                          |
 | ----------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `search_codebase`       | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`, `relationships`, `hints`) + `searchQuality` + decision card (`ready`, `nextAction`, `patterns`, `bestExample`, `impact`, `whatWouldHelp`) when `intent="edit"`. Hints capped at 3 per category.                                                                                                                  |
+| `search_codebase`       | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Compact mode returns a bounded result set with `file`, `summary`, `score`, lightweight structural metadata (`symbol`, `symbolKind`, `scope`, `signaturePreview`), and `searchQuality` (`status`, `confidence`, optional `hint`, `tokenEstimate`, `warning`, `rerankerStatus`). Full mode adds richer relationships plus chunk-level `imports`, `exports`, and `complexity`. When `intent="edit"`, a decision card is returned with `ready`, `patterns`, `bestExample`, `impact`, and `whatWouldHelp`. |
 | `get_team_patterns`     | optional `category`, shared `project`/`project_directory`                                               | Pattern frequencies, trends, golden files, conflicts                                                                                                                                                                                                                                                                                                                                            |
 | `get_symbol_references` | `symbol`, optional `limit`, shared `project`/`project_directory`                                        | Concrete symbol usage evidence: `usageCount` + top usage snippets + `confidence` + `isComplete`. `confidence: "syntactic"` means static/source-based only (no runtime or dynamic dispatch). When Tree-sitter + file content are available, comments and string literals are excluded from the scan — the count reflects real identifier nodes only. Replaces the removed `get_component_usage`. |
 | `remember`              | `type`, `category`, `memory`, `reason`, shared `project`/`project_directory`                            | Persists to `.codebase-context/memory.json`                                                                                                                                                                                                                                                                                                                                                     |
@@ -184,6 +184,8 @@ Ordered by execution:
 9. **Symbol-level deduplication** — within each `symbolPath` group, keep only the highest-scoring chunk (prevents duplicate methods from the same class clogging results).
 10. **Stage-2 reranking** — cross-encoder (`Xenova/ms-marco-MiniLM-L-6-v2`) triggers when the scores of the top files are very close. CPU-only, top-10 bounded.
 11. **Result enrichment** — compact type (`componentType:layer`), pattern momentum (`trend` Rising/Declining only, Stable omitted), `patternWarning`, condensed relationships (`importedByCount`/`hasTests`), structured hints (capped callers/consumers/tests ranked by frequency), scope header for symbol-aware snippets (`// ClassName.methodName`), related memories (capped to 3), search quality assessment with `hint` when low confidence.
+12. **Payload budgeting** — final serialized search responses include `searchQuality.tokenEstimate`; warnings only appear above the 4K-token threshold and differ between compact and full mode.
+13. **Full-mode chunk metadata** — when available, full-mode results surface chunk-level `imports` (top 5), `exports` (top 5), and cyclomatic `complexity`.
 
 ### Defaults