Commit 9f6170e

docs: normalize release truth before release-please cut (#102)
1 parent 922f9fc commit 9f6170e

4 files changed

Lines changed: 20 additions & 48 deletions

CHANGELOG.md

Lines changed: 0 additions & 26 deletions
@@ -2,32 +2,6 @@
 
 ## Unreleased
 
-## [2.1.0](https://github.com/PatrickSys/codebase-context/compare/v1.9.0...v2.1.0) (2026-04-13)
-
-### Features
-
-- **search:** surface chunk intelligence directly in `search_codebase` results, including symbol identity, scope, signature preview, and compact/full response budgeting
-- **map:** upgrade the conventions map with structural skeleton sections and add `map --export` so the compact map can be written to `CODEBASE_MAP.md`
-- **mcp:** rework multi-project routing so one MCP server can serve multiple projects instead of one hardcoded server entry per repo
-- **mcp:** keep explicit `project` as the fallback when the client does not provide enough project context
-- **mcp:** accept repo paths, subproject paths, and file paths as `project` selectors when routing is ambiguous
-
-### Bug Fixes
-
-- **metadata:** require real dependency evidence plus multiple framework indicators before labeling a repo as Next.js or another specialized framework
-- **reranker:** auto-heal corrupted cross-encoder cache entries and surface degraded reranker state in `searchQuality.rerankerStatus`
-- **benchmarks:** harden comparator lanes for cross-platform execution and keep setup failures explicit instead of silently turning them into claims
-- **search:** auto-heal on corrupted index now triggers a background rebuild instead of blocking the search response
-
-### Documentation
-
-- publish the v2.1.0 discovery benchmark rerun with the current gate output: `pending_evidence`, `claimAllowed: false`, `24` frozen tasks, `0.75` average usefulness, and `1822.25` average estimated tokens
-- document the current comparator truth instead of stale assumptions: the public proof still has setup failures plus near-empty comparator outputs on this host, so benchmark win claims remain blocked
-- note the new `searchQuality.tokenEstimate` advisory contract: estimates are based on the final serialized response payload and warnings only appear above the 4K-token threshold
-- simplify the setup story around a roots-first contract: roots-capable multi-project sessions, single-project fallback, and explicit `project` retries
-- clarify that issue #63 fixed the architecture and workspace-aware workflow, but issue #2 is still only partially solved when the client does not provide roots or active-project context
-- remove the repo-local `init` / marker-file story from the public setup guidance
-
 ## [1.9.0](https://github.com/PatrickSys/codebase-context/compare/v1.8.2...v1.9.0) (2026-03-19)
 
 

README.md

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@ Here's what codebase-context does:
 
 One tool call returns all of it. Local-first - your code never leaves your machine by default.
 
-See the [v2.0.0 benchmark](./docs/benchmark.md) for the discovery suite results and current gate truth.
+See the [current discovery benchmark](./docs/benchmark.md) for the checked-in proof results and current gate truth.
 
 ### What it looks like
 
@@ -224,7 +224,7 @@ These are the behaviors that make the most difference day-to-day. Copy, trim wha
 
 ## Links
 
-- [Benchmark](./docs/benchmark.md) — v2.0.0 discovery suite results and gate truth
+- [Benchmark](./docs/benchmark.md) — current discovery suite results and gate truth
 - [Demo](./docs/demo.md) — real CLI walkthrough
 - [Client Setup](./docs/client-setup.md) — per-client config, HTTP setup, local build testing
 - [Capabilities Reference](./docs/capabilities.md) — tool API, retrieval pipeline, decision card schema

docs/benchmark.md

Lines changed: 15 additions & 19 deletions
@@ -1,6 +1,6 @@
 # Discovery Benchmark
 
-This page documents the current public proof slice for `v2.0.0`.
+This page documents the current public discovery proof from the checked-in result artifacts on `master`.
 It is a discovery benchmark, not an implementation-quality benchmark.
 
 ## Scope
@@ -37,48 +37,44 @@ From `results/gate-evaluation.json`:
 - `claimAllowed`: `false`
 - `totalTasks`: `24`
 - `averageUsefulness`: `0.75`
-- `averageEstimatedTokens`: `903.7083333333334`
+- `averageEstimatedTokens`: `1822.25`
 - `bestExampleUsefulnessRate`: `0.125`
 
 Repo-level outputs from the same rerun:
 
 | Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness |
 | --- | ---: | ---: | ---: | ---: |
-| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 |
-| `excalidraw` | 12 | 0.6667 | 726.75 | 0 |
+| `angular-spotify` | 12 | 0.8333 | 2138.4167 | 0.25 |
+| `excalidraw` | 12 | 0.6667 | 1506.0833 | 0 |
 
 ## Gate Truth
 
 The gate is intentionally still blocked.
 
-- The combined suite now covers both public repos.
-- The release claim is still disallowed because comparator evidence remains incomplete.
-- Missing evidence currently includes:
-  - raw Claude Code baseline metrics
-  - GrepAI metrics
-  - jCodeMunch metrics
-  - codebase-memory-mcp metrics
-  - CodeGraphContext metrics
+- The combined suite covers both public repos.
+- `claimAllowed` remains `false` because comparator evidence still does not support a benchmark-win claim.
+- Two comparator lanes now return `status: "ok"`, but both are effectively near-empty on the frozen tasks and contribute `0` average usefulness.
+- Three comparator lanes still fail setup entirely.
 
 ## Comparator Reality
 
-The current comparator artifact records setup failures, not benchmark wins.
+The current comparator artifact records incomplete comparator evidence, not benchmark wins.
 
 | Comparator | Status | Current reason |
 | --- | --- | --- |
-| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer |
-| `jCodeMunch` | `setup_failed` | MCP server closes during startup |
+| `codebase-memory-mcp` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and `5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |
+| `jCodeMunch` | `setup_failed` | `MCP error -32000: Connection closed` |
 | `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present |
-| `CodeGraphContext` | `setup_failed` | MCP server closes during startup |
-| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment |
+| `CodeGraphContext` | `setup_failed` | `MCP error -32000: Connection closed` |
+| `raw Claude Code` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and only `18.5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence |
 
-`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
+`CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start.
 
 ## Important Limitations
 
 - This benchmark measures discovery usefulness and payload cost only.
 - It does not measure implementation correctness, patch quality, or end-to-end task completion.
-- Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`.
+- Comparator setup remains environment-sensitive, and the checked-in comparator outputs are still too weak to justify a claim.
 - The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness.
 - `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set.
 

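The updated suite-level numbers in the `docs/benchmark.md` hunk can be cross-checked against the two repo rows. The sketch below assumes the gate aggregates are task-weighted means of the per-repo averages; the diff does not state this, and the `RepoRow` type and `weightedMean` helper are hypothetical names introduced only for this check. The row values are copied from the added (`+`) table rows, which are rounded to four decimals, so the comparison uses a small tolerance.

```typescript
// Per-repo rows copied from the updated table in docs/benchmark.md.
interface RepoRow {
  repo: string;
  tasks: number;
  avgUsefulness: number;
  avgEstimatedTokens: number;
  bestExampleUsefulness: number;
}

const rows: RepoRow[] = [
  { repo: "angular-spotify", tasks: 12, avgUsefulness: 0.8333, avgEstimatedTokens: 2138.4167, bestExampleUsefulness: 0.25 },
  { repo: "excalidraw", tasks: 12, avgUsefulness: 0.6667, avgEstimatedTokens: 1506.0833, bestExampleUsefulness: 0 },
];

// Hypothetical helper: task-weighted mean of one numeric column.
// Assumption: the real gate script may aggregate differently.
function weightedMean(rs: RepoRow[], pick: (r: RepoRow) => number): number {
  const totalTasks = rs.reduce((n, r) => n + r.tasks, 0);
  const weighted = rs.reduce((sum, r) => sum + pick(r) * r.tasks, 0);
  return weighted / totalTasks;
}

const suite = {
  totalTasks: rows.reduce((n, r) => n + r.tasks, 0),
  averageUsefulness: weightedMean(rows, (r) => r.avgUsefulness),
  averageEstimatedTokens: weightedMean(rows, (r) => r.avgEstimatedTokens),
  bestExampleUsefulnessRate: weightedMean(rows, (r) => r.bestExampleUsefulness),
};

// True when a recomputed value matches the documented one within rounding error.
function closeTo(actual: number, documented: number, tolerance = 1e-3): boolean {
  return Math.abs(actual - documented) < tolerance;
}

console.log(suite);
```

Under that assumption the recomputed values agree with the documented `24` tasks, `0.75` average usefulness, `1822.25` average estimated tokens, and `0.125` best-example rate. The removed rows (`1080.6667` and `726.75`) are likewise consistent with the old `903.7083333333334` figure.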
docs/capabilities.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -75,7 +75,7 @@ Shared selector inputs:
7575

7676
| Tool | Input | Output |
7777
| ----------------------- | ------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
78-
| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Ranked results (`file`, `summary`, `score`, `type`, `trend`, `patternWarning`, `relationships`, `hints`) + `searchQuality` + decision card (`ready`, `nextAction`, `patterns`, `bestExample`, `impact`, `whatWouldHelp`) when `intent="edit"`. Hints capped at 3 per category. |
78+
| `search_codebase` | `query`, optional `intent`, `limit`, `filters`, `includeSnippets`, shared `project`/`project_directory` | Compact mode returns a bounded result set with `file`, `summary`, `score`, lightweight structural metadata (`symbol`, `symbolKind`, `scope`, `signaturePreview`), and `searchQuality` (`status`, `confidence`, optional `hint`, `tokenEstimate`, `warning`, `rerankerStatus`). Full mode adds richer relationships plus chunk-level `imports`, `exports`, and `complexity`. When `intent="edit"`, a decision card is returned with `ready`, `patterns`, `bestExample`, `impact`, and `whatWouldHelp`. |
7979
| `get_team_patterns` | optional `category`, shared `project`/`project_directory` | Pattern frequencies, trends, golden files, conflicts |
8080
| `get_symbol_references` | `symbol`, optional `limit`, shared `project`/`project_directory` | Concrete symbol usage evidence: `usageCount` + top usage snippets + `confidence` + `isComplete`. `confidence: "syntactic"` means static/source-based only (no runtime or dynamic dispatch). When Tree-sitter + file content are available, comments and string literals are excluded from the scan — the count reflects real identifier nodes only. Replaces the removed `get_component_usage`. |
8181
| `remember` | `type`, `category`, `memory`, `reason`, shared `project`/`project_directory` | Persists to `.codebase-context/memory.json` |
@@ -184,6 +184,8 @@ Ordered by execution:
184184
9. **Symbol-level deduplication** — within each `symbolPath` group, keep only the highest-scoring chunk (prevents duplicate methods from same class clogging results).
185185
10. **Stage-2 reranking** — cross-encoder (`Xenova/ms-marco-MiniLM-L-6-v2`) triggers when the score between the top files are very close. CPU-only, top-10 bounded.
186186
11. **Result enrichment** — compact type (`componentType:layer`), pattern momentum (`trend` Rising/Declining only, Stable omitted), `patternWarning`, condensed relationships (`importedByCount`/`hasTests`), structured hints (capped callers/consumers/tests ranked by frequency), scope header for symbol-aware snippets (`// ClassName.methodName`), related memories (capped to 3), search quality assessment with `hint` when low confidence.
187+
12. **Payload budgeting** — final serialized search responses include `searchQuality.tokenEstimate`; warnings only appear above the 4K-token threshold and differ between compact and full mode.
188+
13. **Full-mode chunk metadata** — when available, full-mode results surface chunk-level `imports` (top 5), `exports` (top 5), and cyclomatic `complexity`.
187189

188190
### Defaults
189191

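The compact-mode output and the payload budgeting step described in the `docs/capabilities.md` hunk can be pictured roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the field names come from the updated table row, but every value type, the `results` wrapper key, the four-characters-per-token heuristic, the reading of "4K" as `4000`, the warning texts, and the helper names `estimateTokens` and `withTokenBudget` are hypothetical.

```typescript
type Mode = "compact" | "full";

// Field names are from the capabilities table; the value types are guesses.
interface SearchQuality {
  status: string;
  confidence: number | string;
  hint?: string;
  tokenEstimate: number;
  warning?: string;
  rerankerStatus?: string;
}

interface CompactResult {
  file: string;
  summary: string;
  score: number;
  symbol?: string;
  symbolKind?: string;
  scope?: string;
  signaturePreview?: string;
}

// Assumption: the wrapper key `results` is illustrative only.
interface SearchResponse {
  results: CompactResult[];
  searchQuality: SearchQuality;
}

const TOKEN_WARNING_THRESHOLD = 4000; // assumption: "4K" read as 4000 tokens
const CHARS_PER_TOKEN = 4; // assumption: rough heuristic, not the real estimator

function estimateTokens(serialized: string): number {
  return Math.ceil(serialized.length / CHARS_PER_TOKEN);
}

// Attach an estimate based on the serialized payload and warn only above the threshold.
// The estimate is taken before the new fields are written back, so it is approximate.
function withTokenBudget(response: SearchResponse, mode: Mode): SearchResponse {
  const tokenEstimate = estimateTokens(JSON.stringify(response));
  const searchQuality: SearchQuality = { ...response.searchQuality, tokenEstimate };
  if (tokenEstimate > TOKEN_WARNING_THRESHOLD) {
    // Hypothetical messages; the diff only says they differ by mode.
    searchQuality.warning =
      mode === "compact"
        ? "Large compact response; consider lowering the limit."
        : "Large full response; consider compact mode.";
  } else {
    delete searchQuality.warning;
  }
  return { ...response, searchQuality };
}

const small = withTokenBudget(
  {
    results: [{ file: "src/a.ts", summary: "short summary", score: 0.9 }],
    searchQuality: { status: "ok", confidence: 0.8, tokenEstimate: 0 },
  },
  "compact",
);

const large = withTokenBudget(
  {
    results: [{ file: "src/b.ts", summary: "x".repeat(20000), score: 0.7 }],
    searchQuality: { status: "ok", confidence: 0.8, tokenEstimate: 0 },
  },
  "full",
);

console.log(small.searchQuality, large.searchQuality.warning);
```

In this sketch the small payload carries an estimate with no warning, while the 20,000-character payload crosses the assumed threshold and receives the full-mode message.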