|
| 1 | +# Discovery Benchmark |
| 2 | + |
| 3 | +This page documents the current public proof slice for `v2.0.0`. |
| 4 | +It is a discovery benchmark, not an implementation-quality benchmark. |
| 5 | + |
| 6 | +## Scope |
| 7 | + |
| 8 | +- Frozen fixtures: |
| 9 | + - `tests/fixtures/discovery-angular-spotify.json` |
| 10 | + - `tests/fixtures/discovery-excalidraw.json` |
| 11 | + - `tests/fixtures/discovery-benchmark-protocol.json` |
| 12 | +- Frozen repos used in the current proof run: |
| 13 | + - `repos/angular-spotify` |
| 14 | + - `repos/excalidraw` |
| 15 | +- Current gate artifact: |
| 16 | + - `results/gate-evaluation.json` |
| 17 | +- Comparator evidence: |
| 18 | + - `results/comparator-evidence.json` |
| 19 | + |
| 20 | +## How To Reproduce |
| 21 | + |
| 22 | +Regenerate the repo-local proof artifacts from the current `master` checkout: |
| 23 | + |
| 24 | +```bash |
| 25 | +node scripts/run-eval.mjs repos/angular-spotify --mode=discovery --fixture-a=tests/fixtures/discovery-angular-spotify.json --skip-reindex --output=results/codebase-context-angular-spotify.json |
| 26 | +node scripts/run-eval.mjs repos/excalidraw --mode=discovery --fixture-a=tests/fixtures/discovery-excalidraw.json --skip-reindex --output=results/codebase-context-excalidraw.json |
| 27 | +node scripts/benchmark-comparators.mjs --repos repos/angular-spotify,repos/excalidraw --output results/comparator-evidence.json |
| 28 | +node scripts/run-eval.mjs repos/angular-spotify repos/excalidraw --mode=discovery --fixture-a=tests/fixtures/discovery-angular-spotify.json --fixture-b=tests/fixtures/discovery-excalidraw.json --competitor-results=results/comparator-evidence.json --skip-reindex --output=results/gate-evaluation.json |
| 29 | +``` |
| 30 | + |
| 31 | +## Current Result |
| 32 | + |
| 33 | +From `results/gate-evaluation.json`: |
| 34 | + |
| 35 | +- `status`: `pending_evidence` |
| 36 | +- `suiteStatus`: `complete` |
| 37 | +- `claimAllowed`: `false` |
| 38 | +- `totalTasks`: `24` |
| 39 | +- `averageUsefulness`: `0.75` |
| 40 | +- `averageEstimatedTokens`: `903.7083333333334` |
| 41 | +- `bestExampleUsefulnessRate`: `0.125` |
| 42 | + |
| 43 | +Repo-level outputs from the same rerun: |
| 44 | + |
| 45 | +| Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness | |
| 46 | +| --- | ---: | ---: | ---: | ---: | |
| 47 | +| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 | |
| 48 | +| `excalidraw` | 12 | 0.6667 | 726.75 | 0 | |
| 49 | + |
| 50 | +## Gate Truth |
| 51 | + |
| 52 | +The gate is intentionally still blocked. |
| 53 | + |
| 54 | +- The combined suite now covers both public repos. |
| 55 | +- The release claim is still disallowed because comparator evidence remains incomplete. |
| 56 | +- Missing evidence currently includes: |
| 57 | + - raw Claude Code baseline metrics |
| 58 | + - GrepAI metrics |
| 59 | + - jCodeMunch metrics |
| 60 | + - codebase-memory-mcp metrics |
| 61 | + - CodeGraphContext metrics |
| 62 | + |
| 63 | +## Comparator Reality |
| 64 | + |
| 65 | +The current comparator artifact records setup failures, not benchmark wins. |
| 66 | + |
| 67 | +| Comparator | Status | Current reason | |
| 68 | +| --- | --- | --- | |
| 69 | +| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer | |
| 70 | +| `jCodeMunch` | `setup_failed` | MCP server closes during startup | |
| 71 | +| `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present | |
| 72 | +| `CodeGraphContext` | `setup_failed` | MCP server closes during startup | |
| 73 | +| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment | |
| 74 | + |
| 75 | +`CodeGraphContext` is explicitly part of the frozen comparison frame. It stays in the published results even though its lane still fails to start. |
| 76 | + |
| 77 | +## Important Limitations |
| 78 | + |
| 79 | +- This benchmark measures discovery usefulness and payload cost only. |
| 80 | +- It does not measure implementation correctness, patch quality, or end-to-end task completion. |
| 81 | +- Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`. |
| 82 | +- The reranker cache is currently corrupted on this machine. During the proof rerun, search hit `Protobuf parsing failed` and fell back to the original result ordering; the harness still ran to completion. |
| 83 | +- `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set. |
| 84 | + |
| 85 | +## What This Proof Can Support |
| 86 | + |
| 87 | +- It can support claims about the shipped discovery surfaces and their current measured outputs on the frozen public tasks. |
| 88 | +- It can support claims that the proof gate is still blocked by comparator evidence. |
| 89 | +- It cannot support claims that `codebase-context` beats the named comparators today. |
| 90 | +- It cannot support claims about edit success, code quality, or implementation speed. |
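The combined figures under "Current Result" can be cross-checked against the repo-level table with a small sketch. It assumes the suite-level values are task-weighted means of the per-repo rows; that aggregation rule is an assumption and not confirmed logic from `scripts/run-eval.mjs`. The names `repoRows`, `weightedMean`, and `combine` are hypothetical and exist only for illustration.

```javascript
// Per-repo rows copied from the "Repo-level outputs" table in this document.
const repoRows = [
  { repo: "angular-spotify", tasks: 12, usefulness: 0.8333, tokens: 1080.6667, bestExample: 0.25 },
  { repo: "excalidraw", tasks: 12, usefulness: 0.6667, tokens: 726.75, bestExample: 0 },
];

// Task-weighted mean of one numeric field across repos.
// ASSUMPTION: the gate aggregates this way; the real script may differ.
function weightedMean(rows, field) {
  const totalTasks = rows.reduce((sum, row) => sum + row.tasks, 0);
  const weightedSum = rows.reduce((sum, row) => sum + row[field] * row.tasks, 0);
  return weightedSum / totalTasks;
}

// Combine per-repo rows into suite-level figures, reusing the field names
// that appear in the "Current Result" list.
function combine(rows) {
  return {
    totalTasks: rows.reduce((sum, row) => sum + row.tasks, 0),
    averageUsefulness: weightedMean(rows, "usefulness"),
    averageEstimatedTokens: weightedMean(rows, "tokens"),
    bestExampleUsefulnessRate: weightedMean(rows, "bestExample"),
  };
}

console.log(combine(repoRows));
```

Under that assumption, the computed values agree with the reported `totalTasks`, `averageUsefulness`, `averageEstimatedTokens`, and `bestExampleUsefulnessRate` to within the four-decimal rounding used in the table.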