|
1 | 1 | # Discovery Benchmark |
2 | 2 |
|
3 | | -This page documents the current public proof slice for `v2.1.0`. |
| 3 | +This page documents the current public proof slice for `v2.0.0`. |
4 | 4 | It is a discovery benchmark, not an implementation-quality benchmark. |
5 | 5 |
|
6 | 6 | ## Scope |
@@ -37,51 +37,49 @@ From `results/gate-evaluation.json`: |
37 | 37 | - `claimAllowed`: `false` |
38 | 38 | - `totalTasks`: `24` |
39 | 39 | - `averageUsefulness`: `0.75` |
40 | | -- `averagePayloadBytes`: `7287.625` |
41 | | -- `averageEstimatedTokens`: `1822.25` |
42 | | -- `averageFirstRelevantHit`: `null` |
| 40 | +- `averageEstimatedTokens`: `903.7083333333334` |
43 | 41 | - `bestExampleUsefulnessRate`: `0.125` |
44 | 42 |
|
45 | 43 | Repo-level outputs from the same rerun: |
46 | 44 |
|
47 | | -| Repo | Tasks | Avg usefulness | Avg payload bytes | Avg estimated tokens | Best-example usefulness | |
48 | | -| --- | ---: | ---: | ---: | ---: | ---: | |
49 | | -| `angular-spotify` | 12 | 0.8333 | 8553 | 2138 | 0.25 | |
50 | | -| `excalidraw` | 12 | 0.6667 | 6023 | 1506 | 0 | |
| 45 | +| Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness | |
| 46 | +| --- | ---: | ---: | ---: | ---: | |
| 47 | +| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 | |
| 48 | +| `excalidraw` | 12 | 0.6667 | 726.75 | 0 | |
51 | 49 |
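Review note: the added repo rows can be checked against the suite-level values in `results/gate-evaluation.json`. The sketch below assumes the suite-level averages are task-weighted means of the repo-level rows; that aggregation is an assumption, since the harness code is not shown in this diff. The dictionary layout and the `weighted_mean` helper are illustrative names, not part of the benchmark.

```python
# Repo-level rows copied from the added table lines.
# The key names ("tasks", "tokens", ...) are illustrative, not the harness schema.
repos = {
    "angular-spotify": {"tasks": 12, "tokens": 1080.6667, "usefulness": 0.8333, "best_example": 0.25},
    "excalidraw": {"tasks": 12, "tokens": 726.75, "usefulness": 0.6667, "best_example": 0.0},
}


def weighted_mean(rows, field):
    """Task-weighted mean of one field across repo rows (assumed aggregation)."""
    total_tasks = sum(row["tasks"] for row in rows.values())
    return sum(row["tasks"] * row[field] for row in rows.values()) / total_tasks


total_tasks = sum(row["tasks"] for row in repos.values())
avg_tokens = weighted_mean(repos, "tokens")
avg_usefulness = weighted_mean(repos, "usefulness")
best_example_rate = weighted_mean(repos, "best_example")

print(total_tasks, round(avg_tokens, 4), round(avg_usefulness, 4), round(best_example_rate, 4))
```

Under that assumption the table reproduces `totalTasks`, `averageEstimatedTokens`, `averageUsefulness`, and `bestExampleUsefulnessRate` up to the four-decimal rounding used in the table.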
|
52 | 50 | ## Gate Truth |
53 | 51 |
|
54 | 52 | The gate is intentionally still blocked. |
55 | 53 |
|
56 | | -- The combined suite covers both frozen public repos. |
| 54 | +- The combined suite now covers both public repos. |
57 | 55 | - The release claim is still disallowed because comparator evidence remains incomplete. |
58 | 56 | - Missing evidence currently includes: |
59 | 57 | - raw Claude Code baseline metrics |
60 | | - - GrepAI comparator metrics |
61 | | - - jCodeMunch comparator metrics |
62 | | - - codebase-memory-mcp comparator metrics |
63 | | - - CodeGraphContext comparator metrics |
| 58 | + - GrepAI metrics |
| 59 | + - jCodeMunch metrics |
| 60 | + - codebase-memory-mcp metrics |
| 61 | + - CodeGraphContext metrics |
64 | 62 |
|
65 | 63 | ## Comparator Reality |
66 | 64 |
|
67 | 65 | The current comparator artifact records setup failures, not benchmark wins. |
68 | 66 |
|
69 | 67 | | Comparator | Status | Current reason | |
70 | 68 | | --- | --- | --- | |
71 | | -| `codebase-memory-mcp` | `ok` | The lane now executes on this host, but the captured outputs are near-empty (`19` bytes / `5` tokens on average, `0` usefulness), so the gate still treats it as missing evidence | |
72 | | -| `jCodeMunch` | `setup_failed` | MCP handshake still closes during startup on this host (`MCP error -32000: Connection closed`) | |
73 | | -| `GrepAI` | `setup_failed` | Local Go binary and Ollama model path are not present | |
74 | | -| `CodeGraphContext` | `setup_failed` | MCP handshake still closes during startup on this host (`MCP error -32000: Connection closed`); database prerequisite remains unresolved | |
75 | | -| `raw Claude Code` | `ok` | The baseline now runs, but the captured outputs remain non-useful (`66.08` bytes / `17.17` tokens on average, `0` usefulness), so the gate still treats it as missing evidence | |
| 69 | +| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer | |
| 70 | +| `jCodeMunch` | `setup_failed` | MCP server closes during startup | |
| 71 | +| `GrepAI` | `setup_failed` | Local Go binary and Ollama model path are not present |
| 72 | +| `CodeGraphContext` | `setup_failed` | MCP server closes during startup | |
| 73 | +| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment | |
76 | 74 |
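Review note: the relationship between the comparator table and the blocked gate can be sketched as follows. This is a minimal illustration, not the harness's actual gate logic. It assumes a comparator counts as evidence only when its status is `ok`, and the `evaluate_gate` function, the `missingEvidence` key, and the `ready` status value are hypothetical names.

```python
# Comparator statuses copied from the added table rows.
statuses = {
    "codebase-memory-mcp": "setup_failed",
    "jCodeMunch": "setup_failed",
    "GrepAI": "setup_failed",
    "CodeGraphContext": "setup_failed",
    "raw Claude Code": "setup_failed",
}

# Comparators listed as required evidence under "Gate Truth".
REQUIRED = ["raw Claude Code", "GrepAI", "jCodeMunch", "codebase-memory-mcp", "CodeGraphContext"]


def evaluate_gate(statuses, required):
    """Block the claim while any required comparator lacks an `ok` run (assumed rule)."""
    missing = [name for name in required if statuses.get(name) != "ok"]
    return {
        "status": "pending_evidence" if missing else "ready",  # "ready" is hypothetical
        "claimAllowed": not missing,
        "missingEvidence": missing,
    }


gate = evaluate_gate(statuses, REQUIRED)
print(gate["status"], gate["claimAllowed"], len(gate["missingEvidence"]))
```

With every lane at `setup_failed`, the sketch yields `pending_evidence` with `claimAllowed` false, which matches the values reported on this page.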
|
77 | | -`CodeGraphContext` remains part of the comparison frame. It is not removed from the public story just because the lane still fails to start. |
| 75 | +`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start. |
78 | 76 |
|
79 | 77 | ## Important Limitations |
80 | 78 |
|
81 | 79 | - This benchmark measures discovery usefulness and payload cost only. |
82 | 80 | - It does not measure implementation correctness, patch quality, or end-to-end task completion. |
83 | 81 | - Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`. |
84 | | -- Current search payload costs are higher than the older v2.0.0 proof slice because the v2.1.0 surface now includes richer map structure and `searchQuality.tokenEstimate` advisories. |
| 82 | +- The reranker cache is currently corrupted on this machine. During the proof rerun, search hit `Protobuf parsing failed` and fell back to the original ordering; the harness still ran to completion. |
85 | 83 | - `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set. |
86 | 84 |
|
87 | 85 | ## What This Proof Can Support |
|