|
1 | 1 | # Discovery Benchmark |
2 | 2 |
|
3 | | -This page documents the current public proof slice for `v2.0.0`. |
| 3 | +This page documents the current public proof slice for `v2.1.0`. |
4 | 4 | It is a discovery benchmark, not an implementation-quality benchmark. |
5 | 5 |
|
6 | 6 | ## Scope |
@@ -37,49 +37,51 @@ From `results/gate-evaluation.json`: |
37 | 37 | - `claimAllowed`: `false` |
38 | 38 | - `totalTasks`: `24` |
39 | 39 | - `averageUsefulness`: `0.75` |
40 | | -- `averageEstimatedTokens`: `903.7083333333334` |
| 40 | +- `averagePayloadBytes`: `7287.625` |
| 41 | +- `averageEstimatedTokens`: `1822.25` |
| 42 | +- `averageFirstRelevantHit`: `null` |
41 | 43 | - `bestExampleUsefulnessRate`: `0.125` |
42 | 44 |
|
43 | 45 | Repo-level outputs from the same rerun: |
44 | 46 |
|
45 | | -| Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness | |
46 | | -| --- | ---: | ---: | ---: | ---: | |
47 | | -| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 | |
48 | | -| `excalidraw` | 12 | 0.6667 | 726.75 | 0 | |
| 47 | +| Repo | Tasks | Avg usefulness | Avg payload bytes | Avg estimated tokens | Best-example usefulness | |
| 48 | +| --- | ---: | ---: | ---: | ---: | ---: | |
| 49 | +| `angular-spotify` | 12 | 0.8333 | 8553 | 2138 | 0.25 | |
| 50 | +| `excalidraw` | 12 | 0.6667 | 6023 | 1506 | 0 | |
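
The suite-level averages and the repo-level rows can be cross-checked against each other. The following is a minimal sketch, not part of the benchmark harness: the `RepoRow` type, the field names, and the rounding tolerances are assumptions made for illustration, and the numbers are copied from the v2.1.0 values above. It recomputes each suite average as a task-weighted mean of the repo rows and allows for the rounding in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoRow:
    """One row of the repo-level table (field names are illustrative)."""
    repo: str
    tasks: int
    usefulness: float
    payload_bytes: float
    estimated_tokens: float
    best_example: float


# Values copied from the v2.1.0 repo-level table.
ROWS = [
    RepoRow("angular-spotify", 12, 0.8333, 8553, 2138, 0.25),
    RepoRow("excalidraw", 12, 0.6667, 6023, 1506, 0.0),
]

# Suite-level values reported from results/gate-evaluation.json.
SUITE = {
    "usefulness": 0.75,
    "payload_bytes": 7287.625,
    "estimated_tokens": 1822.25,
    "best_example": 0.125,
}

# Assumed tolerances: the table rounds bytes and tokens to integers
# and usefulness to four decimals.
TOLERANCE = {
    "usefulness": 1e-4,
    "payload_bytes": 0.5,
    "estimated_tokens": 0.5,
    "best_example": 1e-4,
}


def weighted_mean(rows, field):
    """Average a per-repo field, weighting each repo by its task count."""
    total_tasks = sum(r.tasks for r in rows)
    return sum(getattr(r, field) * r.tasks for r in rows) / total_tasks


def check(rows, suite, tolerance):
    """Map each suite field to True if the repo rows reproduce it."""
    return {
        field: abs(weighted_mean(rows, field) - suite[field]) <= tolerance[field]
        for field in suite
    }


if __name__ == "__main__":
    for field, ok in check(ROWS, SUITE, TOLERANCE).items():
        print(f"{field}: {'consistent' if ok else 'MISMATCH'}")
```

With the values above, every field agrees within rounding. The ratio of average payload bytes to average estimated tokens also comes out close to 4, which is an observation from these numbers rather than a documented property of the estimator.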
49 | 51 |
|
50 | 52 | ## Gate Truth |
51 | 53 |
|
52 | 54 | The gate is intentionally still blocked. |
53 | 55 |
|
54 | | -- The combined suite now covers both public repos. |
| 56 | +- The combined suite covers both frozen public repos. |
55 | 57 | - The release claim is still disallowed because comparator evidence remains incomplete. |
56 | 58 | - Missing evidence currently includes: |
57 | 59 | - raw Claude Code baseline metrics |
58 | | - - GrepAI metrics |
59 | | - - jCodeMunch metrics |
60 | | - - codebase-memory-mcp metrics |
61 | | - - CodeGraphContext metrics |
| 60 | + - GrepAI comparator metrics |
| 61 | + - jCodeMunch comparator metrics |
| 62 | + - codebase-memory-mcp comparator metrics |
| 63 | + - CodeGraphContext comparator metrics |
62 | 64 |
|
63 | 65 | ## Comparator Reality |
64 | 66 |
|
65 | 67 | The current comparator artifact records setup failures, not benchmark wins. |
66 | 68 |
|
67 | 69 | | Comparator | Status | Current reason | |
68 | 70 | | --- | --- | --- | |
69 | | -| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer | |
70 | | -| `jCodeMunch` | `setup_failed` | MCP server closes during startup | |
71 | | -| `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present | |
72 | | -| `CodeGraphContext` | `setup_failed` | MCP server closes during startup | |
73 | | -| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment | |
| 71 | +| `codebase-memory-mcp` | `ok` | The lane now executes on this host, but the captured outputs are near-empty (`19` bytes / `5` tokens on average, `0` usefulness), so the gate still treats it as missing evidence | |
| 72 | +| `jCodeMunch` | `setup_failed` | MCP handshake still closes during startup on this host (`MCP error -32000: Connection closed`) | |
| 73 | +| `GrepAI` | `setup_failed` | Local Go binary and Ollama model path are not present | |
| 74 | +| `CodeGraphContext` | `setup_failed` | MCP handshake still closes during startup on this host (`MCP error -32000: Connection closed`); database prerequisite remains unresolved | |
| 75 | +| `raw Claude Code` | `ok` | The baseline now runs, but the captured outputs remain non-useful (`66.08` bytes / `17.17` tokens on average, `0` usefulness), so the gate still treats it as missing evidence | |
74 | 76 |
|
75 | | -`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start. |
| 77 | +`CodeGraphContext` remains part of the comparison frame. It is not removed from the public story just because the lane still fails to start. |
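
The table shows why an `ok` status alone does not unblock the gate. The sketch below is an assumption inferred from the "Current reason" column, not the harness's actual rule: it treats a lane as evidence only when it ran (`ok`) and produced a non-zero average usefulness. The `Lane` type and `counts_as_evidence` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lane:
    """One comparator lane (hypothetical shape, for illustration only)."""
    name: str
    status: str
    avg_usefulness: Optional[float] = None  # None when the lane never ran


def counts_as_evidence(lane: Lane) -> bool:
    """Assumed rule: the lane must run and return something useful."""
    return (
        lane.status == "ok"
        and lane.avg_usefulness is not None
        and lane.avg_usefulness > 0
    )


# Statuses and usefulness values copied from the comparator table.
LANES = [
    Lane("codebase-memory-mcp", "ok", 0.0),
    Lane("jCodeMunch", "setup_failed"),
    Lane("GrepAI", "setup_failed"),
    Lane("CodeGraphContext", "setup_failed"),
    Lane("raw Claude Code", "ok", 0.0),
]

missing = [lane.name for lane in LANES if not counts_as_evidence(lane)]
claim_allowed = not missing  # any missing lane keeps the claim blocked

if __name__ == "__main__":
    print("missing evidence:", ", ".join(missing))
    print("claimAllowed:", claim_allowed)
```

Under this assumed rule all five lanes still count as missing evidence, which matches the list under Gate Truth and the reported `claimAllowed` value of `false`.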
76 | 78 |
|
77 | 79 | ## Important Limitations |
78 | 80 |
|
79 | 81 | - This benchmark measures discovery usefulness and payload cost only. |
80 | 82 | - It does not measure implementation correctness, patch quality, or end-to-end task completion. |
81 | 83 | - Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`. |
82 | | -- The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness. |
| 84 | +- Current search payload costs are higher than the older v2.0.0 proof slice because the v2.1.0 surface now includes richer map structure and `searchQuality.tokenEstimate` advisories. |
83 | 85 | - `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set. |
84 | 86 |
|
85 | 87 | ## What This Proof Can Support |
|