1 | 1 | # Discovery Benchmark |
2 | 2 |
3 | | -This page documents the current public proof slice for `v2.0.0`. |
| 3 | +This page documents the current public discovery proof from the checked-in result artifacts on `master`. |
4 | 4 | It is a discovery benchmark, not an implementation-quality benchmark. |
5 | 5 |
6 | 6 | ## Scope |
@@ -37,48 +37,44 @@ From `results/gate-evaluation.json`: |
37 | 37 | - `claimAllowed`: `false` |
38 | 38 | - `totalTasks`: `24` |
39 | 39 | - `averageUsefulness`: `0.75` |
40 | | -- `averageEstimatedTokens`: `903.7083333333334` |
| 40 | +- `averageEstimatedTokens`: `1822.25` |
41 | 41 | - `bestExampleUsefulnessRate`: `0.125` |
42 | 42 |
43 | 43 | Repo-level outputs from the same rerun: |
44 | 44 |
45 | 45 | | Repo | Tasks | Avg usefulness | Avg estimated tokens | Best-example usefulness | |
46 | 46 | | --- | ---: | ---: | ---: | ---: | |
47 | | -| `angular-spotify` | 12 | 0.8333 | 1080.6667 | 0.25 | |
48 | | -| `excalidraw` | 12 | 0.6667 | 726.75 | 0 | |
| 47 | +| `angular-spotify` | 12 | 0.8333 | 2138.4167 | 0.25 | |
| 48 | +| `excalidraw` | 12 | 0.6667 | 1506.0833 | 0 | |
49 | 49 |
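
The gate-level figures are consistent with a task-weighted average of the two repo rows. The sketch below checks that arithmetic using the post-rerun values from the table. It is not the harness's own aggregation code: the field names and the weighting rule are assumptions made for illustration, and the inputs are the rounded table values rather than the raw artifact.

```python
# Repo-level rows copied from the table above (post-rerun values, rounded).
# The dictionary keys are hypothetical names chosen for this sketch.
repo_rows = [
    {"repo": "angular-spotify", "tasks": 12, "usefulness": 0.8333,
     "tokens": 2138.4167, "best_example": 0.25},
    {"repo": "excalidraw", "tasks": 12, "usefulness": 0.6667,
     "tokens": 1506.0833, "best_example": 0.0},
]


def weighted_average(rows, field):
    """Task-weighted mean of one field across all repo rows."""
    total_tasks = sum(row["tasks"] for row in rows)
    return sum(row[field] * row["tasks"] for row in rows) / total_tasks


# Assumed aggregation: weight each repo by its task count.
summary = {
    "totalTasks": sum(row["tasks"] for row in repo_rows),
    "averageUsefulness": round(weighted_average(repo_rows, "usefulness"), 4),
    "averageEstimatedTokens": round(weighted_average(repo_rows, "tokens"), 4),
    "bestExampleUsefulnessRate": round(weighted_average(repo_rows, "best_example"), 4),
}

if __name__ == "__main__":
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Because both repos contribute 12 tasks, the weighted and unweighted means coincide here; the weighting only matters if the task counts diverge.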
50 | 50 | ## Gate Truth |
51 | 51 |
52 | 52 | The gate is intentionally still blocked. |
53 | 53 |
54 | | -- The combined suite now covers both public repos. |
55 | | -- The release claim is still disallowed because comparator evidence remains incomplete. |
56 | | -- Missing evidence currently includes: |
57 | | - - raw Claude Code baseline metrics |
58 | | - - GrepAI metrics |
59 | | - - jCodeMunch metrics |
60 | | - - codebase-memory-mcp metrics |
61 | | - - CodeGraphContext metrics |
| 54 | +- The combined suite covers both public repos. |
| 55 | +- `claimAllowed` remains `false` because comparator evidence still does not support a benchmark-win claim. |
| 56 | +- Two comparator lanes now return `status: "ok"`, but both produce near-empty output on the frozen tasks and average `0` usefulness. |
| 57 | +- Three comparator lanes still fail setup entirely. |
62 | 58 |
63 | 59 | ## Comparator Reality |
64 | 60 |
65 | | -The current comparator artifact records setup failures, not benchmark wins. |
| 61 | +The current comparator artifact records incomplete comparator evidence, not benchmark wins. |
66 | 62 |
67 | 63 | | Comparator | Status | Current reason | |
68 | 64 | | --- | --- | --- | |
69 | | -| `codebase-memory-mcp` | `setup_failed` | Installer path still points to the external shell installer | |
70 | | -| `jCodeMunch` | `setup_failed` | MCP server closes during startup | |
| 65 | +| `codebase-memory-mcp` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and `5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence | |
| 66 | +| `jCodeMunch` | `setup_failed` | `MCP error -32000: Connection closed` | |
71 | 67 | | `GrepAI` | `setup_failed` | Local Go binary and Ollama model path not present | |
72 | | -| `CodeGraphContext` | `setup_failed` | MCP server closes during startup | |
73 | | -| `raw Claude Code` | `setup_failed` | Local `claude` CLI baseline is not installed/authenticated in this environment | |
| 68 | +| `CodeGraphContext` | `setup_failed` | `MCP error -32000: Connection closed` | |
| 69 | +| `raw Claude Code` | `ok` | Runs, but the checked-in artifact still averages `0` usefulness and only `18.5` estimated tokens per task, so it does not yet contribute meaningful benchmark evidence | |
74 | 70 |
75 | | -`CodeGraphContext` is explicitly part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start. |
| 71 | +`CodeGraphContext` remains part of the frozen comparison frame. It is not omitted from the public story just because the lane still fails to start. |
76 | 72 |
77 | 73 | ## Important Limitations |
78 | 74 |
79 | 75 | - This benchmark measures discovery usefulness and payload cost only. |
80 | 76 | - It does not measure implementation correctness, patch quality, or end-to-end task completion. |
81 | | -- Comparator setup is still environment-sensitive, so the gate remains `pending_evidence`. |
| 77 | +- Comparator setup remains environment-sensitive, and the checked-in comparator outputs are still too weak to justify a claim. |
82 | 78 | - The reranker cache is currently corrupted on this machine. During the proof rerun, search fell back to original ordering after `Protobuf parsing failed` while still completing the harness. |
83 | 79 | - `averageFirstRelevantHit` remains `null` in the current gate output because this compact response surface does not expose a comparable ranked-hit metric across the incomplete comparator set. |
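
The reranker limitation above describes a fall-back-to-original-ordering pattern. A minimal sketch of that pattern follows, assuming a reranker that is passed in as a plain callable. The function names and the simulated failure are hypothetical; the real search code and the exception type it raises are not shown in this document.

```python
from typing import Callable, List, Tuple


def rerank_with_fallback(
    results: List[str],
    rerank: Callable[[List[str]], List[str]],
) -> Tuple[List[str], bool]:
    """Return (ordered results, whether the reranker was actually used)."""
    try:
        return rerank(results), True
    except Exception:
        # Broad catch is an assumption of this sketch: any reranker failure
        # (for example a corrupted cache) keeps the original ordering.
        return list(results), False


def broken_reranker(results: List[str]) -> List[str]:
    # Simulates the failure mode reported during the proof rerun.
    raise RuntimeError("Protobuf parsing failed")


def reversing_reranker(results: List[str]) -> List[str]:
    # Stand-in for a working reranker; it simply reverses the input.
    return list(reversed(results))


hits = ["a.ts", "b.ts", "c.ts"]
fallback_hits, fallback_used = rerank_with_fallback(hits, broken_reranker)
reranked_hits, reranked_used = rerank_with_fallback(hits, reversing_reranker)
```

Returning the flag alongside the results lets a harness record that a run completed in degraded mode rather than silently treating it as a normal reranked run.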
84 | 80 |