
Commit bc809f9

Add SourceBench leaderboard site
1 parent 0341855 commit bc809f9

11 files changed

Lines changed: 36718 additions & 0 deletions
Lines changed: 263 additions & 0 deletions
@@ -0,0 +1,263 @@
# Official Submission Contract v1

This document defines what counts as a valid submission for the official SourceBench leaderboard.

## Goal

The official leaderboard is intended to be:

- comparable across systems
- resistant to benchmark-specific prompt tuning on public queries
- reproducible
- auditable

For that reason, official ranking is based on:

- hidden holdout queries
- fixed judge model version
- fixed post-processing and metric code
- server-side execution by the SourceBench team

## Evaluation modes

### 1. Open evaluation

Anyone can run the public pipeline locally.

Open evaluation:

- uses the public query split
- uses open-source code
- is suitable for debugging and informal comparison
- does **not** qualify a model for the official leaderboard by itself

### 2. Official evaluation

Official evaluation is run by the SourceBench team.

Official evaluation:

- uses hidden holdout queries
- uses the fixed official judging configuration
- is the only path to appearing on the official leaderboard

Current split target for `v1`:

- public: `65`
- holdout: `35`
- balanced by query type (`13 public + 7 holdout` per type)

## Accepted official submission types
### Preferred submission type: endpoint submission

Participants provide:

- `model_name`
- `api_base`
- `api_key`
- `api_format`
  - must be OpenAI-compatible for v1
- `generation_config`
  - optional structured config
- `system_prompt`
  - optional
- `notes`
  - optional

This is the preferred mode because the SourceBench team can run the model end-to-end on the hidden holdout set.

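As a rough illustration, the sketch below shows how one holdout query could be sent to a submitted endpoint, assuming a standard OpenAI-compatible `/chat/completions` route. The function name and field access are illustrative only and do not describe the actual Stage 1 adapter.

```python
# Illustrative sketch only; not the official pipeline code.
import requests

def query_submitted_endpoint(submission: dict, query_text: str) -> str:
    """Send one holdout query to the submitted endpoint and return the answer text."""
    messages = []
    if submission.get("system_prompt"):
        messages.append({"role": "system", "content": submission["system_prompt"]})
    messages.append({"role": "user", "content": query_text})

    response = requests.post(
        f"{submission['api_base'].rstrip('/')}/chat/completions",
        headers={"Authorization": f"Bearer {submission['api_key']}"},
        json={
            "model": submission["model_name"],
            "messages": messages,
            # generation_config is passed through as extra request parameters
            **submission.get("generation_config", {}),
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```
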
### Fallback submission type: answer + cited URL submission

If endpoint access cannot be provided, participants may instead submit, per query:

- `query_id`
- `answer_text`
- `cited_urls`
- `raw_response`
  - optional but preferred
- `model_name`

In this fallback mode:

- SourceBench still runs scraping/post-processing
- SourceBench still runs Qwen judging
- SourceBench still computes metrics

### Submission types not valid for official ranking

These are not accepted as final official submissions:

- pre-scraped source text only
- participant-postprocessed source objects only
- participant-computed judge scores only
- participant-computed final metrics only

These may be useful for debugging, but they are not sufficient for official ranking because too much of the standardized pipeline has already been bypassed.

## Required participant fields

Minimum metadata for every official submission:

- `submitter_name`
- `organization`
  - optional
- `contact_email`
- `model_name`
- `model_version`
  - optional but strongly preferred
- `web_search_mode`
  - e.g. built-in search
- `submission_mode`
  - `endpoint` or `answer_url_bundle`
- `agrees_to_reproducibility_policy`
  - boolean

## Official v1 assumptions

Version `v1` assumes:

- fixed public split
- fixed hidden holdout split
- fixed Qwen judge model/version
- fixed scoring prompts
- fixed metrics code
- built-in web-search GEs only

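One lightweight way to make these assumptions explicit is to pin them in a single frozen configuration object that every official run loads. The sketch below is illustrative only; the field names and values are placeholders, not the real pipeline constants.

```python
# Illustrative sketch of a pinned official v1 configuration; values are placeholders.
OFFICIAL_V1_CONFIG = {
    "split": {"public": 65, "holdout": 35, "public_per_type": 13, "holdout_per_type": 7},
    "judge": {"family": "Qwen", "version": "fixed-official-version", "prompt_version": "v1"},
    "metrics_code_version": "v1",
    "allowed_web_search_modes": ["built-in-search"],
}

def check_submission_against_v1(submission: dict) -> None:
    """Reject submissions that fall outside the pinned v1 assumptions."""
    if submission.get("web_search_mode") not in OFFICIAL_V1_CONFIG["allowed_web_search_modes"]:
        raise ValueError("v1 only accepts built-in web-search submissions")
```
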
## What SourceBench runs server-side

For endpoint submissions:

1. Send hidden holdout queries to the submitted endpoint.
2. Capture answer text and cited URLs.
3. Scrape and normalize source pages.
4. Run the fixed Qwen judging step.
5. Compute leaderboard metrics.
6. Store the resulting official artifact.

For answer + cited URL submissions:

1. Validate the submitted answer bundle.
2. Scrape and normalize cited URLs.
3. Run the fixed Qwen judging step.
4. Compute leaderboard metrics.
5. Store the resulting official artifact.

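Both sequences could be expressed as a single server-side driver. The sketch below only illustrates the ordering of stages; the stage functions are passed in as callables because their real names and signatures live in the pipeline code, not in this document.

```python
# Illustrative driver sketch; stage implementations are supplied by the pipeline.
from typing import Callable

def run_official_evaluation(
    submission: dict,
    holdout_queries: list[dict],
    *,
    collect_answers: Callable,   # endpoint mode: query endpoint, capture answers + cited URLs
    validate_bundle: Callable,   # fallback mode: check the submitted answer bundle
    scrape_sources: Callable,    # scrape and normalize cited URLs
    run_qwen_judge: Callable,    # fixed Qwen judging step
    compute_metrics: Callable,   # leaderboard metrics
    store_artifact: Callable,    # persist the official artifact
) -> dict:
    if submission["submission_mode"] == "endpoint":
        runs = collect_answers(submission, holdout_queries)
    else:
        runs = validate_bundle(submission["runs"], holdout_queries)

    sources = scrape_sources(runs)
    judgments = run_qwen_judge(runs, sources)
    metrics = compute_metrics(judgments)
    return store_artifact(submission, runs, sources, judgments, metrics)
```
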
## Validation checks

The official pipeline should reject or flag submissions with:

- malformed endpoint config
- missing cited URLs
- empty answer text
- empty or inaccessible source pages
- too few valid sources after scraping
- invalid schema
- judge input mismatch
- repeated or obviously degenerate outputs

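A few of these checks are simple enough to sketch directly. The field names below follow the answer-bundle schema in this document; the `scraped_sources` field and the minimum-source threshold are placeholders for whatever the real pipeline produces.

```python
# Illustrative per-run validation sketch; thresholds and field names are placeholders.
MIN_VALID_SOURCES = 1

def validate_run(run: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the run passes."""
    problems = []
    if not (run.get("answer_text") or "").strip():
        problems.append("empty answer text")
    if not run.get("cited_urls"):
        problems.append("missing cited URLs")
    scraped = run.get("scraped_sources", [])
    usable = [s for s in scraped if (s.get("text") or "").strip()]
    if scraped and len(usable) < MIN_VALID_SOURCES:
        problems.append("too few valid sources after scraping")
    return problems
```
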
## Holdout policy

The exact hidden holdout queries must not be part of the public leaderboard package.

Publicly visible:

- split policy
- split size
- query-type coverage

Not publicly visible:

- official hidden query text
- hidden query IDs in public release builds

The holdout file should be stored only in the official evaluation environment, not inside the public repository.

## Publishing rule

A model should appear on the official leaderboard only when:

- the submission passed validation
- the SourceBench team ran the official hidden evaluation
- the resulting artifact was generated by the official pipeline

## Recommended official submission JSON schema

Endpoint submission:

```json
{
  "submission_mode": "endpoint",
  "submitter_name": "Example Researcher",
  "contact_email": "researcher@example.com",
  "model_name": "example-model",
  "model_version": "2026-03",
  "api_base": "https://api.example.com/v1",
  "api_key": "SECRET",
  "api_format": "openai-compatible",
  "web_search_mode": "built-in-search",
  "generation_config": {
    "temperature": 0
  },
  "system_prompt": null,
  "notes": "Optional notes",
  "agrees_to_reproducibility_policy": true
}
```

Answer + cited URL submission:

```json
{
  "submission_mode": "answer_url_bundle",
  "submitter_name": "Example Researcher",
  "contact_email": "researcher@example.com",
  "model_name": "example-model",
  "model_version": "2026-03",
  "web_search_mode": "built-in-search",
  "agrees_to_reproducibility_policy": true,
  "runs": [
    {
      "query_id": 101,
      "answer_text": "Model answer here",
      "cited_urls": [
        "https://example.com/a",
        "https://example.com/b"
      ],
      "raw_response": {}
    }
  ]
}
```

## Immediate next engineering steps

Validation CLI now exists at:

- `src/evaluation/validate_official_submission.py`

Minimal intake backend now exists at:

- `src/evaluation/official_submission_backend.py`

Minimal official runner now exists at:

- `src/evaluation/official_run.py`

Example submission templates now exist at:

- `leaderboard/examples/endpoint_submission.example.json`
- `leaderboard/examples/answer_url_bundle.example.json`

Example invocation of the intake backend:

```bash
python src/evaluation/official_submission_backend.py \
  --input leaderboard/examples/endpoint_submission.example.json
```

Next implementation steps after this contract are:

- replace the current Stage 1 adapter with the final endpoint answer-and-citation collection path
- harden the secure execution environment for stored secrets and holdout data
- keep the open leaderboard reading from the public split

leaderboard/QUERY_SPLIT_POLICY.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Query Split Policy v1

This document fixes the first official `public vs holdout` split for SourceBench leaderboard evaluation.

## Scope

Benchmark size:

- total queries: `100`
- query types: `5`
- queries per type: `20`

Types:

- `VACOS`
- `DebateQA`
- `HotpotQA`
- `Pinocchios`
- `QuoraQuestions`

## Fixed split

Version `v1` uses:

- `65` public queries
- `35` holdout queries

Per query type:

- `13` public
- `7` holdout

This keeps the query-type distribution balanced between the open leaderboard and the official leaderboard while reserving a larger hidden test set for official validation.

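For illustration, a stratified split matching these counts could be generated as in the sketch below. This is not the rule that produced the official `v1` split (which is deliberately undisclosed); it only demonstrates the per-type `13 / 7` allocation principle, and the `query_id` / `query_type` field names are assumed.

```python
# Illustrative stratified split sketch; not the official selection rule.
import random
from collections import defaultdict

def stratified_split(queries, public_per_type=13, holdout_per_type=7, seed=0):
    """queries: iterable of dicts with `query_id` and `query_type` keys (assumed field names)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for query in queries:
        by_type[query["query_type"]].append(query)

    public, holdout = [], []
    for query_type, items in sorted(by_type.items()):
        if len(items) != public_per_type + holdout_per_type:
            raise ValueError(f"{query_type}: expected {public_per_type + holdout_per_type} queries")
        rng.shuffle(items)
        public.extend(items[:public_per_type])
        holdout.extend(items[public_per_type:])
    return public, holdout
```
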
## Split generation principle

The `v1` split was created once from the benchmark master query pool with stratification by query type.

Publicly disclosed:

- total benchmark size
- query-type taxonomy
- public/holdout counts
- balanced per-type allocation

Not publicly disclosed:

- the exact holdout query membership
- the exact internal selection rule
- the internal benchmark master file used to materialize the holdout set

## Files

Public split:

- `data/queries/sourcebench_public_queries_v1.csv`

Holdout split:

- stored outside the public repository in the official evaluation environment

## Important release rule

The holdout query content must not live inside the public repository.

Recommended public-release behavior:

- keep `sourcebench_public_queries_v1.csv` in the public repo
- keep only the split policy and high-level counts public
- store the holdout query file only in the official evaluation environment
