# Official Submission Contract v1

This document defines what counts as a valid submission for the official SourceBench leaderboard.

## Goal

The official leaderboard is intended to be:

- comparable across systems
- resistant to benchmark-specific prompt tuning on public queries
- reproducible
- auditable

For that reason, official ranking is based on:

- hidden holdout queries
- fixed judge model version
- fixed post-processing and metric code
- server-side execution by the SourceBench team

## Evaluation modes

### 1. Open evaluation

Anyone can run the public pipeline locally.

Open evaluation:

- uses the public query split
- uses open-source code
- is suitable for debugging and informal comparison
- does **not** qualify a model for the official leaderboard by itself

### 2. Official evaluation

Official evaluation is run by the SourceBench team.

Official evaluation:

- uses hidden holdout queries
- uses the fixed official judging configuration
- is the only path to appearing on the official leaderboard

Current split target for `v1`:

- public: `65` queries
- holdout: `35` queries
- balanced by query type (`13 public + 7 holdout` per type, implying 5 query types)

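For concreteness, here is a minimal sketch of the balance check this target implies, assuming each query record carries hypothetical `query_type` and `split` fields:

```python
from collections import Counter

def check_split_balance(queries, public_per_type=13, holdout_per_type=7):
    """Verify the v1 target: 13 public + 7 holdout queries per type."""
    counts = Counter((q["query_type"], q["split"]) for q in queries)
    for query_type in {q["query_type"] for q in queries}:
        assert counts[(query_type, "public")] == public_per_type, query_type
        assert counts[(query_type, "holdout")] == holdout_per_type, query_type

# Synthetic example: 100 queries, 5 types, 65/35 split.
queries = [
    {"query_type": f"type_{i % 5}", "split": "public" if i < 65 else "holdout"}
    for i in range(100)
]
check_split_balance(queries)
```
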
## Accepted official submission types

### Preferred submission type: endpoint submission

Participants provide:

- `model_name`
- `api_base`
- `api_key`
- `api_format`
  - must be OpenAI-compatible for v1
- `generation_config`
  - optional structured config
- `system_prompt`
  - optional
- `notes`
  - optional

This is the preferred mode because the SourceBench team can run the model end-to-end on the hidden holdout set.

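As a reference for implementers, a minimal typed sketch of this payload; the class and defaults are illustrative, not an official schema:

```python
from dataclasses import dataclass
from typing import Any, Optional

# Illustrative mirror of the endpoint submission field list above.
@dataclass
class EndpointSubmission:
    model_name: str
    api_base: str
    api_key: str
    api_format: str  # must be "openai-compatible" for v1
    generation_config: Optional[dict[str, Any]] = None
    system_prompt: Optional[str] = None
    notes: Optional[str] = None
```
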
### Fallback submission type: answer + cited URL submission

If endpoint access cannot be provided, participants may submit, per query:

- `query_id`
- `answer_text`
- `cited_urls`
- `raw_response`
  - optional but preferred
- `model_name`

In this fallback mode:

- SourceBench still runs scraping/post-processing
- SourceBench still runs Qwen judging
- SourceBench still computes metrics

### Submission types not valid for official ranking

These are not accepted as final official submissions:

- pre-scraped source text only
- participant-postprocessed source objects only
- participant-computed judge scores only
- participant-computed final metrics only

These may be useful for debugging, but they are not sufficient for official ranking because they bypass too much of the standardized pipeline.

## Required participant fields

Minimum metadata for every official submission:

- `submitter_name`
- `organization`
  - optional
- `contact_email`
- `model_name`
- `model_version`
  - optional but strongly preferred
- `web_search_mode`
  - e.g. `built-in-search`
- `submission_mode`
  - `endpoint` or `answer_url_bundle`
- `agrees_to_reproducibility_policy`
  - boolean

## Official v1 assumptions

Version `v1` assumes:

- fixed public split
- fixed hidden holdout split
- fixed Qwen judge model/version
- fixed scoring prompts
- fixed metrics code
- built-in web-search generative engines (GEs) only

## What SourceBench runs server-side

For endpoint submissions:

1. Send hidden holdout queries to the submitted endpoint.
2. Capture answer text and cited URLs.
3. Scrape and normalize source pages.
4. Run the fixed Qwen judging step.
5. Compute leaderboard metrics.
6. Store the resulting official artifact.

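A minimal sketch of steps 1-2 against an OpenAI-compatible endpoint, assuming hypothetical `query_id` and `text` fields on the holdout records; steps 3-6 belong to the fixed official pipeline and are not shown:

```python
import re

from openai import OpenAI

URL_RE = re.compile(r"https?://\S+")

def collect_holdout_answers(submission: dict, holdout_queries: list[dict]) -> list[dict]:
    """Steps 1-2: send each hidden query, capture answer text and cited URLs."""
    client = OpenAI(base_url=submission["api_base"], api_key=submission["api_key"])
    runs = []
    for query in holdout_queries:
        response = client.chat.completions.create(
            model=submission["model_name"],
            messages=[{"role": "user", "content": query["text"]}],
            **(submission.get("generation_config") or {}),
        )
        answer = response.choices[0].message.content or ""
        runs.append({
            "query_id": query["query_id"],
            "answer_text": answer,
            # Naive regex extraction; the real pipeline may read structured
            # citations from the response instead.
            "cited_urls": URL_RE.findall(answer),
        })
    return runs
```
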
For answer + cited URL submissions:

1. Validate the submitted answer bundle.
2. Scrape and normalize cited URLs.
3. Run the fixed Qwen judging step.
4. Compute leaderboard metrics.
5. Store the resulting official artifact.

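Both modes converge on the same scrape/judge/metric stages; a sketch of the dispatch, reusing `collect_holdout_answers` from the sketch above:

```python
def select_runs(submission: dict, holdout_queries: list[dict]) -> list[dict]:
    """Pick the per-query runs that feed the shared downstream stages."""
    mode = submission["submission_mode"]
    if mode == "endpoint":
        # Endpoint path: answers are generated server-side.
        return collect_holdout_answers(submission, holdout_queries)
    if mode == "answer_url_bundle":
        # Bundle path: answers were supplied directly by the participant.
        return submission["runs"]
    raise ValueError(f"unknown submission_mode: {mode!r}")
```
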
## Validation checks

The official pipeline should reject or flag submissions with:

- malformed endpoint config
- missing cited URLs
- empty answer text
- empty or inaccessible source pages
- too few valid sources after scraping
- invalid schema
- judge input mismatch
- repeated or obviously degenerate outputs

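A sketch of a few of these checks for the answer bundle mode; the exact thresholds and messages are illustrative, not official values:

```python
def validate_answer_bundle(bundle: dict) -> list[str]:
    """Return human-readable problems; an empty list means the bundle passes."""
    problems = []
    runs = bundle.get("runs", [])
    for run in runs:
        qid = run.get("query_id")
        if not (run.get("answer_text") or "").strip():
            problems.append(f"query {qid}: empty answer text")
        if not run.get("cited_urls"):
            problems.append(f"query {qid}: missing cited URLs")
    # Crude degenerate-output check: every answer is identical.
    answers = [r.get("answer_text", "") for r in runs]
    if len(answers) > 1 and len(set(answers)) == 1:
        problems.append("all answers identical: degenerate output")
    return problems
```
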
## Holdout policy

The exact hidden holdout queries must not be part of the public leaderboard package.

Publicly visible:

- split policy
- split size
- query-type coverage

Not publicly visible:

- official hidden query text
- hidden query IDs in public release builds

The holdout file should be stored only in the official evaluation environment, not inside the public repository.

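One way to enforce this, sketched with a hypothetical environment variable: resolve the holdout file from the official environment at runtime, so no path inside the public repository is ever consulted:

```python
import os
from pathlib import Path

def load_holdout_path() -> Path:
    """Locate the holdout file; only the official environment defines it."""
    env = os.environ.get("SOURCEBENCH_HOLDOUT_PATH")  # hypothetical variable name
    if not env:
        raise RuntimeError("holdout data is only available in the official evaluation environment")
    return Path(env)
```
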
## Publishing rule

A model should appear on the official leaderboard only when:

- the submission passed validation
- the SourceBench team ran the official hidden evaluation
- the resulting artifact was generated by the official pipeline

| 185 | + |
| 186 | +Endpoint submission: |
| 187 | + |
| 188 | +```json |
| 189 | +{ |
| 190 | + "submission_mode": "endpoint", |
| 191 | + "submitter_name": "Example Researcher", |
| 192 | + "contact_email": "researcher@example.com", |
| 193 | + "model_name": "example-model", |
| 194 | + "model_version": "2026-03", |
| 195 | + "api_base": "https://api.example.com/v1", |
| 196 | + "api_key": "SECRET", |
| 197 | + "api_format": "openai-compatible", |
| 198 | + "web_search_mode": "built-in-search", |
| 199 | + "generation_config": { |
| 200 | + "temperature": 0 |
| 201 | + }, |
| 202 | + "system_prompt": null, |
| 203 | + "notes": "Optional notes", |
| 204 | + "agrees_to_reproducibility_policy": true |
| 205 | +} |
| 206 | +``` |
| 207 | + |
| 208 | +Answer + cited URL submission: |
| 209 | + |
| 210 | +```json |
| 211 | +{ |
| 212 | + "submission_mode": "answer_url_bundle", |
| 213 | + "submitter_name": "Example Researcher", |
| 214 | + "contact_email": "researcher@example.com", |
| 215 | + "model_name": "example-model", |
| 216 | + "model_version": "2026-03", |
| 217 | + "web_search_mode": "built-in-search", |
| 218 | + "agrees_to_reproducibility_policy": true, |
| 219 | + "runs": [ |
| 220 | + { |
| 221 | + "query_id": 101, |
| 222 | + "answer_text": "Model answer here", |
| 223 | + "cited_urls": [ |
| 224 | + "https://example.com/a", |
| 225 | + "https://example.com/b" |
| 226 | + ], |
| 227 | + "raw_response": {} |
| 228 | + } |
| 229 | + ] |
| 230 | +} |
| 231 | +``` |
| 232 | + |
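A minimal sketch of machine-checking these payloads with the `jsonschema` package; the schema below encodes only the shared required fields from this contract, not the full mode-specific shape:

```python
import json

from jsonschema import validate

# Shared required fields only; a fuller schema would also constrain
# mode-specific fields such as api_base or runs.
OFFICIAL_SUBMISSION_SCHEMA = {
    "type": "object",
    "required": [
        "submission_mode",
        "submitter_name",
        "contact_email",
        "model_name",
        "web_search_mode",
        "agrees_to_reproducibility_policy",
    ],
    "properties": {
        "submission_mode": {"enum": ["endpoint", "answer_url_bundle"]},
        "agrees_to_reproducibility_policy": {"type": "boolean"},
    },
}

with open("leaderboard/examples/endpoint_submission.example.json") as f:
    validate(instance=json.load(f), schema=OFFICIAL_SUBMISSION_SCHEMA)  # raises on failure
```
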
## Immediate next engineering step

A validation CLI now exists at:

- `/Users/kristinx351/Documents/UCSD/Courses_material/Q5/AdsInGenAI/Code/trust_evaluator/src/evaluation/validate_official_submission.py`

A minimal intake backend now exists at:

- `/Users/kristinx351/Documents/UCSD/Courses_material/Q5/AdsInGenAI/Code/trust_evaluator/src/evaluation/official_submission_backend.py`

A minimal official runner now exists at:

- `/Users/kristinx351/Documents/UCSD/Courses_material/Q5/AdsInGenAI/Code/trust_evaluator/src/evaluation/official_run.py`

Example submission templates now exist at:

- `/Users/kristinx351/Documents/UCSD/Courses_material/Q5/AdsInGenAI/Code/trust_evaluator/leaderboard/examples/endpoint_submission.example.json`
- `/Users/kristinx351/Documents/UCSD/Courses_material/Q5/AdsInGenAI/Code/trust_evaluator/leaderboard/examples/answer_url_bundle.example.json`

Example intake invocation:

```bash
python src/evaluation/official_submission_backend.py \
  --input leaderboard/examples/endpoint_submission.example.json
```

The next implementation steps after this contract are:

- replace the current Stage 1 adapter with the final endpoint answer-and-citation collection path
- harden the secure execution environment for stored secrets and holdout data
- keep the open leaderboard reading from the public split