
Commit bc809f9

Add SourceBench leaderboard site
1 parent 0341855 commit bc809f9

11 files changed

Lines changed: 36718 additions & 0 deletions
Lines changed: 263 additions & 0 deletions
@@ -0,0 +1,263 @@
# Official Submission Contract v1

This document defines what counts as a valid submission for the official SourceBench leaderboard.

## Goal

The official leaderboard is intended to be:

- comparable across systems
- resistant to benchmark-specific prompt tuning on public queries
- reproducible
- auditable

For that reason, official ranking is based on:

- hidden holdout queries
- fixed judge model version
- fixed post-processing and metric code
- server-side execution by the SourceBench team

## Evaluation modes

### 1. Open evaluation

Anyone can run the public pipeline locally.

Open evaluation:

- uses the public query split
- uses open-source code
- is suitable for debugging and informal comparison
- does **not** qualify a model for the official leaderboard by itself

### 2. Official evaluation

Official evaluation is run by the SourceBench team.

Official evaluation:

- uses hidden holdout queries
- uses the fixed official judging configuration
- is the only path to appearing on the official leaderboard

Current split target for `v1`:

- public: `65`
- holdout: `35`
- balanced by query type (`13 public + 7 holdout` per type)

## Accepted official submission types
### Preferred submission type: endpoint submission

Participants provide:

- `model_name`
- `api_base`
- `api_key`
- `api_format`
  - must be OpenAI-compatible for v1
- `generation_config`
  - optional structured config
- `system_prompt`
  - optional
- `notes`
  - optional

This is the preferred mode because the SourceBench team can run the model end-to-end on the hidden holdout set.

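As a rough illustration, the sketch below shows how one holdout query could be sent to a submitted endpoint, assuming a standard OpenAI-compatible `/chat/completions` route. The function name and field access are illustrative only and do not describe the actual Stage 1 adapter.

```python
# Illustrative sketch only; not the official pipeline code.
import requests

def query_submitted_endpoint(submission: dict, query_text: str) -> str:
    """Send one holdout query to the submitted endpoint and return the answer text."""
    messages = []
    if submission.get("system_prompt"):
        messages.append({"role": "system", "content": submission["system_prompt"]})
    messages.append({"role": "user", "content": query_text})

    response = requests.post(
        f"{submission['api_base'].rstrip('/')}/chat/completions",
        headers={"Authorization": f"Bearer {submission['api_key']}"},
        json={
            "model": submission["model_name"],
            "messages": messages,
            # generation_config is passed through as extra request parameters
            **submission.get("generation_config", {}),
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```
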
### Fallback submission type: answer + cited URL submission

If endpoint access cannot be provided, participants may instead submit, per query:

- `query_id`
- `answer_text`
- `cited_urls`
- `raw_response`
  - optional but preferred
- `model_name`

In this fallback mode:

- SourceBench still runs scraping/post-processing
- SourceBench still runs Qwen judging
- SourceBench still computes metrics

### Submission types not valid for official ranking

These are not accepted as final official submissions:

- pre-scraped source text only
- participant-postprocessed source objects only
- participant-computed judge scores only
- participant-computed final metrics only

These may be useful for debugging, but they are not sufficient for official ranking because too much of the standardized pipeline has already been bypassed.

## Required participant fields

Minimum metadata for every official submission:

- `submitter_name`
- `organization`
  - optional
- `contact_email`
- `model_name`
- `model_version`
  - optional but strongly preferred
- `web_search_mode`
  - e.g. built-in search
- `submission_mode`
  - `endpoint` or `answer_url_bundle`
- `agrees_to_reproducibility_policy`
  - boolean

## Official v1 assumptions

Version `v1` assumes:

- fixed public split
- fixed hidden holdout split
- fixed Qwen judge model/version
- fixed scoring prompts
- fixed metrics code
- built-in web-search GEs only

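One lightweight way to make these assumptions explicit is to pin them in a single frozen configuration object that every official run loads. The sketch below is illustrative only; the field names and values are placeholders, not the real pipeline constants.

```python
# Illustrative sketch of a pinned official v1 configuration; values are placeholders.
OFFICIAL_V1_CONFIG = {
    "split": {"public": 65, "holdout": 35, "public_per_type": 13, "holdout_per_type": 7},
    "judge": {"family": "Qwen", "version": "fixed-official-version", "prompt_version": "v1"},
    "metrics_code_version": "v1",
    "allowed_web_search_modes": ["built-in-search"],
}

def check_submission_against_v1(submission: dict) -> None:
    """Reject submissions that fall outside the pinned v1 assumptions."""
    if submission.get("web_search_mode") not in OFFICIAL_V1_CONFIG["allowed_web_search_modes"]:
        raise ValueError("v1 only accepts built-in web-search submissions")
```
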
## What SourceBench runs server-side

For endpoint submissions:

1. Send hidden holdout queries to the submitted endpoint.
2. Capture answer text and cited URLs.
3. Scrape and normalize source pages.
4. Run the fixed Qwen judging step.
5. Compute leaderboard metrics.
6. Store the resulting official artifact.

For answer + cited URL submissions:

1. Validate the submitted answer bundle.
2. Scrape and normalize cited URLs.
3. Run the fixed Qwen judging step.
4. Compute leaderboard metrics.
5. Store the resulting official artifact.

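Both sequences could be expressed as a single server-side driver. The sketch below only illustrates the ordering of stages; the stage functions are passed in as callables because their real names and signatures live in the pipeline code, not in this document.

```python
# Illustrative driver sketch; stage implementations are supplied by the pipeline.
from typing import Callable

def run_official_evaluation(
    submission: dict,
    holdout_queries: list[dict],
    *,
    collect_answers: Callable,   # endpoint mode: query endpoint, capture answers + cited URLs
    validate_bundle: Callable,   # fallback mode: check the submitted answer bundle
    scrape_sources: Callable,    # scrape and normalize cited URLs
    run_qwen_judge: Callable,    # fixed Qwen judging step
    compute_metrics: Callable,   # leaderboard metrics
    store_artifact: Callable,    # persist the official artifact
) -> dict:
    if submission["submission_mode"] == "endpoint":
        runs = collect_answers(submission, holdout_queries)
    else:
        runs = validate_bundle(submission["runs"], holdout_queries)

    sources = scrape_sources(runs)
    judgments = run_qwen_judge(runs, sources)
    metrics = compute_metrics(judgments)
    return store_artifact(submission, runs, sources, judgments, metrics)
```
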
## Validation checks

The official pipeline should reject or flag submissions with:

- malformed endpoint config
- missing cited URLs
- empty answer text
- empty or inaccessible source pages
- too few valid sources after scraping
- invalid schema
- judge input mismatch
- repeated or obviously degenerate outputs

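A few of these checks are simple enough to sketch directly. The field names below follow the answer-bundle schema in this document; the `scraped_sources` field and the minimum-source threshold are placeholders for whatever the real pipeline produces.

```python
# Illustrative per-run validation sketch; thresholds and field names are placeholders.
MIN_VALID_SOURCES = 1

def validate_run(run: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the run passes."""
    problems = []
    if not (run.get("answer_text") or "").strip():
        problems.append("empty answer text")
    if not run.get("cited_urls"):
        problems.append("missing cited URLs")
    scraped = run.get("scraped_sources", [])
    usable = [s for s in scraped if (s.get("text") or "").strip()]
    if scraped and len(usable) < MIN_VALID_SOURCES:
        problems.append("too few valid sources after scraping")
    return problems
```
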
## Holdout policy

The exact hidden holdout queries must not be part of the public leaderboard package.

Publicly visible:

- split policy
- split size
- query-type coverage

Not publicly visible:

- official hidden query text
- hidden query IDs in public release builds

The holdout file should be stored only in the official evaluation environment, not inside the public repository.

## Publishing rule

A model should appear on the official leaderboard only when:

- the submission passed validation
- the SourceBench team ran the official hidden evaluation
- the resulting artifact was generated by the official pipeline

## Recommended official submission JSON schema

Endpoint submission:

```json
{
  "submission_mode": "endpoint",
  "submitter_name": "Example Researcher",
  "contact_email": "researcher@example.com",
  "model_name": "example-model",
  "model_version": "2026-03",
  "api_base": "https://api.example.com/v1",
  "api_key": "SECRET",
  "api_format": "openai-compatible",
  "web_search_mode": "built-in-search",
  "generation_config": {
    "temperature": 0
  },
  "system_prompt": null,
  "notes": "Optional notes",
  "agrees_to_reproducibility_policy": true
}
```

Answer + cited URL submission:

```json
{
  "submission_mode": "answer_url_bundle",
  "submitter_name": "Example Researcher",
  "contact_email": "researcher@example.com",
  "model_name": "example-model",
  "model_version": "2026-03",
  "web_search_mode": "built-in-search",
  "agrees_to_reproducibility_policy": true,
  "runs": [
    {
      "query_id": 101,
      "answer_text": "Model answer here",
      "cited_urls": [
        "https://example.com/a",
        "https://example.com/b"
      ],
      "raw_response": {}
    }
  ]
}
```

## Immediate next engineering steps

Validation CLI now exists at:

- `src/evaluation/validate_official_submission.py`

Minimal intake backend now exists at:

- `src/evaluation/official_submission_backend.py`

Minimal official runner now exists at:

- `src/evaluation/official_run.py`

Example submission templates now exist at:

- `leaderboard/examples/endpoint_submission.example.json`
- `leaderboard/examples/answer_url_bundle.example.json`

Example invocation of the intake backend:

```bash
python src/evaluation/official_submission_backend.py \
  --input leaderboard/examples/endpoint_submission.example.json
```

Next implementation steps after this contract are:

- replace the current Stage 1 adapter with the final endpoint answer-and-citation collection path
- harden the secure execution environment for stored secrets and holdout data
- keep the open leaderboard reading from the public split

leaderboard/QUERY_SPLIT_POLICY.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Query Split Policy v1

This document fixes the first official `public vs holdout` split for SourceBench leaderboard evaluation.

## Scope

Benchmark size:

- total queries: `100`
- query types: `5`
- queries per type: `20`

Types:

- `VACOS`
- `DebateQA`
- `HotpotQA`
- `Pinocchios`
- `QuoraQuestions`

## Fixed split

Version `v1` uses:

- `65` public queries
- `35` holdout queries

Per query type:

- `13` public
- `7` holdout

This keeps the query-type distribution balanced between the open leaderboard and the official leaderboard while reserving a larger hidden test set for official validation.

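For illustration, a stratified split matching these counts could be generated as in the sketch below. This is not the rule that produced the official `v1` split (which is deliberately undisclosed); it only demonstrates the per-type `13 / 7` allocation principle, and the `query_id` / `query_type` field names are assumed.

```python
# Illustrative stratified split sketch; not the official selection rule.
import random
from collections import defaultdict

def stratified_split(queries, public_per_type=13, holdout_per_type=7, seed=0):
    """queries: iterable of dicts with `query_id` and `query_type` keys (assumed field names)."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for query in queries:
        by_type[query["query_type"]].append(query)

    public, holdout = [], []
    for query_type, items in sorted(by_type.items()):
        if len(items) != public_per_type + holdout_per_type:
            raise ValueError(f"{query_type}: expected {public_per_type + holdout_per_type} queries")
        rng.shuffle(items)
        public.extend(items[:public_per_type])
        holdout.extend(items[public_per_type:])
    return public, holdout
```
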
## Split generation principle

The `v1` split was created once from the benchmark master query pool with stratification by query type.

Publicly disclosed:

- total benchmark size
- query-type taxonomy
- public/holdout counts
- balanced per-type allocation

Not publicly disclosed:

- the exact holdout query membership
- the exact internal selection rule
- the internal benchmark master file used to materialize the holdout set

## Files

Public split:

- `data/queries/sourcebench_public_queries_v1.csv`

Holdout split:

- stored outside the public repository in the official evaluation environment

## Important release rule

The holdout query content must not live inside the public repository.

Recommended public-release behavior:

- keep `sourcebench_public_queries_v1.csv` in the public repo
- keep only the split policy and high-level counts public
- store the holdout query file only in the official evaluation environment
