Add parser ground truth tooling and arXiv dataset #64

Draft
AymanL wants to merge 1 commit into main from feat/parser-ground-truth

Collaborator

AymanL commented Apr 16, 2026

Summary

  • Parser ground truth tooling: two new scripts under data_collection/
    • ground_truth.py: queries the arXiv API to collect verified reference articles (the LaTeX source serves as clean ground truth, with no PDF extraction artifacts)
    • download_ground_truth.py: downloads PDFs and extracts .tex source text in parallel, producing a local verified_ground_truth_data/ corpus for parser benchmarking
  • verified_ground_truth.csv: committed seed dataset of 33 arXiv articles across vaccine/autism and control categories
  • Tests & factories: expanded model tests and added IngestionRun / ParsedArtifact factory stubs
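
For reviewers unfamiliar with the arXiv API, here is a minimal sketch of the kind of query and Atom parsing that a collector like ground_truth.py could perform. It is not the code in this PR: the function names, the search string, and the `arxiv_id` / `title` row keys are assumptions made for illustration. Only the public endpoint `http://export.arxiv.org/api/query` and the Atom namespace come from the arXiv API itself.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Public arXiv API endpoint; responses are Atom feeds.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(search_query, max_results):
    """Build an arXiv API query URL (hypothetical helper, not from the PR)."""
    params = {
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_entries(atom_xml):
    """Turn an Atom feed string into rows with an id and a title.

    The row keys are assumed; the real CSV schema may differ.
    """
    root = ET.fromstring(atom_xml)
    rows = []
    for entry in root.iter(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="").strip()
        # Titles in the feed can wrap across lines; collapse the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        # Entry ids look like http://arxiv.org/abs/<id>; keep only <id>.
        rows.append({"arxiv_id": abs_url.rsplit("/abs/", 1)[-1], "title": title})
    return rows
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the parsing can be exercised offline against a saved feed.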

Test plan

  • Run pytest tests/ingestion/test_models.py — all model tests pass
  • Apply migrations cleanly on a fresh DB (manage.py migrate)
  • Run python -m eu_fact_force.ingestion.data_collection.ground_truth --vaccine-limit 5 --other-limit 5 and confirm CSV output
  • Run python -m eu_fact_force.ingestion.data_collection.download_ground_truth --csv verified_ground_truth.csv --output-dir /tmp/gt_test --workers 2 and verify PDFs + .txt files are written
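
The last check can be done by hand, but a small helper makes it repeatable. The sketch below assumes, purely for illustration, that the CSV has an `arxiv_id` column and that each article is written as `<id>.pdf` and `<id>.txt` in the output directory (with `/` in old-style ids replaced by `_`). The real naming in download_ground_truth.py may differ, so adjust accordingly.

```python
import csv
from pathlib import Path


def find_missing_outputs(csv_path, output_dir, id_column="arxiv_id"):
    """Return names of expected output files that are absent or empty.

    The column name and the <id>.pdf / <id>.txt naming are assumptions,
    not taken from the PR.
    """
    out = Path(output_dir)
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Old-style ids such as hep-th/9901001 contain a slash.
            stem = row[id_column].replace("/", "_")
            for suffix in (".pdf", ".txt"):
                path = out / f"{stem}{suffix}"
                if not path.is_file() or path.stat().st_size == 0:
                    missing.append(path.name)
    return missing
```

An empty list from `find_missing_outputs("verified_ground_truth.csv", "/tmp/gt_test")` would mean every row produced both files under these assumed names.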
