Add OuestFrance API integration with Factiva-format S3 pipeline #504
Open
Add support for ingesting OuestFrance ecology articles via their custom XML API, converting to Factiva-format JSON, and uploading to S3 where the existing Factiva S3→PostgreSQL pipeline processes them automatically.
Key changes:
- Pydantic schema (article_schema.py) for validating Factiva JSON at write time
- OuestFrance XML→S3 converter (api_to_s3.py) with full field mapping
- OUESTFR source code added to classification + Alembic migration
- Optional article_url/tags columns on factiva_articles & lemonde_ftp_articles
- LeMonde FTP retrofitted with Pydantic validation
- OuestFrance excluded from DBT dashboard models (no total article counts)
- Dockerfile and entrypoint for the OuestFrance API→S3 job
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
… tests
- Fix test_stop_word_get_top_keywords_by_channel to match current keyword dictionary output (climatique, énergie fossile)
- Add 29 tests for Pydantic schema validation and OuestFrance XML parsing
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Merge Dockerfile_ouest_france and Dockerfile_lemonde_web into one Dockerfile with a parameterized entrypoint script. The pipeline is selected via DEFAULT_PIPELINE build arg (or CLI arg / env var at runtime).
- Dockerfile_lemonde_web defaults to "lemonde" (backward compatible)
- CI builds ouest_france image with --build-arg DEFAULT_PIPELINE=ouest_france
- Remove separate Dockerfile_ouest_france and per-pipeline entrypoint scripts
- Add unified docker-entrypoint-press-ingestion.sh
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
- Rename Dockerfile_lemonde_web → Dockerfile_press_ingestion (generic)
- Restore separate docker-entrypoint-lemonde-web.sh (original)
- Keep separate docker-entrypoint-ouest-france.sh
- Remove unified docker-entrypoint-press-ingestion.sh
- Dockerfile accepts ENTRYPOINT_SCRIPT build arg to select which script
- CI builds lemonde_web image with default (lemonde) entrypoint
- CI builds ouest_france image with --build-arg ENTRYPOINT_SCRIPT
- Remove /tmp/ouest_france_articles from Dockerfile (created in Python)
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
The Dockerfile just sets CMD to the lemonde script as default. Both images are the same build — ouest_france is a tag alias. The Scaleway job for OuestFrance overrides the command at runtime. https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
…de_web
- Set executable bit on entrypoint scripts in git, remove chmod from Dockerfile
- Build and push a single press_ingestion image
- Retag as lemonde_web for backward compatibility with existing infra
- OuestFrance Scaleway job uses press_ingestion image directly (command overridden at runtime to ./docker-entrypoint-ouest-france.sh)
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Pydantic was overkill: we only used field validators (trivial __post_init__ checks), model_dump (dataclasses.asdict), and extra="allow" (speculative, removed). Standard dataclasses provide the same typing guarantees with zero extra dependency.
- Rewrite article_schema.py using @dataclass
- Replace model_dump() → to_dict() in all callers
- Remove explicit pydantic dependency from pyproject.toml (still available transitively via langchain_core)
- Update tests accordingly (28 pass)
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
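To illustrate the dataclass approach described in this commit, here is a minimal sketch. The class name `ArticleAttributes` and the exact validation rule are assumptions made for illustration; only the required fields (an, source_code, source_name), the optional article_url/tags fields, `__post_init__`, and `to_dict()` are named in this PR.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ArticleAttributes:
    """Hypothetical stand-in for the schema class in article_schema.py."""

    # Required fields named in the PR description.
    an: str
    source_code: str
    source_name: str
    # Optional extensions for non-Factiva sources.
    article_url: Optional[str] = None
    tags: Optional[List[str]] = None

    def __post_init__(self) -> None:
        # Plays the role of the former Pydantic field validators
        # (assumed rule: required fields must be non-empty strings).
        for name in ("an", "source_code", "source_name"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")

    def to_dict(self) -> Dict[str, Any]:
        # Replaces model_dump(); asdict() recurses into nested dataclasses.
        return asdict(self)
```

Callers that previously used model_dump() would call to_dict() on the instance instead.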
Auto-detect between <contenus> (original) and RSS <rss><channel><item> format.
Key differences in RSS:
- <guid> → article ID (instead of <id>)
- <description> → body (instead of <texte>)
- <dc:creator> → byline (instead of <signature>)
- <pubDate> RFC 2822 → parsed to ISO 8601 (instead of <dateParution>)
- <enclosure url> → art (instead of <photos>)
- <link> → article_url (instead of <url>)
- word_count computed from body (no <nombreMots>)
- no snippet, tags, or photo credits in RSS format
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
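The detection and RSS mapping above could look roughly like the following sketch. The function names, the `publication_datetime` key, the whitespace-based word count, and the assumption that `dc` is the standard Dublin Core namespace are illustrative choices, not taken from api_to_s3.py.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Any, Dict

# Assumption: <dc:creator> uses the standard Dublin Core namespace.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def detect_format(root: ET.Element) -> str:
    """Return "rss" or "contenus" depending on the root element."""
    if root.tag == "rss":
        return "rss"
    if root.tag == "contenus":
        return "contenus"
    raise ValueError(f"Unknown OuestFrance XML root element: {root.tag}")


def parse_rss_item(item: ET.Element) -> Dict[str, Any]:
    """Map one RSS <item> to the fields listed in the commit message."""
    body = item.findtext("description", default="")
    pub_date = item.findtext("pubDate")
    enclosure = item.find("enclosure")
    return {
        "an": item.findtext("guid"),
        "body": body,
        "byline": item.findtext("dc:creator", namespaces=DC_NS),
        # RFC 2822 date string converted to ISO 8601.
        "publication_datetime": (
            parsedate_to_datetime(pub_date).isoformat() if pub_date else None
        ),
        "art": enclosure.get("url") if enclosure is not None else None,
        "article_url": item.findtext("link"),
        # No <nombreMots> in RSS, so count whitespace-separated tokens.
        "word_count": len(body.split()),
    }
```

A caller would parse the file with ET.parse(path).getroot(), call detect_format() on the root, and then iterate over root.iter("item") for the RSS case.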
Contenus format (paper articles) → document_type="paper"
RSS format (web articles) → document_type="web"
This is stored in the existing document_type column in factiva_articles, allowing queries to distinguish paper vs web OuestFrance articles.
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
…essor
LeMonde articles don't have these fields; the nullable columns default to NULL without explicit mapping.
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
- <surtitre>: prepended to title if present (e.g. "Environnement - Title")
- <localisations>: location names stored in region_of_origin (comma-separated, defaults to "France" when empty)
Both formats now fully match the API documentation.
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
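The two rules in this commit can be sketched as small helpers. The function names are hypothetical, and the exact separator (a bare comma here) is an assumption; the commit only says "comma-separated".

```python
from typing import Iterable, Optional


def build_title(title: str, surtitre: Optional[str] = None) -> str:
    """Prepend the <surtitre> to the title when it is present."""
    surtitre = (surtitre or "").strip()
    return f"{surtitre} - {title}" if surtitre else title


def build_region_of_origin(localisations: Iterable[Optional[str]]) -> str:
    """Join location names, falling back to "France" when none are given."""
    names = [name.strip() for name in localisations if name and name.strip()]
    # Assumption: names are joined with a bare comma.
    return ",".join(names) if names else "France"
```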
Support multiple XML files via OUESTFRANCE_LOCAL_XML (directory, glob, or comma-separated paths). Deduplicate articles by ID before export to prevent the same article from being saved twice to S3. https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
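A rough sketch of how the OUESTFRANCE_LOCAL_XML value might be resolved and how articles could be deduplicated. Both function names are hypothetical, and two details are assumptions: that a directory is scanned for `*.xml` files, and that the article ID lives under the `an` key.

```python
import glob
import os
from typing import Any, Dict, Iterable, List


def resolve_xml_paths(spec: str) -> List[str]:
    """Expand a directory, glob pattern, or comma-separated list of paths."""
    paths: List[str] = []
    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue
        if os.path.isdir(part):
            # Assumption: only *.xml files in the directory are relevant.
            paths.extend(sorted(glob.glob(os.path.join(part, "*.xml"))))
        elif any(ch in part for ch in "*?["):
            paths.extend(sorted(glob.glob(part)))
        else:
            paths.append(part)
    return paths


def dedupe_by_id(articles: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each article ID, preserving order."""
    seen = set()
    unique = []
    for article in articles:
        article_id = article["an"]
        if article_id in seen:
            continue
        seen.add(article_id)
        unique.append(article)
    return unique
```

Keeping the first occurrence is one possible policy; keeping the last would also satisfy the "saved only once" requirement.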
…dicators
- Add load_and_insert_monthly_stats() to read a JSON file with monthly
web/paper article totals and distribute them evenly across days into
stats_factiva_articles. This provides the denominator for ratio calculations.
- Wire it into main() via OUESTFRANCE_MONTHLY_STATS env var.
- Remove OUESTFR from the exclusion list in print_media_crises_indicators.sql,
enabling OuestFrance in the daily and monthly dashboard models.
JSON format: [{"year": 2025, "month": 1, "web": 5000, "paper": 3000}, ...]
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
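As an illustration of the even distribution described above, the sketch below expands the documented JSON format into per-day rows. The function names, the row keys, and the handling of remainders (the first days of the month receive one extra article so the daily counts sum back to the monthly total) are assumptions; the commit does not say how non-divisible totals are treated.

```python
import calendar
import json
from datetime import date
from typing import Any, Dict, List, Tuple


def distribute_evenly(total: int, year: int, month: int) -> List[Tuple[date, int]]:
    """Split a monthly total into integer daily counts that sum to the total."""
    days_in_month = calendar.monthrange(year, month)[1]
    base, remainder = divmod(total, days_in_month)
    return [
        # Assumption: the first `remainder` days get one extra article.
        (date(year, month, day), base + (1 if day <= remainder else 0))
        for day in range(1, days_in_month + 1)
    ]


def expand_monthly_stats(raw_json: str) -> List[Dict[str, Any]]:
    """Turn the monthly JSON entries into one row per day and per type."""
    rows: List[Dict[str, Any]] = []
    for entry in json.loads(raw_json):
        for kind in ("web", "paper"):
            for day, count in distribute_evenly(
                entry[kind], entry["year"], entry["month"]
            ):
                rows.append({"date": day, "document_type": kind, "count": count})
    return rows
```

The resulting rows would then be inserted into stats_factiva_articles; that insertion step is not shown here.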
…R (web)
Like LeMonde (LEMOND/LEMFR), OuestFrance now uses separate source codes:
- OUESTFRANCE for paper articles (contenus XML format)
- OUESTFRAFR for web articles (RSS XML format)
Both share media_all="Ouest-France" for grouped dashboard views. Monthly stats are inserted separately per source code.
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Web articles (RSS/OUESTFRAFR) now use source_name="Ouest-France.fr" to match the source_classification entry, while paper articles (contenus/OUESTFRANCE) keep source_name="Ouest-France". https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
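Taken together, the format-dependent values from the last few commits can be summarized in a small lookup table. This is only a sketch: the names `SOURCE_BY_FORMAT` and `source_fields` are hypothetical, and the real code may organize these values differently.

```python
from typing import Dict

# Values taken from the commit messages; the structure itself is assumed.
SOURCE_BY_FORMAT: Dict[str, Dict[str, str]] = {
    "contenus": {
        "source_code": "OUESTFRANCE",
        "source_name": "Ouest-France",
        "document_type": "paper",
    },
    "rss": {
        "source_code": "OUESTFRAFR",
        "source_name": "Ouest-France.fr",
        "document_type": "web",
    },
}


def source_fields(xml_format: str) -> Dict[str, str]:
    """Return a copy of the source metadata for a detected XML format."""
    try:
        return dict(SOURCE_BY_FORMAT[xml_format])
    except KeyError:
        raise ValueError(f"Unsupported XML format: {xml_format}") from None
```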
Overall summary
This PR adds the import of OuestFrance articles via their dedicated source.
The import itself is not complete, because we do not yet know how data will be exchanged with their API/FTP; so far we have only had access to an export they provided. This part will need to be recoded.
Points to note:
The following is from Claude
Tech Summary
Adds support for ingesting ecology articles from the OuestFrance API, converting them to Factiva-compatible JSON format, and uploading to S3 for processing by the existing Factiva pipeline. This enables OuestFrance as a new regional news source alongside existing Factiva and LeMonde integrations.
Key Changes
New OuestFrance API Integration
New OuestFrance API Integration
- quotaclimat/data_ingestion/ouest_france/api_to_s3.py: Complete pipeline to fetch XML from OuestFrance API, parse articles, convert to Factiva format, partition by date, and upload to S3
Shared Factiva Schema
- quotaclimat/data_ingestion/factiva/schemas/article_schema.py (new): Pydantic models for validating Factiva-format articles
  - FactivaArticleAttributes: Core article metadata with required fields (an, source_code, source_name) and optional extensions for non-Factiva sources
  - FactivaArticleEnvelope: Single article wrapper
  - FactivaS3Document: Top-level S3 JSON document container
LeMonde FTP Refactoring
- quotaclimat/data_ingestion/lemonde_ftp/ftp_to_s3.py: Refactored to use shared FactivaArticleAttributes and FactivaArticleEnvelope models instead of raw dictionaries
Database Schema Extensions
- alembic_factiva/versions/n8o9p0q1r2s3_add_ouest_france_support.py (new): Migration to add optional article_url (Text) and tags (JSON) columns to factiva_articles and lemonde_ftp_articles tables
- postgres/schemas/factiva_models.py: Adds corresponding ORM columns for article_url and tags
Source Classification
- quotaclimat/data_ingestion/factiva/inputs/classification_source.py: Registers OUESTFR source with metadata (owner: SIPA Ouest-France, region: Bretagne)
Docker & Deployment
- Dockerfile_ouest_france (new): Multi-stage build for OuestFrance pipeline
- docker-entrypoint-ouest-france.sh (new): Entrypoint script orchestrating migrations and pipeline execution
Configuration
- pyproject.toml: Added pydantic ^2.0 dependency for schema validation
Implementation Details
- year_{YYYY}/month_{MM}/{YYYY}_{MM}_{DD}_stream.json matching Factiva S3 structure
https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
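The date-partitioned key layout named above could be produced along these lines. The function names and the `publication_date` field are assumptions made for this sketch; only the key pattern itself comes from the PR description, and zero-padding of month and day is inferred from the {MM}/{DD} placeholders.

```python
from collections import defaultdict
from datetime import date
from typing import Any, Dict, Iterable, List


def s3_key_for(day: date) -> str:
    """Build the year_/month_/..._stream.json key for one publication day."""
    return f"year_{day:%Y}/month_{day:%m}/{day:%Y}_{day:%m}_{day:%d}_stream.json"


def partition_by_date(
    articles: Iterable[Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Group articles by S3 key so each day is written as one JSON document."""
    partitions: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for article in articles:
        # Assumption: each article carries a datetime.date under this key.
        partitions[s3_key_for(article["publication_date"])].append(article)
    return dict(partitions)
```

Each key in the returned mapping would correspond to one upload to the bucket read by the existing Factiva S3→PostgreSQL pipeline.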