Add OuestFrance API integration with Factiva-format S3 pipeline #504

Open
apibrac wants to merge 19 commits into main from claude/extract-ecological-themes-Cnn38

@apibrac (Collaborator) commented Mar 28, 2026

Overall summary

This PR adds the import of OuestFrance articles through their dedicated source.
The import itself is not complete: we do not yet know how we will exchange data with their API/FTP, and so far we have only had access to an export they sent us. This part will need to be recoded.

Points to note:

  • this copies and pastes the process used for LeMonde, which is itself outside Factiva, with a few notable differences:
    • there is a single api_to_s3 job; the Factiva job then processes the data, since it is stored in S3 in the right format
    • we reuse the same Docker image as LeMonde, since it is identical apart from the entrypoint. Perhaps we should eventually aim for a single Docker image for all jobs rather than one per job?
    • a step was added to check that the data stored in S3 has the right format (Factiva's): a dataclass with all the fields, which also handles the conversion to JSON. It was added for LeMonde as well
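
To make the validation step concrete, here is a minimal sketch of what such a dataclass could look like. It is not the code from this PR: only the class name `FactivaArticleAttributes`, the `to_dict()` method and the field names `an`, `source_code`, `source_name`, `article_url` and `tags` appear in this conversation. The `title` and `body` fields and the `to_json()` helper are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FactivaArticleAttributes:
    # Required identifiers.
    an: str
    source_code: str
    source_name: str
    # Illustrative content fields (assumed; the real schema has more).
    title: Optional[str] = None
    body: Optional[str] = None
    # Optional extensions for non-Factiva sources.
    article_url: Optional[str] = None
    tags: Optional[list] = None

    def __post_init__(self) -> None:
        # Reject missing or blank required fields at construction time,
        # i.e. before anything is written to S3.
        for name in ("an", "source_code", "source_name"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")

    def to_dict(self) -> dict:
        return asdict(self)

    def to_json(self) -> str:
        # Hypothetical helper: keep accents readable in the output.
        return json.dumps(self.to_dict(), ensure_ascii=False)
```

Building the object is the validation: a record with an empty `an` raises `ValueError` instead of reaching the bucket.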

The rest of this description was generated by Claude.

Tech Summary

Adds support for ingesting ecology articles from the OuestFrance API, converting them to Factiva-compatible JSON format, and uploading to S3 for processing by the existing Factiva pipeline. This enables OuestFrance as a new regional news source alongside existing Factiva and LeMonde integrations.

Key Changes

New OuestFrance API Integration

  • quotaclimat/data_ingestion/ouest_france/api_to_s3.py: Complete pipeline to fetch XML from OuestFrance API, parse articles, convert to Factiva format, partition by date, and upload to S3
    • Fetches ecology articles via authenticated API requests with configurable date ranges
    • Parses XML elements (title, body, author, publication date, photos, tags)
    • Strips HTML from content fields
    • Validates output against Pydantic schema before S3 upload
    • Partitions articles by year/month/day with configurable batch sizes
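
As an illustration of the "strips HTML from content fields" step, the following is a sketch using only the standard library. The helper name `strip_html` and the set of block-level tags are assumptions; the PR may do this differently.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent paragraphs do not
# run together (assumed list, extend as needed).
_BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &#39; or &amp;.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self) -> str:
        # Join the raw chunks, then collapse any run of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_html(fragment: str) -> str:
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.text()
```

Inline tags such as `<b>` are dropped without adding a space, so a word that is only partly emphasised is not split.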

Shared Factiva Schema

  • quotaclimat/data_ingestion/factiva/schemas/article_schema.py (new): Pydantic models for validating Factiva-format articles
    • FactivaArticleAttributes: Core article metadata with required fields (an, source_code, source_name) and optional extensions for non-Factiva sources
    • FactivaArticleEnvelope: Single article wrapper
    • FactivaS3Document: Top-level S3 JSON document container
    • Supports article_url and tags fields for regional sources

LeMonde FTP Refactoring

  • quotaclimat/data_ingestion/lemonde_ftp/ftp_to_s3.py: Refactored to use shared FactivaArticleAttributes and FactivaArticleEnvelope models instead of raw dictionaries
    • Improves consistency and validation across all sources
    • Maintains backward compatibility with existing pipeline

Database Schema Extensions

  • alembic_factiva/versions/n8o9p0q1r2s3_add_ouest_france_support.py (new): Migration to add optional columns
    • Adds article_url (Text) and tags (JSON) columns to factiva_articles and lemonde_ftp_articles tables
    • Inserts OUESTFR source classification record
  • postgres/schemas/factiva_models.py: Adds corresponding ORM columns for article_url and tags

Source Classification

  • quotaclimat/data_ingestion/factiva/inputs/classification_source.py: Registers OUESTFR source with metadata (owner: SIPA Ouest-France, region: Bretagne)

Docker & Deployment

  • Dockerfile_ouest_france (new): Multi-stage build for OuestFrance pipeline
    • Runs Alembic migrations before article ingestion
    • Executes OuestFrance API to S3 pipeline
  • docker-entrypoint-ouest-france.sh (new): Entrypoint script orchestrating migrations and pipeline execution

Configuration

  • pyproject.toml: Added pydantic ^2.0 dependency for schema validation

Implementation Details

  • Environment-driven configuration: API URL, token, S3 credentials, date ranges all configurable via environment variables
  • Flexible secret handling: Supports both direct values and file paths for sensitive credentials
  • Batch processing: Articles saved in configurable batches (default 1000) to manage memory usage
  • Date partitioning: Files organized as year_{YYYY}/month_{MM}/{YYYY}_{MM}_{DD}_stream.json matching Factiva S3 structure
  • Validation-first approach: All articles validated against Pydantic schema before S3 upload, catching format errors early
  • Reuses existing pipeline: Converted articles automatically picked up by existing Factiva S3→PostgreSQL ingestion
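
The date partitioning and batching described above could be sketched as follows. The key layout comes from the list; the function names, the `publication_date` field and the use of an ISO 8601 string for it are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterator, List


def s3_key_for(day: date) -> str:
    """Build year_{YYYY}/month_{MM}/{YYYY}_{MM}_{DD}_stream.json."""
    return f"year_{day:%Y}/month_{day:%m}/{day:%Y}_{day:%m}_{day:%d}_stream.json"


def partition_by_day(articles: List[dict]) -> Dict[str, List[dict]]:
    """Group articles under the S3 key of their publication day."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for article in articles:
        # Assumed field: an ISO 8601 string starting with YYYY-MM-DD.
        day = date.fromisoformat(article["publication_date"][:10])
        groups[s3_key_for(day)].append(article)
    return dict(groups)


def iter_batches(items: List[dict], size: int = 1000) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

How batches map to files (one file per day, or several parts per day) is not stated here, so the sketch leaves that decision to the caller.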

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa

claude and others added 19 commits March 28, 2026 12:06
Add support for ingesting OuestFrance ecology articles via their custom
XML API, converting to Factiva-format JSON, and uploading to S3 where the
existing Factiva S3→PostgreSQL pipeline processes them automatically.

Key changes:
- Pydantic schema (article_schema.py) for validating Factiva JSON at write time
- OuestFrance XML→S3 converter (api_to_s3.py) with full field mapping
- OUESTFR source code added to classification + Alembic migration
- Optional article_url/tags columns on factiva_articles & lemonde_ftp_articles
- LeMonde FTP retrofitted with Pydantic validation
- OuestFrance excluded from DBT dashboard models (no total article counts)
- Dockerfile and entrypoint for the OuestFrance API→S3 job

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
… tests

- Fix test_stop_word_get_top_keywords_by_channel to match current keyword
  dictionary output (climatique, énergie fossile)
- Add 29 tests for Pydantic schema validation and OuestFrance XML parsing

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Merge Dockerfile_ouest_france and Dockerfile_lemonde_web into one
Dockerfile with a parameterized entrypoint script. The pipeline is
selected via DEFAULT_PIPELINE build arg (or CLI arg / env var at runtime).

- Dockerfile_lemonde_web defaults to "lemonde" (backward compatible)
- CI builds ouest_france image with --build-arg DEFAULT_PIPELINE=ouest_france
- Remove separate Dockerfile_ouest_france and per-pipeline entrypoint scripts
- Add unified docker-entrypoint-press-ingestion.sh

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
- Rename Dockerfile_lemonde_web → Dockerfile_press_ingestion (generic)
- Restore separate docker-entrypoint-lemonde-web.sh (original)
- Keep separate docker-entrypoint-ouest-france.sh
- Remove unified docker-entrypoint-press-ingestion.sh
- Dockerfile accepts ENTRYPOINT_SCRIPT build arg to select which script
- CI builds lemonde_web image with default (lemonde) entrypoint
- CI builds ouest_france image with --build-arg ENTRYPOINT_SCRIPT
- Remove /tmp/ouest_france_articles from Dockerfile (created in Python)

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
The Dockerfile just sets CMD to the lemonde script as default. Both
images are the same build — ouest_france is a tag alias. The Scaleway
job for OuestFrance overrides the command at runtime.

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
…de_web

- Set executable bit on entrypoint scripts in git, remove chmod from Dockerfile
- Build and push a single press_ingestion image
- Retag as lemonde_web for backward compatibility with existing infra
- OuestFrance Scaleway job uses press_ingestion image directly
  (command overridden at runtime to ./docker-entrypoint-ouest-france.sh)

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Pydantic was overkill — we only used field validators (trivial
__post_init__ checks), model_dump (dataclasses.asdict), and
extra="allow" (speculative, removed). Standard dataclasses provide
the same typing guarantees with zero extra dependency.

- Rewrite article_schema.py using @dataclass
- Replace model_dump() → to_dict() in all callers
- Remove explicit pydantic dependency from pyproject.toml
  (still available transitively via langchain_core)
- Update tests accordingly (28 pass)

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Auto-detect between <contenus> (original) and RSS <rss><channel><item>
format. Key differences in RSS:
- <guid> → article ID (instead of <id>)
- <description> → body (instead of <texte>)
- <dc:creator> → byline (instead of <signature>)
- <pubDate> RFC 2822 → parsed to ISO 8601 (instead of <dateParution>)
- <enclosure url> → art (instead of <photos>)
- <link> → article_url (instead of <url>)
- word_count computed from body (no <nombreMots>)
- no snippet, tags, or photo credits in RSS format

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
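
A rough sketch of the format auto-detection and the RSS field mapping from the commit above, using `xml.etree.ElementTree`. The function names and the output keys `an` and `publication_date` are assumptions. The contenus branch is only detected, not parsed, because its item structure is not described in this conversation.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Standard Dublin Core namespace used by <dc:creator> in RSS feeds.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def detect_format(root: ET.Element) -> str:
    """Return "contenus" or "rss" depending on the root element."""
    if root.tag == "contenus":
        return "contenus"
    if root.tag == "rss":
        return "rss"
    raise ValueError(f"unsupported root element: <{root.tag}>")


def parse_rss_item(item: ET.Element) -> dict:
    """Map one RSS <item> to the fields listed in the commit message."""
    body = (item.findtext("description") or "").strip()
    pub_date = item.findtext("pubDate")
    enclosure = item.find("enclosure")
    return {
        "an": item.findtext("guid"),
        "title": item.findtext("title"),
        "body": body,
        "byline": item.findtext("dc:creator", default=None, namespaces=DC_NS),
        # RFC 2822 date -> ISO 8601 string, keeping the UTC offset.
        "publication_date": (
            parsedate_to_datetime(pub_date).isoformat() if pub_date else None
        ),
        "art": enclosure.get("url") if enclosure is not None else None,
        "article_url": item.findtext("link"),
        # RSS has no <nombreMots>, so count whitespace-separated words.
        "word_count": len(body.split()),
    }
```

For an RSS document, the items can be reached with `root.iterfind("channel/item")`.
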
Contenus format (paper articles) → document_type="paper"
RSS format (web articles) → document_type="web"

This is stored in the existing document_type column in factiva_articles,
allowing queries to distinguish paper vs web OuestFrance articles.

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
…essor

LeMonde articles don't have these fields — the nullable columns
default to NULL without explicit mapping.

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
- <surtitre>: prepended to title if present (e.g. "Environnement - Title")
- <localisations>: location names stored in region_of_origin
  (comma-separated, defaults to "France" when empty)

Both formats now fully match the API documentation.

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
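
The two rules in the commit above can be sketched as small pure functions. The function names are hypothetical, and the exact separators (" - " for the title, ", " between locations) are assumptions inferred from the example in the message.

```python
from typing import Iterable, Optional


def build_title(title: str, surtitre: Optional[str]) -> str:
    """Prepend the <surtitre> to the title when one is present."""
    surtitre = (surtitre or "").strip()
    return f"{surtitre} - {title}" if surtitre else title


def build_region_of_origin(locations: Iterable[Optional[str]]) -> str:
    """Join location names, falling back to "France" when there are none."""
    names = [name.strip() for name in locations if name and name.strip()]
    return ", ".join(names) if names else "France"
```
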
Support multiple XML files via OUESTFRANCE_LOCAL_XML (directory, glob,
or comma-separated paths). Deduplicate articles by ID before export
to prevent the same article from being saved twice to S3.

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
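
One possible shape for the multi-file input and the deduplication described in the commit above. The helper names, the `*.xml` pattern used for directories, the `an` ID field and the keep-first policy are all assumptions.

```python
import glob
import os
from typing import List


def resolve_xml_paths(spec: str) -> List[str]:
    """Expand a directory, glob, or comma-separated list into file paths."""
    paths: List[str] = []
    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue
        if os.path.isdir(part):
            # Directory: take every .xml file directly inside it.
            paths.extend(sorted(glob.glob(os.path.join(part, "*.xml"))))
        elif any(ch in part for ch in "*?["):
            # Glob pattern: let glob expand it.
            paths.extend(sorted(glob.glob(part)))
        else:
            # Plain path: pass through unchanged.
            paths.append(part)
    # Drop repeated paths while preserving order.
    return list(dict.fromkeys(paths))


def dedupe_by_id(articles: List[dict], id_field: str = "an") -> List[dict]:
    """Keep the first article seen for each ID."""
    seen = set()
    unique = []
    for article in articles:
        key = article.get(id_field)
        if key in seen:
            continue
        seen.add(key)
        unique.append(article)
    return unique
```

A caller would typically pass `os.environ.get("OUESTFRANCE_LOCAL_XML", "")` to `resolve_xml_paths`; an empty value yields an empty list.
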
…dicators

- Add load_and_insert_monthly_stats() to read a JSON file with monthly
  web/paper article totals and distribute them evenly across days into
  stats_factiva_articles. This provides the denominator for ratio calculations.
- Wire it into main() via OUESTFRANCE_MONTHLY_STATS env var.
- Remove OUESTFR from the exclusion list in print_media_crises_indicators.sql,
  enabling OuestFrance in the daily and monthly dashboard models.

JSON format: [{"year": 2025, "month": 1, "web": 5000, "paper": 3000}, ...]

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
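
To illustrate "distribute them evenly across days", here is a sketch that turns the JSON format above into one row per day. It is not `load_and_insert_monthly_stats()` itself: the function name, the output row shape and the choice to keep fractional daily values (rather than rounding) are assumptions.

```python
import calendar
from datetime import date
from typing import List


def daily_totals(monthly_rows: List[dict]) -> List[dict]:
    """Spread each monthly web/paper total evenly over the month's days."""
    rows = []
    for row in monthly_rows:
        year, month = row["year"], row["month"]
        # monthrange returns (weekday of day 1, number of days in month).
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            rows.append(
                {
                    "date": date(year, month, day),
                    "web": row["web"] / days_in_month,
                    "paper": row["paper"] / days_in_month,
                }
            )
    return rows
```

With the sample entry for January 2025, this produces 31 rows whose `web` values sum back to 5000, up to floating-point rounding.
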
…R (web)

Like LeMonde (LEMOND/LEMFR), OuestFrance now uses separate source codes:
- OUESTFRANCE for paper articles (contenus XML format)
- OUESTFRAFR for web articles (RSS XML format)

Both share media_all="Ouest-France" for grouped dashboard views.
Monthly stats are inserted separately per source code.

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
Web articles (RSS/OUESTFRAFR) now use source_name="Ouest-France.fr"
to match the source_classification entry, while paper articles
(contenus/OUESTFRANCE) keep source_name="Ouest-France".

https://claude.ai/code/session_01KnzVvhxwaWJRmK2q6Ym3oa
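
Taken together, the last few commits map each XML format to a source code, a source name and a document type. A compact way to express that mapping is sketched below. The values come from the commit messages; the `SourceProfile` class and `profile_for` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceProfile:
    source_code: str
    source_name: str
    document_type: str


# Keyed by detected XML format.
SOURCE_PROFILES = {
    "contenus": SourceProfile("OUESTFRANCE", "Ouest-France", "paper"),
    "rss": SourceProfile("OUESTFRAFR", "Ouest-France.fr", "web"),
}


def profile_for(xml_format: str) -> SourceProfile:
    """Return the source metadata for a detected XML format."""
    try:
        return SOURCE_PROFILES[xml_format]
    except KeyError:
        raise ValueError(f"unknown XML format: {xml_format!r}") from None
```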
2 participants