Skip to content

feat(#61): DOWNLOAD_ALWAYS_WHITELIST — operator-controlled re-processing#323

Merged
jqnatividad merged 4 commits into
mainfrom
fix-61-download-always-whitelist
May 19, 2026
Merged

feat(#61): DOWNLOAD_ALWAYS_WHITELIST — operator-controlled re-processing#323
jqnatividad merged 4 commits into
mainfrom
fix-61-download-always-whitelist

Conversation

@jqnatividad
Copy link
Copy Markdown
Collaborator

Summary

  • Adds ckanext.datapusher_plus.download_always_whitelist — a whitespace-separated list of hostnames whose resources always trigger re-processing, bypassing the file-hash-based upload-skip optimization in DownloadStage._should_skip_upload.
  • The original Smart resource download #61 (Feb 2023) paired this with a never-built DOWNLOAD_PREVIEW_ONLY partial-download mode. Per the BACKLOG-SWEEP3 reinterpretation (2026-05-19), the name is kept but repurposed as a re-processing trigger for hosts that update content in place without changing the byte hash, or for local/peered hosts where re-download is cheap.
  • Strictly additive to the existing IGNORE_FILE_HASH global override and the per-job ignore_hash metadata flag.

Implementation notes

  • config.py parses the value into a frozenset of lowercased hostnames at import time (mirrors the FORMATS list-config pattern). Empty default preserves legacy behavior — operators upgrading in place with no config change get exactly the pre-Smart resource download #61 behavior.
  • New _host_in_always_whitelist(url) helper in jobs/stages/download.py: case-insensitive exact hostname match, port-stripped, defensive against missing/unparseable URLs (returns False rather than raising on the hot path).
  • _should_skip_upload short-circuits to False when the resource URL's host is whitelisted, with an INFO log so operators can see WHY a whitelisted host bypassed the dedup path.
  • config_declaration.yaml declares the key as editable: true with operator-facing docs + example.

Test plan

  • Config parsing: empty / whitespace-string / list-value / empty-entries-filtered (4 tests).
  • Helper: exact match, case-insensitive, port-stripped, IP literal, unparseable URL, empty URL, None URL, hostless URL (8 tests).
  • Integration: _should_skip_upload bypasses skip when whitelisted, preserves skip when NOT whitelisted, preserves legacy behavior when whitelist is empty (3 tests).
  • Declaration-vs-fallback drift guard: config.py inline fallback agrees with config_declaration.yaml default (the Date without timestamp #179-style regression that bit DP+ once before — AST-parsed to avoid CKAN bootstrap).
  • Full suite: 251 → 269 Python tests, all passing.

Closes #61.

🤖 Generated with Claude Code

jqnatividad and others added 2 commits May 19, 2026 07:27
Adds ``ckanext.datapusher_plus.download_always_whitelist``: a
whitespace-separated list of hostnames whose resources always
get re-downloaded + re-analyzed, bypassing the file-hash-based
upload-skip optimization in ``DownloadStage._should_skip_upload``.

Use cases:

* Hosts that update content in place without changing the
  published byte hash (e.g., a daily report that overwrites
  the same URL with the same template but new underlying data).
* Local / peered hosts where re-download cost is negligible
  and operators want forced re-analysis on every push.

Issue #61 was originally filed alongside a never-built
``DOWNLOAD_PREVIEW_ONLY`` partial-download mode (DP+'s architecture
landed on full-file download for comprehensive metadata inference,
so that parent feature is moot). Per the 2026-05-19 BACKLOG-SWEEP3
reinterpretation, the name is kept but the semantic is repurposed
as a re-processing trigger — strictly additive to the existing
``IGNORE_FILE_HASH`` global override and the per-job
``ignore_hash`` metadata flag.

Implementation:

* ``config.py`` parses the value into a ``frozenset`` of
  lowercased hostnames at import time (mirrors the ``FORMATS``
  list-config pattern; empty default preserves legacy behavior).
* New ``_host_in_always_whitelist(url)`` helper in
  ``jobs/stages/download.py``: case-insensitive exact hostname
  match, port-stripped, defensive against missing/unparseable
  URLs (returns ``False`` rather than raising on the hot path).
* ``_should_skip_upload`` short-circuits to ``False`` when the
  resource URL's host is in the whitelist, with an INFO log so
  operators can see WHY a whitelisted host bypassed the dedup
  path.
* ``config_declaration.yaml`` declares the key as ``editable:
  true`` with operator-facing docs and an example.

Tests (18, all passing — suite 251 → 269 Python):

* Config parsing: empty / whitespace-string / list-value /
  empty-entries-filtered (4 tests).
* Helper: exact match, case-insensitive, port-stripped, IP
  literal, unparseable URL, empty URL, ``None`` URL, hostless
  URL (8 tests).
* Integration: ``_should_skip_upload`` bypasses skip when
  whitelisted, preserves skip when NOT whitelisted, preserves
  legacy behavior when whitelist is empty (3 tests).
* Drift guard: ``config.py`` inline fallback agrees with
  ``config_declaration.yaml`` default — the #179-style
  declaration-vs-fallback regression that bit DP+ once before
  (1 test, AST-parsed to avoid CKAN bootstrap).

Closes #61.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roborev review #2293 surfaced 4 LOW findings on the initial
DOWNLOAD_ALWAYS_WHITELIST implementation. All addressed:

* **L1 — ``editable: true`` semantics:** The whitelist was declared
  ``editable: true`` (implying admin-UI live edits should take
  effect) but parsed once at module import into a frozenset
  constant. Operators editing via the admin UI would see no effect
  until worker restart — inconsistent with sibling
  ``file_hash_algorithm`` (#221) which is read live.

  Fix: removed the ``DOWNLOAD_ALWAYS_WHITELIST`` module constant
  from ``config.py`` (replaced with a docstring mirroring the
  ``file_hash_algorithm`` pattern). Added
  ``_get_download_always_whitelist()`` to ``download.py`` that
  reads + parses ``tk.config`` live on each call. Parse cost is
  negligible since hash-skip fires at most once per download.

* **L2 — comment said ``set`` but constant was ``frozenset``:**
  Obsoleted by L1 (the comment block was rewritten).

* **L3 — double ``urlparse(url)`` per skip-check:** The original
  ``_should_skip_upload`` called ``urlparse`` once inside
  ``_host_in_always_whitelist`` (for the match) and again in the
  ``info`` log to recover the hostname.

  Fix: ``_host_in_always_whitelist`` now returns ``Optional[str]``
  — the matched lowercased host (port stripped) or ``None``. The
  caller passes that returned host directly into the log call.
  One urlparse per whitelist check.

* **L4 — no test pinned subdomain non-matching semantics:** The
  implementation does exact-hostname matching by design (so
  ``data.gov`` does NOT match ``subdomain.data.gov``), but no
  test pinned that. A future "support subdomain wildcards" patch
  could silently change behavior.

  Fix: added two regression tests
  (``test_subdomain_does_not_match_parent_domain``,
  ``test_parent_domain_does_not_match_subdomain_entry``) and
  called out "**Exact-match only**" in the YAML declaration's
  description.

Tests: 18 → 23 (+5):
* Subdomain non-match × 2 directions (L4).
* Live-read picks up runtime changes (L1 — proves the
  ``editable: true`` declaration is now truthful).
* Malformed-value defense (a non-iterable int in tk.config
  degrades to empty frozenset, doesn't crash).
* Empty-string parse split from the unset-key case.

Full Python suite: 269 → 274, all passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an operator-configurable “always re-process” hostname whitelist to DataPusher+’s download deduplication logic, allowing certain hosts to bypass the file-hash upload-skip optimization (useful for in-place-updating sources and cheap/local downloads).

Changes:

  • Introduces a live-read config parser (_get_download_always_whitelist) and hostname matcher (_host_in_always_whitelist) in the download stage.
  • Updates DownloadStage._should_skip_upload to force re-processing (no skip) when the resource URL host is whitelisted, with an INFO log explaining the bypass.
  • Declares the new config key in config_declaration.yaml and adds a comprehensive regression test suite for issue #61 (including a YAML-vs-code default drift guard).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
ckanext/datapusher_plus/jobs/stages/download.py Adds live-read whitelist parsing + host matching; bypasses hash-skip when host is whitelisted.
ckanext/datapusher_plus/config.py Documents why the new setting is intentionally not a module-level constant (live-read behavior).
ckanext/datapusher_plus/config_declaration.yaml Declares ckanext.datapusher_plus.download_always_whitelist with operator-facing documentation and example.
tests/test_issue_61_download_always_whitelist.py Adds regression tests for parsing, matching, _should_skip_upload behavior, and default drift guard.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_issue_61_download_always_whitelist.py Outdated
Comment thread tests/test_issue_61_download_always_whitelist.py Outdated
jqnatividad and others added 2 commits May 19, 2026 10:57
…ness

Two LOW nits on tests/test_issue_61_download_always_whitelist.py
from Copilot's review of PR #323:

* **Brittle group-index lookup**:
  ``decl["groups"][0]["options"]`` assumed the target option lived
  in the first group, breaking the drift-guard if
  ``config_declaration.yaml`` ever grows additional groups or the
  key is moved between groups. Switched to iterating all
  ``groups`` / ``options`` to find the target key by name —
  matches the pattern in
  ``test_issue_112_decimal_comma.py:175-185``. Also added a
  descriptive assertion message ("key is missing from
  config_declaration.yaml — issue #61 should declare it") if the
  declaration is ever removed.

* **Unconditional ``import yaml``**:
  Other drift-guard tests in the repo (``test_issue_112_*``,
  ``test_issue_142_*``) wrap ``import yaml`` in
  ``try/except ImportError → pytest.skip(...)`` so CI
  environments running unit tests without the full extension dep
  set degrade cleanly. Adopted the same pattern.

Tests: still 23 in the file, full Python suite still 274 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prefect 3.7.1's ``workers/base.py`` imports ``importlib_metadata``
unconditionally but doesn't declare it as a direct dependency. Until
this morning the import was satisfied transitively via
``opentelemetry-api`` 1.41.x; ``opentelemetry-api`` 1.42.0 (released
2026-05-19, ~11:00 UTC) dropped that transitive, so every PR CI run
since started failing in the "Start Prefect worker" step:

    ModuleNotFoundError: No module named 'importlib_metadata'

Verified by diffing the ``pip install`` output between the last
successful main run (26077880669) and PR #323's failing runs — the
sole packaging difference is ``importlib-metadata-8.7.1`` present in
the former, absent in the latter, alongside ``opentelemetry-api``
bumping 1.41.1 → 1.42.0.

Fix: declare ``importlib_metadata`` as a direct DP+ requirement so
its presence doesn't depend on which transitive deps Prefect's
sibling packages happen to pull in this week. Pinned loose because
the backport's API is stable across versions.

This is functionally the same problem class as issue #149 — a
runtime dependency on a transitive package that was working by
accident until an upstream version bump exposed it. Different
package, same shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jqnatividad jqnatividad merged commit fac22aa into main May 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Smart resource download

2 participants