feat(#61): DOWNLOAD_ALWAYS_WHITELIST — operator-controlled re-processing#323
Merged
Conversation
Adds ``ckanext.datapusher_plus.download_always_whitelist``: a whitespace-separated list of hostnames whose resources always get re-downloaded + re-analyzed, bypassing the file-hash-based upload-skip optimization in ``DownloadStage._should_skip_upload``. Use cases: * Hosts that update content in place without changing the published byte hash (e.g., a daily report that overwrites the same URL with the same template but new underlying data). * Local / peered hosts where re-download cost is negligible and operators want forced re-analysis on every push. Issue #61 was originally filed alongside a never-built ``DOWNLOAD_PREVIEW_ONLY`` partial-download mode (DP+'s architecture landed on full-file download for comprehensive metadata inference, so that parent feature is moot). Per the 2026-05-19 BACKLOG-SWEEP3 reinterpretation, the name is kept but the semantic is repurposed as a re-processing trigger — strictly additive to the existing ``IGNORE_FILE_HASH`` global override and the per-job ``ignore_hash`` metadata flag. Implementation: * ``config.py`` parses the value into a ``frozenset`` of lowercased hostnames at import time (mirrors the ``FORMATS`` list-config pattern; empty default preserves legacy behavior). * New ``_host_in_always_whitelist(url)`` helper in ``jobs/stages/download.py``: case-insensitive exact hostname match, port-stripped, defensive against missing/unparseable URLs (returns ``False`` rather than raising on the hot path). * ``_should_skip_upload`` short-circuits to ``False`` when the resource URL's host is in the whitelist, with an INFO log so operators can see WHY a whitelisted host bypassed the dedup path. * ``config_declaration.yaml`` declares the key as ``editable: true`` with operator-facing docs and an example. Tests (18, all passing — suite 251 → 269 Python): * Config parsing: empty / whitespace-string / list-value / empty-entries-filtered (4 tests). * Helper: exact match, case-insensitive, port-stripped, IP literal, unparseable URL, empty URL, ``None`` URL, hostless URL (8 tests). * Integration: ``_should_skip_upload`` bypasses skip when whitelisted, preserves skip when NOT whitelisted, preserves legacy behavior when whitelist is empty (3 tests). * Drift guard: ``config.py`` inline fallback agrees with ``config_declaration.yaml`` default — the #179-style declaration-vs-fallback regression that bit DP+ once before (1 test, AST-parsed to avoid CKAN bootstrap). Closes #61. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roborev review #2293 surfaced 4 LOW findings on the initial DOWNLOAD_ALWAYS_WHITELIST implementation. All addressed: * **L1 — ``editable: true`` semantics:** The whitelist was declared ``editable: true`` (implying admin-UI live edits should take effect) but parsed once at module import into a frozenset constant. Operators editing via the admin UI would see no effect until worker restart — inconsistent with sibling ``file_hash_algorithm`` (#221) which is read live. Fix: removed the ``DOWNLOAD_ALWAYS_WHITELIST`` module constant from ``config.py`` (replaced with a docstring mirroring the ``file_hash_algorithm`` pattern). Added ``_get_download_always_whitelist()`` to ``download.py`` that reads + parses ``tk.config`` live on each call. Parse cost is negligible since hash-skip fires at most once per download. * **L2 — comment said ``set`` but constant was ``frozenset``:** Obsoleted by L1 (the comment block was rewritten). * **L3 — double ``urlparse(url)`` per skip-check:** The original ``_should_skip_upload`` called ``urlparse`` once inside ``_host_in_always_whitelist`` (for the match) and again in the ``info`` log to recover the hostname. Fix: ``_host_in_always_whitelist`` now returns ``Optional[str]`` — the matched lowercased host (port stripped) or ``None``. The caller passes that returned host directly into the log call. One urlparse per whitelist check. * **L4 — no test pinned subdomain non-matching semantics:** The implementation does exact-hostname matching by design (so ``data.gov`` does NOT match ``subdomain.data.gov``), but no test pinned that. A future "support subdomain wildcards" patch could silently change behavior. Fix: added two regression tests (``test_subdomain_does_not_match_parent_domain``, ``test_parent_domain_does_not_match_subdomain_entry``) and called out "**Exact-match only**" in the YAML declaration's description. Tests: 18 → 23 (+5): * Subdomain non-match × 2 directions (L4). * Live-read picks up runtime changes (L1 — proves the ``editable: true`` declaration is now truthful). * Malformed-value defense (a non-iterable int in tk.config degrades to empty frozenset, doesn't crash). * Empty-string parse split from the unset-key case. Full Python suite: 269 → 274, all passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds an operator-configurable “always re-process” hostname whitelist to DataPusher+’s download deduplication logic, allowing certain hosts to bypass the file-hash upload-skip optimization (useful for in-place-updating sources and cheap/local downloads).
Changes:
- Introduces a live-read config parser (
_get_download_always_whitelist) and hostname matcher (_host_in_always_whitelist) in the download stage. - Updates
DownloadStage._should_skip_uploadto force re-processing (no skip) when the resource URL host is whitelisted, with an INFO log explaining the bypass. - Declares the new config key in
config_declaration.yamland adds a comprehensive regression test suite for issue #61 (including a YAML-vs-code default drift guard).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ckanext/datapusher_plus/jobs/stages/download.py |
Adds live-read whitelist parsing + host matching; bypasses hash-skip when host is whitelisted. |
ckanext/datapusher_plus/config.py |
Documents why the new setting is intentionally not a module-level constant (live-read behavior). |
ckanext/datapusher_plus/config_declaration.yaml |
Declares ckanext.datapusher_plus.download_always_whitelist with operator-facing documentation and example. |
tests/test_issue_61_download_always_whitelist.py |
Adds regression tests for parsing, matching, _should_skip_upload behavior, and default drift guard. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ness Two LOW nits on tests/test_issue_61_download_always_whitelist.py from Copilot's review of PR #323: * **Brittle group-index lookup**: ``decl["groups"][0]["options"]`` assumed the target option lived in the first group, breaking the drift-guard if ``config_declaration.yaml`` ever grows additional groups or the key is moved between groups. Switched to iterating all ``groups`` / ``options`` to find the target key by name — matches the pattern in ``test_issue_112_decimal_comma.py:175-185``. Also added a descriptive assertion message ("key is missing from config_declaration.yaml — issue #61 should declare it") if the declaration is ever removed. * **Unconditional ``import yaml``**: Other drift-guard tests in the repo (``test_issue_112_*``, ``test_issue_142_*``) wrap ``import yaml`` in ``try/except ImportError → pytest.skip(...)`` so CI environments running unit tests without the full extension dep set degrade cleanly. Adopted the same pattern. Tests: still 23 in the file, full Python suite still 274 passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prefect 3.7.1's ``workers/base.py`` imports ``importlib_metadata``
unconditionally but doesn't declare it as a direct dependency. Until
this morning the import was satisfied transitively via
``opentelemetry-api`` 1.41.x; ``opentelemetry-api`` 1.42.0 (released
2026-05-19, ~11:00 UTC) dropped that transitive, so every PR CI run
since started failing in the "Start Prefect worker" step:
ModuleNotFoundError: No module named 'importlib_metadata'
Verified by diffing the ``pip install`` output between the last
successful main run (26077880669) and PR #323's failing runs — the
sole packaging difference is ``importlib-metadata-8.7.1`` present in
the former, absent in the latter, alongside ``opentelemetry-api``
bumping 1.41.1 → 1.42.0.
Fix: declare ``importlib_metadata`` as a direct DP+ requirement so
its presence doesn't depend on which transitive deps Prefect's
sibling packages happen to pull in this week. Pinned loose because
the backport's API is stable across versions.
This is functionally the same problem class as issue #149 — a
runtime dependency on a transitive package that was working by
accident until an upstream version bump exposed it. Different
package, same shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ckanext.datapusher_plus.download_always_whitelist— a whitespace-separated list of hostnames whose resources always trigger re-processing, bypassing the file-hash-based upload-skip optimization inDownloadStage._should_skip_upload.DOWNLOAD_PREVIEW_ONLYpartial-download mode. Per the BACKLOG-SWEEP3 reinterpretation (2026-05-19), the name is kept but repurposed as a re-processing trigger for hosts that update content in place without changing the byte hash, or for local/peered hosts where re-download is cheap.IGNORE_FILE_HASHglobal override and the per-jobignore_hashmetadata flag.Implementation notes
config.pyparses the value into afrozensetof lowercased hostnames at import time (mirrors theFORMATSlist-config pattern). Empty default preserves legacy behavior — operators upgrading in place with no config change get exactly the pre-Smart resource download #61 behavior._host_in_always_whitelist(url)helper injobs/stages/download.py: case-insensitive exact hostname match, port-stripped, defensive against missing/unparseable URLs (returnsFalserather than raising on the hot path)._should_skip_uploadshort-circuits toFalsewhen the resource URL's host is whitelisted, with an INFO log so operators can see WHY a whitelisted host bypassed the dedup path.config_declaration.yamldeclares the key aseditable: truewith operator-facing docs + example.Test plan
NoneURL, hostless URL (8 tests)._should_skip_uploadbypasses skip when whitelisted, preserves skip when NOT whitelisted, preserves legacy behavior when whitelist is empty (3 tests).config.pyinline fallback agrees withconfig_declaration.yamldefault (the Date without timestamp #179-style regression that bit DP+ once before — AST-parsed to avoid CKAN bootstrap).Closes #61.
🤖 Generated with Claude Code