
Add multi-value GET-parameter chunker for waterdata OGC API #276

Draft

thodson-usgs wants to merge 5 commits into DOI-USGS:main from thodson-usgs:multivalue-chunker


Conversation

@thodson-usgs
Collaborator

Summary

Adds a multi-value GET-parameter chunker (dataretrieval/waterdata/chunking.py) that splits long multi-value list params across sub-requests (the cartesian product of the per-param partitions) so each URL fits the server's ~8 KB byte limit. It sits OUTSIDE @filters.chunked on _fetch_once, so list-chunking is the outer loop and CQL-filter-chunking is the inner loop. The two chunkers coordinate via a filter-aware probe.

Depends on #273. That PR fixes the pre-existing silent-truncation bug in _walk_pages, whose blast radius this PR materially expands. Merge order: #273 → #233 → this PR. See "Merge ordering" below.

Why

PR #233 routes multi-value waterdata queries through GET with comma-separated values (e.g. monitoring_location_id=USGS-A,USGS-B,…). For real-world workloads (~300+ sites or pcodes), the URL blows past the server's 8 KB nginx buffer and the API returns HTTP 414. This PR makes long list params transparently fan out.

Design

```python
@chunking.multi_value_chunked(build_request=_construct_api_requests)   # outer
@filters.chunked(build_request=_construct_api_requests)                # inner
def _fetch_once(args: dict) -> tuple[DataFrame, Response]: ...
```
  • Planner (_plan_chunks): greedy halving of the largest chunk across all dims until the worst-case sub-request URL fits the byte limit. The cartesian product of the per-dim partitions becomes the sub-request set. O(log N) per dim (see the sketch after this list).
  • Coordination with filters.chunked: when planning, the URL probe substitutes the filter with its longest top-level OR-clause via _filter_aware_probe_args. This models the per-sub-request URL that filters.chunked will actually emit — the filter chunker can shrink the filter down to one clause per sub-request but no smaller (it bails if any clause exceeds the budget). Without this coordination, a long OR-filter combined with multi-value lists triggered premature RequestTooLarge exceptions even when the combined chunkers would have made things fit.
  • Hard cap (max_chunks, default 1000): the cartesian product won't exceed this. Note: this caps the count of chunks, not the count of HTTP requests, because each chunk may paginate (each pagination is a separate HTTP request that counts against the hourly quota).
  • Quota-aware abort (quota_safety_floor, default 50): after every sub-request the wrapper reads x-ratelimit-remaining. If it drops below the floor, the wrapper raises QuotaExhausted carrying the combined partial frame, the chunk offset, and the last-observed remaining, letting callers either salvage what they have or resume from the next chunk after the hourly window resets. This is critical because of the pagination silent-truncation bug fixed in #273; without quota awareness the chunker could otherwise drive the API into 429 territory mid-call and (pre-#273) silently lose pagination pages.
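
For orientation, here is a minimal sketch of the greedy-halving idea from the planner bullet above. It is not the actual _plan_chunks implementation: the helper names, the `url_length_of` probe callable, and the plain `RuntimeError`s are stand-ins for the real internals (which use the filter-aware probe and raise RequestTooLarge).

```python
from itertools import product
from math import ceil


def plan_chunks(list_params: dict[str, list[str]], url_length_of, url_limit: int,
                max_chunks: int = 1000) -> list[dict[str, list[str]]]:
    """Greedy-halving sketch: split the largest chunk until the worst-case URL fits."""
    # One partition list per dimension; start with everything in a single chunk.
    partitions = {k: [list(v)] for k, v in list_params.items()}

    def worst_case_url() -> int:
        # The longest URL pairs the largest remaining chunk from every dimension.
        worst = {k: max(chunks, key=len) for k, chunks in partitions.items()}
        return url_length_of(worst)

    while worst_case_url() > url_limit:
        # Halve the largest chunk across all dimensions.
        dim, chunks = max(partitions.items(),
                          key=lambda kv: max(len(c) for c in kv[1]))
        big = max(chunks, key=len)
        if len(big) == 1:
            raise RuntimeError("URL cannot be made to fit")  # RequestTooLarge in this PR
        chunks.remove(big)
        mid = ceil(len(big) / 2)
        chunks.extend([big[:mid], big[mid:]])

    # Cartesian product of per-dimension partitions = the sub-request set.
    keys = list(partitions)
    plans = [dict(zip(keys, combo)) for combo in product(*partitions.values())]
    if len(plans) > max_chunks:
        raise RuntimeError(f"{len(plans)} chunks exceeds max_chunks={max_chunks}")
    return plans
```

A call shaped like `plan_chunks({"monitoring_location_id": sites, "parameter_code": pcodes}, my_probe, url_limit=8000)` would return one dict of list-slices per sub-request; the real planner's probe additionally substitutes the filter with its longest OR-clause, as described in the coordination bullet.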

Failure modes (post-fix)

| Scenario | Result |
| --- | --- |
| URL fits without chunking | Decorator passes through; one request, no overhead beyond a single planning probe |
| Lists need splitting | Greedy halve → cartesian product of N sub-requests |
| Filter chunking also needed | Outer mv-loop × inner filter-loop; identical-row dedup at combine |
| Filter has clauses bigger than achievable budget | RequestTooLarge with explicit message |
| Cartesian product > max_chunks | RequestTooLarge with the actual count |
| x-ratelimit-remaining drops below floor mid-call | QuotaExhausted with .partial_frame, .completed_chunks, .total_chunks |
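
For illustration, a call site might handle the quota abort in the last row like this. This is a sketch only: the get_daily keyword arguments, the single-frame return shape, and the import paths are simplifying assumptions based on the file names mentioned in this PR, while the exception attributes are the ones listed under "Public API additions" below.

```python
from dataretrieval.waterdata.api import get_daily
from dataretrieval.waterdata.chunking import QuotaExhausted

site_ids = ["USGS-01646500", "USGS-01647000"]  # imagine ~300 of these

try:
    df = get_daily(monitoring_location_id=site_ids, parameter_code="00060")
except QuotaExhausted as exc:
    # Salvage whatever completed before the quota safety floor was hit ...
    df = exc.partial_frame
    print(f"{exc.completed_chunks}/{exc.total_chunks} chunks finished, "
          f"{exc.remaining} requests left in the hourly window")
    # ... and re-issue the remaining chunks after the hourly window resets.
```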

Coordination empirically verified

Validated against api.waterdata.usgs.gov (worktree contains the three live test scripts):

| Test | Setup | Result |
| --- | --- | --- |
| 3-dim equivalence | 4 sites × 3 pcodes × 3 stats, url_limit=220 | 16 sub-requests, every axis confirmed split, chunked frame exactly equal to the unchunked reference |
| Stress (6 edge cases) | Singletons, lopsided sizes, monitoring-locations POST path, RequestTooLarge floor, etc. | All pass |
| Filter/mv composition (3 regimes) | mv-only / filter-only / both, with previously-failing url_limit=220 | All pass; frames match references, and the "both" case would have raised RequestTooLarge without the filter-aware probe |

Test plan

  • Unit tests for _filter_aware_probe_args (8 cases)
  • Unit tests for _plan_chunks greedy halving, RequestTooLarge floor, coordination with filter chunker, max_chunks cap
  • Unit tests for multi_value_chunked pass-through, cartesian product shape, lazy URL-limit reads, max_chunks override
  • Unit tests for QuotaExhausted machinery (8 cases: header parsing, default floor, mid-call abort with partial frame, last-chunk-no-abort, zero-floor disable, message contents)
  • Defensive: filter and filter_lang are excluded from multi-value chunking (via _NEVER_CHUNK) so a caller passing filter as a list can't produce malformed CQL
  • Live: 3-dim equivalence, 6-case stress matrix, 3-regime filter/mv composition all green against production API
  • No regression: existing _construct_api_requests_* and _check_* tests still pass

30 new tests plus the existing suite; everything runs offline (using a deterministic fake build_request). The live scripts live in /tmp of the development worktree and are not part of the test suite.
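
As a flavor of that offline pattern (not an actual test from tests/waterdata_test.py), a deterministic fake URL-length probe lets the planner be exercised without any network; plan_chunks here refers to the simplified sketch shown after the Design section, not the real _plan_chunks.

```python
def fake_url_length(params: dict[str, list[str]]) -> int:
    # Deterministic stand-in for "length of the URL this request would produce".
    return sum(len(",".join(v)) + len(k) + 2 for k, v in params.items())


def test_greedy_halving_fits_budget():
    sites = [f"USGS-{i:08d}" for i in range(300)]
    plans = plan_chunks({"monitoring_location_id": sites},
                        fake_url_length, url_limit=2000)
    # Every planned sub-request fits the budget ...
    assert all(fake_url_length(p) <= 2000 for p in plans)
    # ... and no site is lost or duplicated across chunks.
    assert sorted(s for p in plans for s in p["monitoring_location_id"]) == sorted(sites)
```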

Merge ordering

If the team prefers to land in a different order, this PR can ship without #273, but the chunker should be considered production-ready only once #273 is merged.

Files changed

  • dataretrieval/waterdata/chunking.py (new) — decorator, planner, RequestTooLarge, QuotaExhausted
  • dataretrieval/waterdata/utils.py — wire @chunking.multi_value_chunked outside @filters.chunked on _fetch_once; updated docstring
  • tests/waterdata_test.py — 30 new tests covering planner, coordination, cap, quota machinery

Public API additions

  • dataretrieval.waterdata.chunking.RequestTooLarge — raised when planning can't make the URL fit
  • dataretrieval.waterdata.chunking.QuotaExhausted — raised mid-call when the quota safety floor is breached; carries .partial_frame, .partial_response, .completed_chunks, .total_chunks, .remaining
  • dataretrieval.waterdata.chunking.multi_value_chunked — decorator, in case downstream wraps a custom _fetch_once
  • Module constants _DEFAULT_MAX_CHUNKS = 1000, _DEFAULT_QUOTA_SAFETY_FLOOR = 50 (underscored — not part of the stable public API; available for monkey-patching by users with non-default quotas)
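
For a deployment with a non-default quota, a monkey-patch would look like the following sketch. These are the underscored module constants named above, so treat this as an at-your-own-risk illustration rather than stable API.

```python
from dataretrieval.waterdata import chunking

# Larger hourly quota: allow more sub-requests and keep a bigger safety margin.
chunking._DEFAULT_MAX_CHUNKS = 2000
chunking._DEFAULT_QUOTA_SAFETY_FLOOR = 100
```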

Known limitations (deferred)

  • The cap counts chunks, not HTTP requests. Paginated sub-requests each make multiple HTTP calls; a chunked call with 1000 chunks, each paginating 5×, makes 5000 HTTP requests. Documented workaround: lower max_chunks proportionally for paginating workloads, or rely on the QuotaExhausted abort to catch overruns at runtime.
  • No auto-retry / no sleep-until-reset on QuotaExhausted. Caller decides. Could add on_quota_exhausted="sleep" mode in a follow-up if desired.
  • Sequential sub-requests. Parallelism would reduce wall time but complicates rate-limit handling.

🤖 Generated with Claude Code

thodson-usgs and others added 5 commits May 14, 2026 08:24
The OGC API now supports comma-separated values for fields like
monitoring_location_id, parameter_code, and statistic_id, making POST+CQL2
unnecessary for most services. Update _construct_api_requests to join list
params with commas and use GET for daily, continuous, latest-daily,
latest-continuous, field-measurements, time-series-metadata, and
channel-measurements.

The monitoring-locations endpoint does not yet support comma-separated GET
parameters (returns 400); it retains the POST+CQL2 path. Closes DOI-USGS#210.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nit tests

Style alignment with the rest of `waterdata/utils.py`:

- Hoist `_cql2_required_services` from a function-local lowercase `set`
  to a module-level `_CQL2_REQUIRED_SERVICES = frozenset(...)` to match
  the convention of `_DATE_RANGE_PARAMS`, `_NO_NORMALIZE_PARAMS`,
  `_MONITORING_LOCATION_ID_RE`, etc.
- Drop the "Legacy path:" prefix in the inline comment. POST/CQL2 is
  still the current and required path for monitoring-locations — the
  API team hasn't promised to add comma-GET there. Rephrased the two
  branches symmetrically ("POST with CQL2 JSON" / "GET with comma-
  separated values") so neither reads as deprecated.

New unit tests:

- `test_construct_api_requests_single_value_stays_get` — confirms a
  scalar `monitoring_location_id="USGS-..."` still produces a clean GET
  with no `%2C`, i.e. existing single-site callers see no change.
- `test_construct_api_requests_numeric_list_joins_with_str` — pins down
  that `water_year=[2020, 2021]` reaches the URL as `water_year=2020%2C2021`,
  exercising the `str(x) for x in v` generator that exists specifically
  to handle non-string list params (without it, `",".join` on a list of
  ints would TypeError).
- `test_construct_api_requests_two_element_date_list_becomes_interval` —
  pins down the contract that a two-element date list (`time=["2024-01-01",
  "2024-01-31"]`) is interpreted as start/end of an OGC datetime interval
  (joined with `/`), NOT as two discrete dates. The OGC `datetime`
  parameter doesn't support "these N specific dates" — that would
  require a CQL filter. Test exists so this semantic choice can't be
  silently changed.
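
A minimal sketch of the joining contract those three tests pin down; join_param is an illustrative stand-in, not the real _construct_api_requests logic (URL-encoding of the comma and slash happens later, at URL construction).

```python
def join_param(key: str, value) -> str:
    if isinstance(value, (list, tuple)):
        if key == "time" and len(value) == 2:
            return "/".join(str(x) for x in value)   # OGC datetime interval
        return ",".join(str(x) for x in value)       # comma-separated multi-value
    return str(value)

assert join_param("water_year", [2020, 2021]) == "2020,2021"
assert join_param("time", ["2024-01-01", "2024-01-31"]) == "2024-01-01/2024-01-31"
```
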
Wraps _fetch_once with a cartesian-product chunker that sits OUTSIDE
@filters.chunked. Splits multi-value list params (monitoring_location_id,
parameter_code, statistic_id, etc.) across sub-requests so each URL fits
the server's ~8 KB byte limit.

Coordination with @filters.chunked: the planner's URL probe substitutes
the filter with its longest top-level OR-clause via _filter_aware_probe_args,
modeling the per-sub-request URL the inner filter chunker will actually
emit. Without this coordination, a long OR-filter plus multi-value lists
triggered premature RequestTooLarge even when the combined chunkers would
have made things fit.

Two safety guards:
- max_chunks=1000 cap on cartesian-product size (matches USGS API hourly
  quota; raises RequestTooLarge with the actual count when exceeded).
- QuotaExhausted abort: between sub-requests, reads x-ratelimit-remaining;
  if below quota_safety_floor (default 50), raises with the partial frame
  and chunk offset so callers can resume instead of crashing into a
  mid-call HTTP 429.

30 unit tests cover the planner, filter-aware coordination, the cap, and
the quota-aware abort. Live tests in /tmp verify a 3-dim equivalence case
(chunked == unchunked, 16 sub-requests, all axes split), 6 edge-case
stress scenarios, and 3 mv/filter composition regimes.

Depends on DOI-USGS#273 (paginated silent-truncation fix) — this PR multiplies
the frequency at which the silent-truncation bug class would have
surfaced. Merge order: DOI-USGS#273 -> DOI-USGS#233 -> this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The filter chunker (`filters.chunked`) splits a filter into chunks each
≤ the per-sub-request budget, but bails (returns the full filter
unchanged) when ANY single OR-clause exceeds the budget. So the smallest
filter size the inner chunker can guarantee to emit per sub-request is
bounded below by the LARGEST single clause, not the smallest.

The original implementation used `min(parts)` to model the chunker's
output floor. For filters with uniform clause sizes (all my prior
tests), min == max and the bug was hidden. For filters with lopsided
clauses — e.g. `id='1' OR id='abcdef…long-string'` — using `min` would
let the planner falsely declare a plan feasible. The inner chunker
would then bail on the large clause, the real per-sub-request URL
would carry the full filter, and the request would 414 server-side.

Switch to `max(parts, key=len)`. If singleton+max-clause fits the URL
limit, the inner chunker's budget is ≥ max(parts), so all clauses fit
individually and chunking succeeds. If singleton+max-clause doesn't
fit, the planner correctly raises `RequestTooLarge` instead of
producing an unservable plan.

Regression test: `test_plan_chunks_probes_with_max_clause_not_min`.
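
An illustrative snippet of the floor model described above (hypothetical variable names):

```python
# Lopsided top-level OR-clauses: min() and max() model very different floors.
parts = ["id='1'", "id='abcdefghijklmnopqrstuvwxyz0123456789'"]

probe_clause = max(parts, key=len)   # guaranteed lower bound on what the inner
                                     # filter chunker can emit per sub-request
# Using min(parts, key=len) here under-estimates the per-sub-request URL and
# lets the planner accept a plan the filter chunker cannot actually serve.
```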

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The motivating user story for this PR is the same one R `dataRetrieval`
covers in #870: pull a long monitoring-location list from one getter,
feed it to another. Before chunking this fails with HTTP 414 once the
URL grows past the server's ~8 KB limit; after it transparently fans out.

- chunking.py: prepend a docstring example showing the Ohio-stream-sites
  → daily-discharge chained call, so readers landing on the module file
  see the motivating scenario immediately.
- api.py get_daily: add the same chained example to the Examples block
  (where similar single-site and multi-site examples already live), so
  the most-used getter's docstring shows what just became possible.
- NEWS.md: user-visible entry framing the change in terms of "this now
  works" — chained queries, transparent chunking, max_chunks cap, and
  QuotaExhausted resume. References R PR #870 as the analogous change.

No code changes; pure docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>