
Add multi-value GET-parameter chunker for waterdata OGC API #276

Draft

thodson-usgs wants to merge 5 commits into DOI-USGS:main from thodson-usgs:multivalue-chunker


Conversation

@thodson-usgs
Collaborator

Summary

Adds a multi-value GET-parameter chunker (dataretrieval/waterdata/chunking.py) that splits long multi-value list params across sub-requests (the cartesian product of the per-param partitions) so each URL fits the server's ~8 KB byte limit. It sits OUTSIDE @filters.chunked on _fetch_once, so list-chunking is the outer loop and CQL-filter-chunking is the inner loop. The two chunkers coordinate via a filter-aware probe.

Depends on #273. That PR fixes the pre-existing silent-truncation bug in _walk_pages, whose blast radius this PR materially expands. Merge order: #273 → #233 → this PR. See "Merge ordering" below.

Why

PR #233 routes multi-value waterdata queries through GET with comma-separated values (e.g. monitoring_location_id=USGS-A,USGS-B,…). For real-world workloads (~300+ sites or pcodes), the URL blows past the server's 8 KB nginx buffer and the API returns HTTP 414. This PR makes long list params transparently fan out.

Design

```python
@chunking.multi_value_chunked(build_request=_construct_api_requests)   # outer
@filters.chunked(build_request=_construct_api_requests)                # inner
def _fetch_once(args: dict) -> tuple[DataFrame, Response]: ...
```
  • Planner (_plan_chunks): greedy halving of the largest chunk across all dims until the worst-case sub-request URL fits the byte limit. The cartesian product of the per-dim partitions becomes the sub-request set. O(log N) per dim (see the sketch after this list).
  • Coordination with filters.chunked: when planning, the URL probe substitutes the filter with its longest top-level OR-clause via _filter_aware_probe_args. This models the per-sub-request URL that filters.chunked will actually emit — the filter chunker can shrink the filter down to one clause per sub-request but no smaller (it bails if any clause exceeds the budget). Without this coordination, a long OR-filter combined with multi-value lists triggered premature RequestTooLarge exceptions even when the combined chunkers would have made things fit.
  • Hard cap (max_chunks, default 1000): the cartesian product won't exceed this. Note: this caps the count of chunks, not the count of HTTP requests, because each chunk may paginate (each pagination is a separate HTTP request that counts against the hourly quota).
  • Quota-aware abort (quota_safety_floor, default 50): after every sub-request the wrapper reads x-ratelimit-remaining. If it drops below the floor, the wrapper raises QuotaExhausted carrying the combined partial frame, the chunk offset, and the last-observed remaining, letting callers either salvage what they have or resume from the next chunk after the hourly window resets. This is critical because of the pagination silent-truncation bug fixed in #273; without quota awareness the chunker could otherwise drive the API into 429 territory mid-call and (pre-#273) silently lose pagination pages.
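
For orientation, here is a minimal sketch of the greedy-halving idea from the planner bullet above. It is not the actual _plan_chunks implementation: the helper names, the `url_length_of` probe callable, and the plain `RuntimeError`s are stand-ins for the real internals (which use the filter-aware probe and raise RequestTooLarge).

```python
from itertools import product
from math import ceil


def plan_chunks(list_params: dict[str, list[str]], url_length_of, url_limit: int,
                max_chunks: int = 1000) -> list[dict[str, list[str]]]:
    """Greedy-halving sketch: split the largest chunk until the worst-case URL fits."""
    # One partition list per dimension; start with everything in a single chunk.
    partitions = {k: [list(v)] for k, v in list_params.items()}

    def worst_case_url() -> int:
        # The longest URL pairs the largest remaining chunk from every dimension.
        worst = {k: max(chunks, key=len) for k, chunks in partitions.items()}
        return url_length_of(worst)

    while worst_case_url() > url_limit:
        # Halve the largest chunk across all dimensions.
        dim, chunks = max(partitions.items(),
                          key=lambda kv: max(len(c) for c in kv[1]))
        big = max(chunks, key=len)
        if len(big) == 1:
            raise RuntimeError("URL cannot be made to fit")  # RequestTooLarge in this PR
        chunks.remove(big)
        mid = ceil(len(big) / 2)
        chunks.extend([big[:mid], big[mid:]])

    # Cartesian product of per-dimension partitions = the sub-request set.
    keys = list(partitions)
    plans = [dict(zip(keys, combo)) for combo in product(*partitions.values())]
    if len(plans) > max_chunks:
        raise RuntimeError(f"{len(plans)} chunks exceeds max_chunks={max_chunks}")
    return plans
```

A call shaped like `plan_chunks({"monitoring_location_id": sites, "parameter_code": pcodes}, my_probe, url_limit=8000)` would return one dict of list-slices per sub-request; the real planner's probe additionally substitutes the filter with its longest OR-clause, as described in the coordination bullet.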

Failure modes (post-fix)

| Scenario | Result |
| --- | --- |
| URL fits without chunking | Decorator passes through; one request, no overhead beyond a single planning probe |
| Lists need splitting | Greedy halve → cartesian product of N sub-requests |
| Filter chunking also needed | Outer mv-loop × inner filter-loop; identical-row dedup at combine |
| Filter has clauses bigger than achievable budget | RequestTooLarge with explicit message |
| Cartesian product > max_chunks | RequestTooLarge with the actual count |
| x-ratelimit-remaining drops below floor mid-call | QuotaExhausted with .partial_frame, .completed_chunks, .total_chunks |
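
For illustration, a call site might handle the quota abort in the last row like this. This is a sketch only: the get_daily keyword arguments, the single-frame return shape, and the import paths are simplifying assumptions based on the file names mentioned in this PR, while the exception attributes are the ones listed under "Public API additions" below.

```python
from dataretrieval.waterdata.api import get_daily
from dataretrieval.waterdata.chunking import QuotaExhausted

site_ids = ["USGS-01646500", "USGS-01647000"]  # imagine ~300 of these

try:
    df = get_daily(monitoring_location_id=site_ids, parameter_code="00060")
except QuotaExhausted as exc:
    # Salvage whatever completed before the quota safety floor was hit ...
    df = exc.partial_frame
    print(f"{exc.completed_chunks}/{exc.total_chunks} chunks finished, "
          f"{exc.remaining} requests left in the hourly window")
    # ... and re-issue the remaining chunks after the hourly window resets.
```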

Coordination empirically verified

Validated against api.waterdata.usgs.gov (worktree contains the three live test scripts):

| Test | Setup | Result |
| --- | --- | --- |
| 3-dim equivalence | 4 sites × 3 pcodes × 3 stats, url_limit=220 | 16 sub-requests, every axis confirmed split, chunked frame exactly equal to the unchunked reference |
| Stress (6 edge cases) | Singletons, lopsided sizes, monitoring-locations POST path, RequestTooLarge floor, etc. | All pass |
| Filter/mv composition (3 regimes) | mv-only / filter-only / both, with previously-failing url_limit=220 | All pass; frames match references, and the "both" case would have raised RequestTooLarge without the filter-aware probe |

Test plan

  • Unit tests for _filter_aware_probe_args (8 cases)
  • Unit tests for _plan_chunks greedy halving, RequestTooLarge floor, coordination with filter chunker, max_chunks cap
  • Unit tests for multi_value_chunked pass-through, cartesian product shape, lazy URL-limit reads, max_chunks override
  • Unit tests for QuotaExhausted machinery (8 cases: header parsing, default floor, mid-call abort with partial frame, last-chunk-no-abort, zero-floor disable, message contents)
  • Defensive: filter and filter_lang are excluded from multi-value chunking (via _NEVER_CHUNK) so a caller passing filter as a list can't produce malformed CQL
  • Live: 3-dim equivalence, 6-case stress matrix, 3-regime filter/mv composition all green against production API
  • No regression: existing _construct_api_requests_* and _check_* tests still pass

30 new tests plus the existing suite; everything runs offline (using a deterministic fake build_request). The live scripts live in /tmp of the development worktree and are not part of the test suite.
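
As a flavor of that offline pattern (not an actual test from tests/waterdata_test.py), a deterministic fake URL-length probe lets the planner be exercised without any network; plan_chunks here refers to the simplified sketch shown after the Design section, not the real _plan_chunks.

```python
def fake_url_length(params: dict[str, list[str]]) -> int:
    # Deterministic stand-in for "length of the URL this request would produce".
    return sum(len(",".join(v)) + len(k) + 2 for k, v in params.items())


def test_greedy_halving_fits_budget():
    sites = [f"USGS-{i:08d}" for i in range(300)]
    plans = plan_chunks({"monitoring_location_id": sites},
                        fake_url_length, url_limit=2000)
    # Every planned sub-request fits the budget ...
    assert all(fake_url_length(p) <= 2000 for p in plans)
    # ... and no site is lost or duplicated across chunks.
    assert sorted(s for p in plans for s in p["monitoring_location_id"]) == sorted(sites)
```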

Merge ordering

If the team prefers to land in a different order, this PR can ship without #273, but the chunker should be considered production-ready only once #273 is merged.

Files changed

  • dataretrieval/waterdata/chunking.py (new) — decorator, planner, RequestTooLarge, QuotaExhausted
  • dataretrieval/waterdata/utils.py — wire @chunking.multi_value_chunked outside @filters.chunked on _fetch_once; updated docstring
  • tests/waterdata_test.py — 30 new tests covering planner, coordination, cap, quota machinery

Public API additions

  • dataretrieval.waterdata.chunking.RequestTooLarge — raised when planning can't make the URL fit
  • dataretrieval.waterdata.chunking.QuotaExhausted — raised mid-call when the quota safety floor is breached; carries .partial_frame, .partial_response, .completed_chunks, .total_chunks, .remaining
  • dataretrieval.waterdata.chunking.multi_value_chunked — decorator, in case downstream wraps a custom _fetch_once
  • Module constants _DEFAULT_MAX_CHUNKS = 1000, _DEFAULT_QUOTA_SAFETY_FLOOR = 50 (underscored — not part of the stable public API; available for monkey-patching by users with non-default quotas)
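
For a deployment with a non-default quota, a monkey-patch would look like the following sketch. These are the underscored module constants named above, so treat this as an at-your-own-risk illustration rather than stable API.

```python
from dataretrieval.waterdata import chunking

# Larger hourly quota: allow more sub-requests and keep a bigger safety margin.
chunking._DEFAULT_MAX_CHUNKS = 2000
chunking._DEFAULT_QUOTA_SAFETY_FLOOR = 100
```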

Known limitations (deferred)

  • The cap counts chunks, not HTTP requests. Paginated sub-requests each make multiple HTTP calls; a chunked call with 1000 chunks, each paginating 5×, makes 5000 HTTP requests. Documented workaround: lower max_chunks proportionally for paginating workloads, or rely on the QuotaExhausted abort to catch overruns at runtime.
  • No auto-retry / no sleep-until-reset on QuotaExhausted. Caller decides. Could add on_quota_exhausted="sleep" mode in a follow-up if desired.
  • Sequential sub-requests. Parallelism would reduce wall time but complicates rate-limit handling.

🤖 Generated with Claude Code

thodson-usgs and others added 5 commits May 14, 2026 08:24
The OGC API now supports comma-separated values for fields like
monitoring_location_id, parameter_code, and statistic_id, making POST+CQL2
unnecessary for most services. Update _construct_api_requests to join list
params with commas and use GET for daily, continuous, latest-daily,
latest-continuous, field-measurements, time-series-metadata, and
channel-measurements.

The monitoring-locations endpoint does not yet support comma-separated GET
parameters (returns 400); it retains the POST+CQL2 path. Closes DOI-USGS#210.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nit tests

Style alignment with the rest of `waterdata/utils.py`:

- Hoist `_cql2_required_services` from a function-local lowercase `set`
  to a module-level `_CQL2_REQUIRED_SERVICES = frozenset(...)` to match
  the convention of `_DATE_RANGE_PARAMS`, `_NO_NORMALIZE_PARAMS`,
  `_MONITORING_LOCATION_ID_RE`, etc.
- Drop the "Legacy path:" prefix in the inline comment. POST/CQL2 is
  still the current and required path for monitoring-locations — the
  API team hasn't promised to add comma-GET there. Rephrased the two
  branches symmetrically ("POST with CQL2 JSON" / "GET with comma-
  separated values") so neither reads as deprecated.

New unit tests:

- `test_construct_api_requests_single_value_stays_get` — confirms a
  scalar `monitoring_location_id="USGS-..."` still produces a clean GET
  with no `%2C`, i.e. existing single-site callers see no change.
- `test_construct_api_requests_numeric_list_joins_with_str` — pins down
  that `water_year=[2020, 2021]` reaches the URL as `water_year=2020%2C2021`,
  exercising the `str(x) for x in v` generator that exists specifically
  to handle non-string list params (without it, `",".join` on a list of
  ints would TypeError).
- `test_construct_api_requests_two_element_date_list_becomes_interval` —
  pins down the contract that a two-element date list (`time=["2024-01-01",
  "2024-01-31"]`) is interpreted as start/end of an OGC datetime interval
  (joined with `/`), NOT as two discrete dates. The OGC `datetime`
  parameter doesn't support "these N specific dates" — that would
  require a CQL filter. Test exists so this semantic choice can't be
  silently changed.
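
A minimal sketch of the joining contract those three tests pin down; join_param is an illustrative stand-in, not the real _construct_api_requests logic (URL-encoding of the comma and slash happens later, at URL construction).

```python
def join_param(key: str, value) -> str:
    if isinstance(value, (list, tuple)):
        if key == "time" and len(value) == 2:
            return "/".join(str(x) for x in value)   # OGC datetime interval
        return ",".join(str(x) for x in value)       # comma-separated multi-value
    return str(value)

assert join_param("water_year", [2020, 2021]) == "2020,2021"
assert join_param("time", ["2024-01-01", "2024-01-31"]) == "2024-01-01/2024-01-31"
```
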
Wraps _fetch_once with a cartesian-product chunker that sits OUTSIDE
@filters.chunked. Splits multi-value list params (monitoring_location_id,
parameter_code, statistic_id, etc.) across sub-requests so each URL fits
the server's ~8 KB byte limit.

Coordination with @filters.chunked: the planner's URL probe substitutes
the filter with its longest top-level OR-clause via _filter_aware_probe_args,
modeling the per-sub-request URL the inner filter chunker will actually
emit. Without this coordination, a long OR-filter plus multi-value lists
triggered premature RequestTooLarge even when the combined chunkers would
have made things fit.

Two safety guards:
- max_chunks=1000 cap on cartesian-product size (matches USGS API hourly
  quota; raises RequestTooLarge with the actual count when exceeded).
- QuotaExhausted abort: between sub-requests, reads x-ratelimit-remaining;
  if below quota_safety_floor (default 50), raises with the partial frame
  and chunk offset so callers can resume instead of crashing into a
  mid-call HTTP 429.

30 unit tests cover the planner, filter-aware coordination, the cap, and
the quota-aware abort. Live tests in /tmp verify a 3-dim equivalence case
(chunked == unchunked, 16 sub-requests, all axes split), 6 edge-case
stress scenarios, and 3 mv/filter composition regimes.

Depends on DOI-USGS#273 (paginated silent-truncation fix) — this PR multiplies
the frequency at which the silent-truncation bug class would have
surfaced. Merge order: DOI-USGS#273 -> DOI-USGS#233 -> this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The filter chunker (`filters.chunked`) splits a filter into chunks each
≤ the per-sub-request budget, but bails (returns the full filter
unchanged) when ANY single OR-clause exceeds the budget. So the smallest
filter size the inner chunker can guarantee to emit per sub-request is
bounded below by the LARGEST single clause, not the smallest.

The original implementation used `min(parts)` to model the chunker's
output floor. For filters with uniform clause sizes (all my prior
tests), min == max and the bug was hidden. For filters with lopsided
clauses — e.g. `id='1' OR id='abcdef…long-string'` — using `min` would
let the planner falsely declare a plan feasible. The inner chunker
would then bail on the large clause, the real per-sub-request URL
would carry the full filter, and the request would 414 server-side.

Switch to `max(parts, key=len)`. If singleton+max-clause fits the URL
limit, the inner chunker's budget is ≥ max(parts), so all clauses fit
individually and chunking succeeds. If singleton+max-clause doesn't
fit, the planner correctly raises `RequestTooLarge` instead of
producing an unservable plan.

Regression test: `test_plan_chunks_probes_with_max_clause_not_min`.
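
An illustrative snippet of the floor model described above (hypothetical variable names):

```python
# Lopsided top-level OR-clauses: min() and max() model very different floors.
parts = ["id='1'", "id='abcdefghijklmnopqrstuvwxyz0123456789'"]

probe_clause = max(parts, key=len)   # guaranteed lower bound on what the inner
                                     # filter chunker can emit per sub-request
# Using min(parts, key=len) here under-estimates the per-sub-request URL and
# lets the planner accept a plan the filter chunker cannot actually serve.
```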

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The motivating user story for this PR is the same one R `dataRetrieval`
covers in #870: pull a long monitoring-location list from one getter,
feed it to another. Before chunking this fails with HTTP 414 once the
URL grows past the server's ~8 KB limit; after it transparently fans out.

- chunking.py: prepend a docstring example showing the Ohio-stream-sites
  → daily-discharge chained call, so readers landing on the module file
  see the motivating scenario immediately.
- api.py get_daily: add the same chained example to the Examples block
  (where similar single-site and multi-site examples already live), so
  the most-used getter's docstring shows what just became possible.
- NEWS.md: user-visible entry framing the change in terms of "this now
  works" — chained queries, transparent chunking, max_chunks cap, and
  QuotaExhausted resume. References R PR #870 as the analogous change.

No code changes; pure docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>