Add multi-value GET-parameter chunker for waterdata OGC API#276
Draft
thodson-usgs wants to merge 5 commits into
Conversation
The OGC API now supports comma-separated values for fields like monitoring_location_id, parameter_code, and statistic_id, making POST+CQL2 unnecessary for most services. Update _construct_api_requests to join list params with commas and use GET for daily, continuous, latest-daily, latest-continuous, field-measurements, time-series-metadata, and channel-measurements. The monitoring-locations endpoint does not yet support comma-separated GET parameters (returns 400); it retains the POST+CQL2 path.

Closes DOI-USGS#210.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
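The comma-joining described above can be illustrated with a minimal sketch. The helper name `join_list_params` is hypothetical; the real logic lives inside `_construct_api_requests`:

```python
from urllib.parse import urlencode

def join_list_params(params):
    """Join list-valued query params with commas so they ride a plain GET.

    Hypothetical helper sketching the behavior described above; str()
    handles non-string entries such as integer water years.
    """
    joined = {}
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            joined[key] = ",".join(str(v) for v in value)
        else:
            joined[key] = value
    return joined

query = join_list_params({
    "monitoring_location_id": ["USGS-01491000", "USGS-01645000"],
    "parameter_code": "00060",
})
print(urlencode(query))
# monitoring_location_id=USGS-01491000%2CUSGS-01645000&parameter_code=00060
```

In the encoded URL the comma becomes `%2C`, which the API decodes back into a multi-value list server-side.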
…nit tests
Style alignment with the rest of `waterdata/utils.py`:
- Hoist `_cql2_required_services` from a function-local lowercase `set`
to a module-level `_CQL2_REQUIRED_SERVICES = frozenset(...)` to match
the convention of `_DATE_RANGE_PARAMS`, `_NO_NORMALIZE_PARAMS`,
`_MONITORING_LOCATION_ID_RE`, etc.
- Drop the "Legacy path:" prefix in the inline comment. POST/CQL2 is
still the current and required path for monitoring-locations — the
API team hasn't promised to add comma-GET there. Rephrased the two
branches symmetrically ("POST with CQL2 JSON" / "GET with comma-
separated values") so neither reads as deprecated.
New unit tests:
- `test_construct_api_requests_single_value_stays_get` — confirms a
scalar `monitoring_location_id="USGS-..."` still produces a clean GET
with no `%2C`, i.e. existing single-site callers see no change.
- `test_construct_api_requests_numeric_list_joins_with_str` — pins down
that `water_year=[2020, 2021]` reaches the URL as `water_year=2020%2C2021`,
exercising the `str(x) for x in v` generator that exists specifically
to handle non-string list params (without it, `",".join` on a list of
ints would TypeError).
- `test_construct_api_requests_two_element_date_list_becomes_interval` —
pins down the contract that a two-element date list (`time=["2024-01-01",
"2024-01-31"]`) is interpreted as start/end of an OGC datetime interval
(joined with `/`), NOT as two discrete dates. The OGC `datetime`
parameter doesn't support "these N specific dates" — that would
require a CQL filter. Test exists so this semantic choice can't be
silently changed.
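The interval semantics pinned down by that last test can be sketched in a few lines. The helper name `normalize_time_param` is hypothetical; it assumes only the contract stated above:

```python
def normalize_time_param(value):
    """Sketch of the datetime contract tested above (hypothetical helper):
    a two-element list is an OGC datetime interval (start/end), joined
    with '/', never two discrete dates."""
    if isinstance(value, (list, tuple)):
        if len(value) == 2:
            return "/".join(value)  # start/end of an OGC interval
        # "these N specific dates" is not expressible via the OGC
        # datetime parameter; that would require a CQL filter
        raise ValueError("datetime accepts a scalar or a 2-element interval")
    return value

print(normalize_time_param(["2024-01-01", "2024-01-31"]))
# 2024-01-01/2024-01-31
```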
Wraps _fetch_once with a cartesian-product chunker that sits OUTSIDE @filters.chunked. Splits multi-value list params (monitoring_location_id, parameter_code, statistic_id, etc.) across sub-requests so each URL fits the server's ~8 KB byte limit.

Coordination with @filters.chunked: the planner's URL probe substitutes the filter with its longest top-level OR-clause via _filter_aware_probe_args, modeling the per-sub-request URL the inner filter chunker will actually emit. Without this coordination, a long OR-filter plus multi-value lists triggered premature RequestTooLarge even when the combined chunkers would have made things fit.

Two safety guards:
- max_chunks=1000 cap on cartesian-product size (matches the USGS API hourly quota; raises RequestTooLarge with the actual count when exceeded).
- QuotaExhausted abort: between sub-requests, reads x-ratelimit-remaining; if below quota_safety_floor (default 50), raises with the partial frame and chunk offset so callers can resume instead of crashing into a mid-call HTTP 429.

30 unit tests cover the planner, filter-aware coordination, the cap, and the quota-aware abort. Live tests in /tmp verify a 3-dim equivalence case (chunked == unchunked, 16 sub-requests, all axes split), 6 edge-case stress scenarios, and 3 mv/filter composition regimes.

Depends on DOI-USGS#273 (paginated silent-truncation fix) — this PR multiplies the frequency at which the silent-truncation bug class would have surfaced. Merge order: DOI-USGS#273 -> DOI-USGS#233 -> this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
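A self-contained sketch of the cartesian-product planning described above. The names `split`, `plan_chunks`, and the caller-supplied `url_len_of` are assumptions for illustration; the real planner is `_plan_chunks` and measures actual request URLs:

```python
from itertools import product
from math import ceil

def split(values, n):
    """Partition values into up to n nearly-equal runs."""
    size = ceil(len(values) / n)
    return [values[i:i + size] for i in range(0, len(values), size)]

def plan_chunks(dims, url_len_of, url_limit, max_chunks=1000):
    """Greedy halving: double the partition count of whichever dimension
    currently contributes the largest chunk, until the worst-case
    sub-request URL fits the byte limit."""
    parts = {k: 1 for k in dims}

    def worst_case_len():
        # worst case: the first (largest) slice of every dim on one URL
        return url_len_of({k: split(v, parts[k])[0] for k, v in dims.items()})

    while worst_case_len() > url_limit:
        k = max(dims, key=lambda d: len(split(dims[d], parts[d])[0]))
        if parts[k] >= len(dims[k]):
            raise ValueError("RequestTooLarge: even singleton values overflow")
        parts[k] = min(parts[k] * 2, len(dims[k]))

    # cartesian product of per-dim partitions becomes the sub-request set
    plans = [dict(zip(dims, combo))
             for combo in product(*(split(v, parts[k]) for k, v in dims.items()))]
    if len(plans) > max_chunks:
        raise ValueError(f"RequestTooLarge: {len(plans)} chunks exceed {max_chunks}")
    return plans
```

Each returned plan maps every dimension to the sub-list its sub-request should carry; the decorator then issues one GET per plan and concatenates the resulting frames.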
The filter chunker (`filters.chunked`) splits a filter into chunks each ≤ the per-sub-request budget, but bails (returns the full filter unchanged) when ANY single OR-clause exceeds the budget. So the smallest filter size the inner chunker can guarantee to emit per sub-request is bounded below by the LARGEST single clause, not the smallest.

The original implementation used `min(parts)` to model the chunker's output floor. For filters with uniform clause sizes (all my prior tests), min == max and the bug was hidden. For filters with lopsided clauses — e.g. `id='1' OR id='abcdef…long-string'` — using `min` would let the planner falsely declare a plan feasible. The inner chunker would then bail on the large clause, the real per-sub-request URL would carry the full filter, and the request would 414 server-side.

Switch to `max(parts, key=len)`. If singleton+max-clause fits the URL limit, the inner chunker's budget is ≥ max(parts), so all clauses fit individually and chunking succeeds. If singleton+max-clause doesn't fit, the planner correctly raises `RequestTooLarge` instead of producing an unservable plan.

Regression test: `test_plan_chunks_probes_with_max_clause_not_min`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
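A minimal sketch of the min-vs-max distinction. The function name `feasible` is hypothetical; the real probe is `_filter_aware_probe_args` and operates on CQL text rather than bare strings:

```python
def feasible(base_len, or_clauses, url_limit, probe):
    """Planner feasibility check under a given probe strategy.

    The inner filter chunker can shrink a filter to one OR-clause per
    sub-request, but bails entirely if ANY clause alone exceeds its
    budget, so the planner must budget for the longest clause.
    """
    worst = probe(or_clauses, key=len)
    return base_len + len(worst) <= url_limit

# lopsided clauses: one tiny, one long
clauses = ["id='1'", "id='" + "x" * 60 + "'"]

print(feasible(40, clauses, 80, probe=min))  # True  (falsely feasible: the bug)
print(feasible(40, clauses, 80, probe=max))  # False (correctly infeasible)
```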
The motivating user story for this PR is the same one R `dataRetrieval` covers in #870: pull a long monitoring-location list from one getter, feed it to another. Before chunking this fails with HTTP 414 once the URL grows past the server's ~8 KB limit; after, it transparently fans out.

- chunking.py: prepend a docstring example showing the Ohio-stream-sites → daily-discharge chained call, so readers landing on the module file see the motivating scenario immediately.
- api.py get_daily: add the same chained example to the Examples block (where similar single-site and multi-site examples already live), so the most-used getter's docstring shows what just became possible.
- NEWS.md: user-visible entry framing the change in terms of "this now works" — chained queries, transparent chunking, the max_chunks cap, and QuotaExhausted resume. References R PR #870 as the analogous change.

No code changes; pure docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a multi-value GET-parameter chunker (`dataretrieval/waterdata/chunking.py`) that splits cartesian combinations of multi-value list params across sub-requests so each URL fits the server's ~8 KB byte limit. Sits OUTSIDE `@filters.chunked` on `_fetch_once`, so list-chunking is the outer loop and CQL-filter-chunking is the inner loop. The two chunkers coordinate via a filter-aware probe.

Depends on #273. That PR fixes the pre-existing silent-truncation bug in `_walk_pages` whose blast radius this PR materially expands. Merge order: #273 → #233 → this PR. See "Merge ordering" below.

Why
PR #233 routes multi-value waterdata queries through GET with comma-separated values (e.g. `monitoring_location_id=USGS-A,USGS-B,…`). For real-world workloads (~300+ sites or pcodes), the URL blows past the server's 8 KB nginx buffer and the API returns HTTP 414. This PR makes long list params transparently fan out.

Design
- Planner (`_plan_chunks`): greedy halving of the largest chunk across all dims until the worst-case sub-request URL fits the byte limit. Cartesian product of per-dim partitions becomes the sub-request set. O(log N) per dim.
- Coordination with `filters.chunked`: when planning, the URL probe substitutes the filter with its longest top-level OR-clause via `_filter_aware_probe_args`. This models the per-sub-request URL that `filters.chunked` will actually emit — the filter chunker can shrink the filter down to one clause per sub-request but no smaller (it bails if any clause exceeds the budget). Without this coordination, a long OR-filter combined with multi-value lists triggered premature `RequestTooLarge` exceptions even when the combined chunkers would have made things fit.
- Chunk cap (`max_chunks`, default 1000): the cartesian product won't exceed this. Note: this caps the count of chunks, not the count of HTTP requests, because each chunk may paginate (each pagination is a separate HTTP request that counts against the hourly quota).
- Quota safety floor (`quota_safety_floor`, default 50): after every sub-request the wrapper reads `x-ratelimit-remaining`. If it drops below the floor, the wrapper raises `QuotaExhausted` carrying the combined partial frame, the chunk offset, and the last-observed `remaining` — letting callers either salvage what they have or resume from the next chunk after the hourly window resets. This is critical because of the pagination silent-truncation bug fixed in #273 (Fix silent / misleading failures in paginated waterdata pagination); without quota awareness the chunker could otherwise drive the API into 429 territory mid-call and (pre-#273) silently lose pagination pages.

Failure modes (post-fix)
- Planning can't make the URL fit: `RequestTooLarge` with explicit message
- Cartesian product exceeds `max_chunks`: `RequestTooLarge` with the actual count
- `x-ratelimit-remaining` drops below floor mid-call: `QuotaExhausted` with `.partial_frame`, `.completed_chunks`, `.total_chunks`

Coordination empirically verified
Validated against `api.waterdata.usgs.gov` (worktree contains the three live test scripts):

- 3-dim equivalence case at a forced `url_limit=220` (chunked == unchunked)
- edge-case stress scenarios at `url_limit=220`
- mv/filter composition regimes, including a case that raised `RequestTooLarge` without the filter-aware probe

Test plan
- `_filter_aware_probe_args` (8 cases)
- `_plan_chunks`: greedy halving, `RequestTooLarge` floor, coordination with filter chunker, `max_chunks` cap
- `multi_value_chunked`: pass-through, cartesian product shape, lazy URL-limit reads, `max_chunks` override
- `QuotaExhausted` machinery (8 cases: header parsing, default floor, mid-call abort with partial frame, last-chunk-no-abort, zero-floor disable, message contents)
- `filter` and `filter_lang` excluded from `_NEVER_CHUNK` so a caller passing `filter` as a list can't produce malformed CQL
- existing `_construct_api_requests_*` and `_check_*` tests still pass
build_request). Live scripts in/tmpof the development worktree, not in the test suite.Merge ordering
#273 (Fix silent / misleading failures in paginated waterdata pagination) must land first: it fixes the silent-truncation bug in `_walk_pages`. This chunker amplifies the frequency of the latent silent-truncation bug — without #273, every paginated sub-request inside the cartesian product becomes a chance for silent data loss when the hourly quota is exhausted. The chunker's `QuotaExhausted` guard mitigates this somewhat by aborting before the 429, but it's belt-and-suspenders, not a substitute.

If the team prefers to land in a different order, this PR can ship without #273 — but the chunker should be considered "production-ready" only once #273 is merged.
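The `QuotaExhausted` guard can be sketched as follows. The names `run_chunks` and `fetch` are illustrative; the real wrapper lives in `chunking.py`, and the real `.partial_frame` is one concatenated DataFrame rather than the list used here:

```python
class QuotaExhausted(Exception):
    """Carries partial results so callers can salvage or resume."""
    def __init__(self, partial_frame, completed_chunks, total_chunks, remaining):
        super().__init__(
            f"quota floor hit after {completed_chunks}/{total_chunks} chunks "
            f"(x-ratelimit-remaining={remaining})"
        )
        # a list of per-chunk results here; one combined frame in the real code
        self.partial_frame = partial_frame
        self.completed_chunks = completed_chunks
        self.total_chunks = total_chunks
        self.remaining = remaining

def run_chunks(chunks, fetch, quota_safety_floor=50):
    """fetch(chunk) -> (frame, headers). Abort between sub-requests when
    rate-limit headroom gets thin, instead of running into a mid-call 429."""
    frames = []
    for i, chunk in enumerate(chunks):
        frame, headers = fetch(chunk)
        frames.append(frame)
        remaining = int(headers.get("x-ratelimit-remaining",
                                    quota_safety_floor + 1))
        # never abort after the final chunk: the work is already done
        if i < len(chunks) - 1 and remaining < quota_safety_floor:
            raise QuotaExhausted(frames, i + 1, len(chunks), remaining)
    return frames
```

Caller side this supports salvage-or-resume: catch `QuotaExhausted`, keep `exc.partial_frame`, and restart from chunk offset `exc.completed_chunks` after the hourly window resets.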
Files changed
- `dataretrieval/waterdata/chunking.py` (new) — decorator, planner, `RequestTooLarge`, `QuotaExhausted`
- `dataretrieval/waterdata/utils.py` — wire `@chunking.multi_value_chunked` outside `@filters.chunked` on `_fetch_once`; updated docstring
- `tests/waterdata_test.py` — 30 new tests covering planner, coordination, cap, quota machinery

Public API additions
- `dataretrieval.waterdata.chunking.RequestTooLarge` — raised when planning can't make the URL fit
- `dataretrieval.waterdata.chunking.QuotaExhausted` — raised mid-call when the quota safety floor is breached; carries `.partial_frame`, `.partial_response`, `.completed_chunks`, `.total_chunks`, `.remaining`
- `dataretrieval.waterdata.chunking.multi_value_chunked` — decorator, in case downstream wraps a custom `_fetch_once`
- `_DEFAULT_MAX_CHUNKS = 1000`, `_DEFAULT_QUOTA_SAFETY_FLOOR = 50` (underscored — not part of the stable public API; available for monkey-patching by users with non-default quotas)

Known limitations (deferred)
- The chunk cap counts chunks, not HTTP requests; lower `max_chunks` proportionally for paginating workloads, or rely on the `QuotaExhausted` abort to catch overruns at runtime.
- No automatic sleep-and-retry on `QuotaExhausted`. Caller decides. Could add an `on_quota_exhausted="sleep"` mode in a follow-up if desired.

🤖 Generated with Claude Code