Use WN GPU monitoring as primary source for GPU brokerage attribute checks#714

Merged
EdwardKaravakis merged 5 commits into master from feat/gpu-brokerage-wn-map
May 18, 2026

@EdwardKaravakis
Member

Summary

  • CRIC is now used only as a GPU capability gate: if a queue has "gpu" in its CRIC architecture map it is considered GPU-capable; otherwise it is skipped immediately
  • All GPU attribute checks (vendor, model, vram, architecture, CUDA version) are performed against MV_WORKER_NODE_GPU_SUMMARY (WN GPU monitoring), which has richer per-host data than CRIC
  • New get_worker_node_gpu_map() DB method + TaskBuffer wrapper to load the MV at broker init
  • New supported filters in host_gpu_spec (task --architecture JSON):
    • vram: minimum VRAM in MB (e.g. 40960)
    • architecture: GPU microarch generation or list (e.g. "Ampere" or ["Ampere", "Hopper"])
    • version: minimum CUDA version — existing filter, now sourced from WN map framework_version

Motivation

Addresses ATLASPANDA-1684 and follows Fernando's suggestion to use MV_WORKER_NODE_GPU_SUMMARY as a fallback/primary source for GPU brokerage. The previous approach relied entirely on CRIC, which lacks model info, VRAM, and driver version for many GPU queues. WN monitoring data is populated directly by pilots and is more complete and up to date.

Requires the companion panda-database PR to add ARCHITECTURE and DRIVER_VERSION columns to the MV.

Test plan

  • Verify queues not marked GPU in CRIC are skipped without querying WN map
  • Verify model include/exclude pattern matching works against WN map models
  • Verify vram filter correctly rejects queues below the threshold
  • Verify architecture filter (e.g. ["Ampere", "Hopper"]) routes to correct queues
  • Verify CUDA version filter uses framework_version from WN map
  • Verify queues GPU-capable in CRIC but absent from WN map: wildcard task passes, specific-attribute task fails

…hecks

Implement Fernando's suggestion (ATLASPANDA-1684) to use
MV_WORKER_NODE_GPU_SUMMARY as the data source for GPU brokerage:

- CRIC is used only as a GPU capability gate (queue has 'gpu' in its
  architecture map)
- All attribute checks (vendor, model, vram, architecture, CUDA version)
  are performed against the WN GPU monitoring MV, which has richer
  per-host data than CRIC

New supported filters in host_gpu_spec:
- vram: minimum VRAM in MB
- architecture: GPU microarch generation (Ampere, Hopper, Ada Lovelace...)
- version: minimum CUDA version (existing, now sourced from WN map)

Adds get_worker_node_gpu_map() DB method and loads the map at broker init.

Add minimum NVIDIA kernel driver version (driver_version) as a new gpu_spec
filter key, satisfying the ATLASPANDA-1684 requirement. Uses the existing
compare_version_string utility against MV_WORKER_NODE_GPU_SUMMARY.driver_version.

Extend get_host_gpu_spec() to parse colon-separated key<op>value attributes
in the &vendor shorthand (cuda>=12.0, vram>=40960, uarch=Ampere, driver>=575.0,
model=.*A100.*). Change vram filter to use compare_version_string so all
operators (==, >=, <=, >, <, !=) are supported consistently with version/driver.
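As a rough illustration of the colon-separated key<op>value shorthand and operator-based comparison described in the commit messages above, here is a minimal sketch. It is not the implementation of get_host_gpu_spec() or compare_version_string, whose signatures are not shown in this PR; `parse_gpu_shorthand`, `compare_versions`, the returned dict layout, and the example string `&nvidia:...` are all hypothetical.

```python
import operator
import re

# Operators named in the commit message, plus the bare "=" used by uarch/model.
_OPS = {
    "==": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators are listed first so ">=" is not read as ">" then "=".
_ATTR_RE = re.compile(r"^(\w+)\s*(==|>=|<=|!=|>|<|=)\s*(.+)$")


def parse_gpu_shorthand(text):
    """Split '&vendor:key<op>value:...' into a vendor and (key, op, value) triples."""
    vendor, *attrs = text.lstrip("&").split(":")
    parsed = {"vendor": vendor, "filters": []}
    for attr in attrs:
        match = _ATTR_RE.match(attr)
        if match is None:
            raise ValueError(f"cannot parse GPU attribute: {attr!r}")
        parsed["filters"].append(match.groups())
    return parsed


def compare_versions(have, op, want):
    """Compare dotted numeric strings ('575.57.08' vs '575.0') with the given operator."""
    a = tuple(int(p) for p in re.findall(r"\d+", have))
    b = tuple(int(p) for p in re.findall(r"\d+", want))
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so '12' and '12.0' compare equal
    b += (0,) * (width - len(b))
    return _OPS[op](a, b)
```

Treating vram as a dotted numeric string is what lets one comparison routine serve vram, CUDA version, and driver version alike. Note that this sketch splits on every colon, so a model pattern containing `:` would need escaping or a different delimiter.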
EdwardKaravakis marked this pull request as ready for review May 11, 2026 13:49

Production tasks were missing the WN GPU monitoring data in the brokerage
because wn_gpu_map was not loaded or passed. Mirrors the pattern already
used in AtlasAnalJobBroker.

EdwardKaravakis merged commit 4c6a029 into master May 18, 2026
1 check passed