Use WN GPU monitoring as primary source for GPU brokerage attribute checks #714
Merged
…hecks
Implement Fernando's suggestion (ATLASPANDA-1684) to use MV_WORKER_NODE_GPU_SUMMARY as the data source for GPU brokerage:
- CRIC is used only as a GPU capability gate (queue has 'gpu' in its architecture map)
- All attribute checks (vendor, model, vram, architecture, CUDA version) are performed against the WN GPU monitoring MV, which has richer per-host data than CRIC

New supported filters in host_gpu_spec:
- vram: minimum VRAM in MB
- architecture: GPU microarch generation (Ampere, Hopper, Ada Lovelace...)
- version: minimum CUDA version (existing, now sourced from WN map)

Adds get_worker_node_gpu_map() DB method and loads the map at broker init.
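The two-stage check described in this commit (CRIC as a capability gate, then attribute checks against the WN GPU monitoring data) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the dictionary layouts of `cric_architectures` and `wn_gpu_map`, and the rule that a queue passes when at least one of its monitored hosts satisfies the spec. It is not the broker code from this PR.

```python
# Hypothetical sketch; names and data layouts are assumptions, not PanDA code.

def version_tuple(text):
    """Turn a dotted version such as '12.4' into a tuple of ints: (12, 4)."""
    return tuple(int(part) for part in str(text).split("."))


def queue_passes_gpu_spec(queue_name, cric_architectures, wn_gpu_map, host_gpu_spec):
    """Return True if the queue is GPU-capable and one monitored host matches the spec."""
    # Stage 1: CRIC is only a capability gate.
    if "gpu" not in cric_architectures.get(queue_name, {}):
        return False

    # Normalise the architecture filter to a list (a single string is allowed).
    wanted_arch = host_gpu_spec.get("architecture")
    if isinstance(wanted_arch, str):
        wanted_arch = [wanted_arch]

    # Stage 2: attribute checks against the WN GPU monitoring rows.
    for row in wn_gpu_map.get(queue_name, []):
        if "vram" in host_gpu_spec and row.get("vram", 0) < host_gpu_spec["vram"]:
            continue  # not enough VRAM (MB)
        if wanted_arch and row.get("architecture") not in wanted_arch:
            continue  # wrong microarchitecture generation
        if "version" in host_gpu_spec:
            have = row.get("framework_version")
            if have is None or version_tuple(have) < version_tuple(host_gpu_spec["version"]):
                continue  # CUDA version missing or too old
        return True
    return False
```

A queue without `"gpu"` in its CRIC entry is rejected before any monitoring data is consulted, which mirrors the "skipped immediately" behaviour described in the PR summary.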
Add minimum NVIDIA kernel driver version (driver_version) as a new gpu_spec filter key, satisfying the ATLASPANDA-1684 requirement. Uses the existing compare_version_string utility against MV_WORKER_NODE_GPU_SUMMARY.driver_version.
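For readers unfamiliar with dotted-version comparison, here is a minimal stand-in for the kind of check a minimum driver version filter needs. The signature of the real `compare_version_string` utility is not shown in this PR, so `compare_dotted_versions` below is a hypothetical helper with an assumed argument order, and it assumes purely numeric, dot-separated versions.

```python
import operator

# Hypothetical stand-in; NOT the compare_version_string utility from the codebase.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def _as_tuple(version):
    """'575.57.08' -> (575, 57, 8); assumes numeric dot-separated parts."""
    return tuple(int(part) for part in str(version).split("."))


def compare_dotted_versions(have, op, want):
    """Compare two dotted versions numerically, padding the shorter one with zeros."""
    a, b = _as_tuple(have), _as_tuple(want)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return _OPS[op](a, b)
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where `"9.0" > "12.0"` holds lexicographically.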
…biguity with --architecture option
Extend get_host_gpu_spec() to parse colon-separated key<op>value attributes in the &vendor shorthand (cuda>=12.0, vram>=40960, uarch=Ampere, driver>=575.0, model=.*A100.*). Change vram filter to use compare_version_string so all operators (==, >=, <=, >, <, !=) are supported consistently with version/driver.
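The colon-separated `key<op>value` form can be illustrated with a small parser. This is a sketch under stated assumptions: the exact shorthand grammar is inferred from the examples in the commit message, the input string `&nvidia:...` and the returned dictionary layout are hypothetical, and `parse_gpu_shorthand` is not the actual `get_host_gpu_spec()` implementation.

```python
import re

# Two-character operators are listed first so that '>=' is not read as '>'.
_ATTR = re.compile(r"^(\w+)(>=|<=|==|!=|>|<|=)(.+)$")


def parse_gpu_shorthand(text):
    """Split '&vendor:key<op>value:...' into a vendor and a list of (key, op, value)."""
    vendor, *attrs = text.lstrip("&").split(":")
    spec = {"vendor": vendor, "filters": []}
    for attr in attrs:
        match = _ATTR.match(attr.strip())
        if match is None:
            raise ValueError(f"cannot parse GPU attribute: {attr!r}")
        spec["filters"].append(match.groups())
    return spec
```

Because the string is split on `:` before matching, a `model` regular expression that itself contains a colon would not survive this sketch; a real parser would need an escaping rule or a different delimiter strategy.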
Production tasks were missing the WN GPU monitoring data in the brokerage because wn_gpu_map was not loaded or passed. Mirrors the pattern already used in AtlasAnalJobBroker.
Summary
- CRIC is used only as a GPU capability gate: if a queue has `"gpu"` in its CRIC architecture map it is considered GPU-capable; otherwise it is skipped immediately
- All attribute checks (vendor, model, vram, architecture, CUDA version) are performed against `MV_WORKER_NODE_GPU_SUMMARY` (WN GPU monitoring), which has richer per-host data than CRIC
- Adds `get_worker_node_gpu_map()` DB method + TaskBuffer wrapper to load the MV at broker init
- New supported filters in `host_gpu_spec` (task `--architecture` JSON):
  - `vram`: minimum VRAM in MB (e.g. `40960`)
  - `architecture`: GPU microarch generation or list (e.g. `"Ampere"` or `["Ampere", "Hopper"]`)
  - `version`: minimum CUDA version (existing filter, now sourced from the WN map field `framework_version`)

Motivation
ATLASPANDA-1684 + Fernando's suggestion to use `MV_WORKER_NODE_GPU_SUMMARY` as a fallback/primary source for GPU brokerage. Previous approach relied entirely on CRIC, which often lacks model info, vram, and driver version for many GPU queues. WN monitoring data is populated directly by pilots and is more complete and up to date.

Requires the companion panda-database PR to add `ARCHITECTURE` and `DRIVER_VERSION` columns to the MV.

Test plan
- `vram` filter correctly rejects queues below the threshold
- `architecture` filter (e.g. `["Ampere", "Hopper"]`) routes to correct queues
- `version` filter uses `framework_version` from WN map