hw: per-bank PLATFORM_MEMORY_OFFSET — fix U250 vx_busy hang at boot #342
hwirys wants to merge 2 commits into
Conversation
Great investigation and catch!
Thank you so much! I'll update the general version ASAP.
PR review feedback: a08b8d7 used a hardcoded [4] platform_memory_offsets array with explicit per-bank PLATFORM_MEMORY_OFFSET_0..3 macros, baking in a 4-bank assumption that doesn't hold for HBM (NUM_BANKS=32) or single-channel (NUM_BANKS=1) platforms, even though the patch was intended to support all of them.

Changes:
- VX_afu_wrap.sv: the platform_memory_offsets array is now sized by C_M_AXI_MEM_NUM_BANKS, and the per-bank assignment is a generate-for using the single PLATFORM_MEMORY_OFFSET macro applied to every bank.
- vortex_afu.vh: drop the PLATFORM_MEMORY_OFFSET_0..3 numbered macros; the single PLATFORM_MEMORY_OFFSET (default 0) is the only knob.
- platforms.mk: U250 uses PLATFORM_MEMORY_OFFSET=40'h4000000000 in place of the deprecated PLATFORM_MEMORY_OFFSET_0.

For platforms that need a different VA per bank (e.g. U250 4-bank DDR4), a runtime DCR mechanism is added in a follow-up commit on master and is intentionally not part of this PR.

Verified end-to-end on real Alveo U250 (xilinx_u250_gen3x16_xdma_4_1) via three independent xclbin builds (NUM_CORES=1, KERNEL_FREQ=150 MHz):
- NUM_BANKS=1: 19 PASS / 0 FAIL / 1 known-fail (dogfood Test21-trig, a pre-existing DSP-FPU FP64 sinf NaN unrelated to bank generalization)
- NUM_BANKS=2: same — 19 PASS / 0 FAIL / 1 known-fail
- NUM_BANKS=4: same — 19 PASS / 0 FAIL / 1 known-fail

The test suite covers diverse memory-access and compute patterns to exercise the generalized bank addressing:
- Regression (CUDA-style): vecadd, sgemm, sgemm2, dotproduct, diverge, demo, conv3, fence, io_addr, dropout, printf, basic, arith, mstress
- OpenCL: saxpy, vecadd, sgemm, sfilter, oclprintf

NUM_BANKS=4 IPC is consistently higher than NUM_BANKS=1 (e.g. vecadd 0.339 vs 0.308, dotproduct 0.367 vs 0.332), confirming the per-bank parallelism is functioning correctly.

Refs: vortexgpgpu#342
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
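A minimal sketch of the generate-for described in the commit message, assuming the names it quotes (C_M_AXI_MEM_NUM_BANKS, platform_memory_offsets) and an illustrative 40-bit address width; the actual VX_afu_wrap.sv context may differ:

```systemverilog
// Illustrative only: array sized by the bank-count parameter instead of a
// hardcoded [4], with every bank getting the same single-knob offset macro.
localparam ADDR_WIDTH = 40; // illustrative; the real width comes from the AFU

wire [ADDR_WIDTH-1:0] platform_memory_offsets [C_M_AXI_MEM_NUM_BANKS];

for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_mem_offsets
    // Elaborates for NUM_BANKS = 1, 2, 4, or 32 (HBM) without code changes.
    assign platform_memory_offsets[i] = ADDR_WIDTH'(`PLATFORM_MEMORY_OFFSET);
end
```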
Thanks for the review! Pushed changes.
Net: +13 / −29 lines. For platforms that need a different VA per bank (e.g. U250 4-bank DDR4), a DCR-based runtime push mechanism is on master and is intentionally a separate follow-up — this PR stays scoped to the single-bank end-to-end fix plus RTL generality.

Verification on real Alveo U250 (xilinx_u250_gen3x16_xdma_4_1): three independent xclbin builds at NUM_CORES=1, KERNEL_FREQ=150 MHz, identical RTL, varying only NUM_BANKS.

Test coverage (each build): 19 PASS / 0 FAIL / 1 known-fail. The known-fail is dogfood Test21-trig, a pre-existing DSP-FPU FP64 sinf NaN unrelated to the bank generalization. NUM_BANKS=4 IPC is consistently higher than NUM_BANKS=1 across memory-bound workloads (e.g. vecadd 0.339 vs 0.308, dotproduct 0.367 vs 0.332).
force-pushed from 1c041f0 to 19bfcdf
PR vortexgpgpu#342 review feedback: the per-bank XRT VA offset path landed in 79e50a8 with a hardcoded 4-bank assumption (bank_offset_q[4], LO0..HI3 numbered DCR macros, PLATFORM_MEMORY_OFFSET_0..3 synthesis-time macros, runtime cap at i<4). The RTL should elaborate for any NUM_BANKS without code changes.

Changes:
- VX_types.vh: drop the LO0..HI3/END numbered macros; keep only the parameterized helpers LO(i)/HI(i); add VX_DCR_BASE_BANK_OFFSET_MAX (=32) as the DCR address-space cap and width-clean the helpers with a VX_DCR_ADDR_BITS' size cast on the i-dependent term.
- VX_afu_wrap.sv: the bank_offset_q and platform_memory_offsets arrays size to C_M_AXI_MEM_NUM_BANKS; per-bank reset/snoop becomes a generate-for using localparam DCR_LO/HI for the address compare; the reset value comes from the single PLATFORM_MEMORY_OFFSET macro.
- vortex_afu.vh: drop PLATFORM_MEMORY_OFFSET_0..3 (each bank now resets to PLATFORM_MEMORY_OFFSET).
- runtime/xrt/vortex.cpp: cap relaxed from i<4 to i<VX_DCR_BASE_BANK_OFFSET_MAX.
- gen_config.py: add a translation rule for SV size casts (12'(x) / IDENT'(x)) so the width-clean helper emits plain C arithmetic in VX_types.h.

Bit-identical to the prior 4-bank behaviour: the generate-for with NUM_BANKS=4 produces the same per-bank reset and DCR snoop as the explicit case form.

Verified on the real Alveo U250 platform via parallel gen-xo synthesis for NUM_BANKS={1, 2, 4} (NUM_CORES=1): all three .xo files build cleanly, and kernel.xml emits the right number of m_axi_mem_<i> ports and MEM_<i> args per build. Verilator elaboration: no new warnings introduced (the two pre-existing VX_sp_ram/VX_dp_ram REDEFMACRO are unrelated).

Refs: vortexgpgpu#342
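The per-bank reset/snoop generate-for might look roughly like this. This is a sketch: the helper macro names (VX_DCR_BASE_BANK_OFFSET_LO/HI) and the clk/reset/dcr_* signal names are assumptions inferred from the commit message, not quoted source:

```systemverilog
// Sketch: one DCR-snooping offset register per bank, elaborated for any
// bank count; NUM_BANKS=4 reproduces the old explicit case form exactly.
for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_bank_offset
    // Parameterized helpers replace the numbered LO0..HI3 macros.
    localparam DCR_LO = `VX_DCR_BASE_BANK_OFFSET_LO(i);
    localparam DCR_HI = `VX_DCR_BASE_BANK_OFFSET_HI(i);
    always @(posedge clk) begin
        if (reset) begin
            // Synthesis-time default; the host can override it at runtime.
            bank_offset_q[i] <= `PLATFORM_MEMORY_OFFSET;
        end else if (dcr_wr_valid && (dcr_wr_addr == DCR_LO)) begin
            bank_offset_q[i][31:0] <= dcr_wr_data[31:0];
        end else if (dcr_wr_valid && (dcr_wr_addr == DCR_HI)) begin
            bank_offset_q[i][ADDR_WIDTH-1:32] <= dcr_wr_data[ADDR_WIDTH-33:0];
        end
    end
end
```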
When Vortex is built for an XRT-based Xilinx platform that allocates each memory bank as a separate xrt::bo, XRT picks a different virtual address per bank (e.g. on U250, bank 0 lands at 0x40_00000000) — far above Vortex's compile-time absolute addresses STARTUP_ADDR = 0x180000000 and STACK_BASE_ADDR = 0x1FFFF0000. AXI requests from Vortex therefore fail to decode at the slave, vx_busy stays high forever, and the kernel never starts. This is the underlying cause of the long-running U250 hang reports (vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278) — Vortex 2.3 added the U250 build path but did not address this BAR mismatch, so the produced xclbin has never actually booted on real silicon.

Fix:
- vortex_afu.vh: introduce per-bank PLATFORM_MEMORY_OFFSET_<i> (i = 0..3) macros, each defaulting to the legacy global PLATFORM_MEMORY_OFFSET, so HBM platforms (U280/U55C/U50) and the VCK5000 single channel are byte-for-byte unchanged.
- VX_afu_wrap.sv: build a 4-entry platform_memory_offsets array from those macros and add the bank-i offset to each outgoing m_axi_mem_<i> AW/AR address.
- platforms.mk (U250): switch to single-bank (NUM_BANKS=1, DDR[0]) and set PLATFORM_MEMORY_OFFSET_0=40'h4000000000 so the build works end-to-end out of the box. Multi-bank (full 64 GB) deployment needs a runtime mechanism to push each bo's actual XRT VA into the AFU and will follow as a separate PR.

Verified end-to-end on real Alveo U250 (XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1) at 200 MHz with the default DSP FPU. Without this patch the kernel hangs immediately at ap_start (vx_busy stuck high, CTL register reads back 0x1 indefinitely).
With the patch the kernel boots and:
- regression `vecadd` (n=16..16384), `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge` — all pass
- `dogfood` Test0..Test20 — pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf` — pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct` — pass

Final WNS = +0.057 ns at 200 MHz; the patch is purely combinational addressing logic (3 extra adds in the AFU), so area and timing impact are negligible.

Refs: vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278.
force-pushed from 19bfcdf to 07b165b
Why this matters: the U250 path on master cannot boot
On Vortex master, building for Alveo U250 appears to succeed (Vitis produces an xclbin), but the kernel never starts on real silicon. `ap_start` is asserted, `vx_busy` goes high, and stays high forever. The `CTL` register reads back `0x1` indefinitely; the host eventually times out. This is the failure pattern reported in #262, #263, #278.

Vortex 2.3 added the U250 build infrastructure (platform recognition, `xrt::bo` allocation, `connectivity.sp` plumbing) but did not address the BAR mismatch between Vortex's compile-time absolute addresses and where XRT actually places memory. The xclbin is produced; it just doesn't execute. This PR is the smallest patch that makes a U250 boot. It is a logical prerequisite for any other U250 work.
Root cause
Vortex hard-codes some compile-time absolute byte addresses:

| Macro | Value | Role |
| --- | --- | --- |
| `STARTUP_ADDR` | `0x1_8000_0000` | where execution begins at `ap_start` |
| `STACK_BASE_ADDR` | `0x1_FFFF_0000` | base of the software stack |

XRT, on the other hand, allocates each `xrt::bo` at a virtual address chosen by the platform. On U250, bank 0 lands at `0x40_0000_0000` (256 GiB into the device address space). All AXI traffic from Vortex therefore goes to a region the platform's AXI fabric does not decode → `vx_busy = 1` forever.

The pre-existing `PLATFORM_MEMORY_OFFSET` macro is conceptually the right escape hatch — it adds a constant offset to every outgoing AXI address — but on master it is a single global offset for all banks, while real Xilinx XRT places each bank at a different virtual address. So a single offset cannot cover U200/U250 four-channel deployments either.

Fix
Three small files:
- `hw/rtl/afu/xrt/vortex_afu.vh` (+18 / −0): introduce per-bank macros. Each defaults to the legacy global `PLATFORM_MEMORY_OFFSET`, so existing single-channel platforms are unchanged.
- `hw/rtl/afu/xrt/VX_afu_wrap.sv` (+14 / −2): build a 4-entry array from those macros and add the bank-i offset to each `m_axi_mem_<i>` outgoing AW/AR address (replaces the single global add).
- `hw/syn/xilinx/xrt/platforms.mk` (+6 / −2): switch the U250 entry to single-channel with the offset hard-coded, so the build is deployable out of the box.
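The defaulting pattern described for `vortex_afu.vh` can be sketched as follows; this assumes standard `ifndef` guards, since the actual file text is not quoted in this thread:

```systemverilog
// Each per-bank macro falls back to the legacy global offset, so any
// platform that defines nothing per-bank behaves exactly as before.
`ifndef PLATFORM_MEMORY_OFFSET
`define PLATFORM_MEMORY_OFFSET 0
`endif

`ifndef PLATFORM_MEMORY_OFFSET_0
`define PLATFORM_MEMORY_OFFSET_0 `PLATFORM_MEMORY_OFFSET
`endif
`ifndef PLATFORM_MEMORY_OFFSET_1
`define PLATFORM_MEMORY_OFFSET_1 `PLATFORM_MEMORY_OFFSET
`endif
// ... likewise for PLATFORM_MEMORY_OFFSET_2 and _3.
```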
(`0x40_0000_0000` is the empirical XRT base for `DDR[0]` on the `xilinx_u250_gen3x16_xdma_4_1_202210_1` shell. Multi-bank — full 64 GB across 4 DDR4 channels — needs the AFU to learn each bo's actual XRT VA at runtime, since each `xrt::bo` is placed by the runtime allocator. That follow-up will be submitted as a separate PR with a DCR-based runtime path; this PR is intentionally minimal and out-of-the-box deployable.)

Verification on real Alveo U250
Built with this patch (default DSP FPU, NUM_CORES=2 NUM_WARPS=4 XLEN=64, 200 MHz) on real U250 silicon — XRT 2.19.194, shell `xilinx_u250_gen3x16_xdma_4_1_202210_1`. The PLP load (`xdma_shell_4_1`) is required once per cold boot as documented in the README.

- Without the patch: kernel hangs at `ap_start`, `vx_busy=1` forever.
- regression: `vecadd` (n = 16…16384), `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge` — pass
- `dogfood` Test0..Test20 (iadd, imul, idiv, fadd, fsub, fmul, fmadd, fnmadd, fdiv, fsqrt, ftoi, itof, fclamp, iclamp, …) — pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf` — pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct` — pass

That spans integer arithmetic, single-precision FP, vector reductions, matrix multiply, convolution, ML-style ops (dropout), and the full OpenCL/MPI execution stacks. Across all of those there is no observed regression introduced by the patch.
(`dogfood` Test21-trig produces FP64 NaNs because the default DSP FPU is FP32-only — that's a pre-existing FPU-implementation issue, not addressed here.)

Build numbers: total xclbin link 2 h 30 m, WNS = +0.057 ns @ 200 MHz (positive, no impl-strategy tuning); area added by this patch ≈ 4 × 36-bit adders. Negligible.
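The adders counted above are the per-bank address rebase on the outgoing AW/AR channels. A rough sketch of that combinational path, with signal names assumed rather than quoted from VX_afu_wrap.sv:

```systemverilog
// Sketch: rebase each outgoing AXI address into the bank's XRT-assigned
// region; everything else on the channel passes through unchanged.
assign m_axi_mem_awaddr[i] = vx_mem_awaddr[i] + platform_memory_offsets[i];
assign m_axi_mem_araddr[i] = vx_mem_araddr[i] + platform_memory_offsets[i];
```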
Backward compatibility
`PLATFORM_MEMORY_OFFSET_<i>` defaults to `PLATFORM_MEMORY_OFFSET`, so any platform that didn't define per-bank offsets sees the same effective offset on every bank as before. HBM platforms (U280/U55C/U50, all `PLATFORM_MERGED_MEMORY_INTERFACE`), VCK5000 single-channel, and Zynq UltraScale+ are byte-for-byte identical.

The U250 entry changes from `NUM_BANKS=4` (no connectivity) to `NUM_BANKS=1` (`DDR[0]` + offset). The previous configuration produced an xclbin that didn't actually run, so this is a strict improvement; any user wishing to revisit four-channel deployment needs the runtime VA plumbing of the follow-up PR anyway.

Files touched
- `hw/rtl/afu/xrt/vortex_afu.vh` — +18 / −0
- `hw/rtl/afu/xrt/VX_afu_wrap.sv` — +14 / −2
- `hw/syn/xilinx/xrt/platforms.mk` — +6 / −2

References: closes (or substantially mitigates) #262, #263, #278.