hw: per-bank PLATFORM_MEMORY_OFFSET — fix U250 vx_busy hang at boot #342
hwirys wants to merge 2 commits into
Conversation
Great investigation and catch!
Thank you so much! I'll update the general version ASAP.
PR review feedback: a08b8d7 used a hardcoded [4] platform_memory_offsets array with explicit per-bank PLATFORM_MEMORY_OFFSET_0..3 macros, baking in a 4-bank assumption that doesn't hold for HBM (NUM_BANKS=32) or single-channel (NUM_BANKS=1) platforms, even though the patch was intended to support all of them.

Changes:
- VX_afu_wrap.sv: the platform_memory_offsets array is now sized by C_M_AXI_MEM_NUM_BANKS, and the per-bank assignment is a generate-for using the single PLATFORM_MEMORY_OFFSET macro applied to every bank.
- vortex_afu.vh: drop the PLATFORM_MEMORY_OFFSET_0..3 numbered macros; the single PLATFORM_MEMORY_OFFSET (default 0) is the only knob.
- platforms.mk: U250 uses PLATFORM_MEMORY_OFFSET=40'h4000000000 in place of the deprecated PLATFORM_MEMORY_OFFSET_0.

For platforms that need a different VA per bank (e.g. U250 4-bank DDR4), a runtime DCR mechanism is added in a follow-up commit on master and is intentionally not part of this PR.

Verified end-to-end on real Alveo U250 (xilinx_u250_gen3x16_xdma_4_1) via three independent xclbin builds (NUM_CORES=1, KERNEL_FREQ=150 MHz):
- NUM_BANKS=1: 19 PASS / 0 FAIL / 1 known-fail (dogfood Test21-trig, a pre-existing DSP-FPU FP64 sinf NaN unrelated to bank generalization)
- NUM_BANKS=2: same — 19 PASS / 0 FAIL / 1 known-fail
- NUM_BANKS=4: same — 19 PASS / 0 FAIL / 1 known-fail

The test suite covers diverse memory-access and compute patterns to exercise the generalized bank addressing:
- Regression (CUDA-style): vecadd, sgemm, sgemm2, dotproduct, diverge, demo, conv3, fence, io_addr, dropout, printf, basic, arith, mstress
- OpenCL: saxpy, vecadd, sgemm, sfilter, oclprintf

NUM_BANKS=4 IPC is consistently higher than NUM_BANKS=1 (e.g. vecadd 0.339 vs 0.308, dotproduct 0.367 vs 0.332), confirming the per-bank parallelism is functioning correctly.

Refs: vortexgpgpu#342
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
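A minimal sketch of the generate-for described in the commit message, assuming the names it quotes (C_M_AXI_MEM_NUM_BANKS, platform_memory_offsets) and an illustrative 40-bit address width; the actual VX_afu_wrap.sv context may differ:

```systemverilog
// Illustrative only: array sized by the bank-count parameter instead of a
// hardcoded [4], with every bank getting the same single-knob offset macro.
localparam ADDR_WIDTH = 40; // illustrative; the real width comes from the AFU

wire [ADDR_WIDTH-1:0] platform_memory_offsets [C_M_AXI_MEM_NUM_BANKS];

for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_mem_offsets
    // Elaborates for NUM_BANKS = 1, 2, 4, or 32 (HBM) without code changes.
    assign platform_memory_offsets[i] = ADDR_WIDTH'(`PLATFORM_MEMORY_OFFSET);
end
```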
Thanks for the review! Pushed changes.
Net: +13 / −29 lines. For platforms that need a different VA per bank (e.g. U250 4-bank DDR4), a DCR-based runtime push mechanism is on master and is intentionally a separate follow-up — this PR stays scoped to the single-bank end-to-end fix plus RTL generality.

Verification on real Alveo U250 (xilinx_u250_gen3x16_xdma_4_1): three independent xclbin builds at NUM_CORES=1, KERNEL_FREQ=150 MHz, identical RTL, varying only NUM_BANKS.

Test coverage (each build): 19 PASS / 0 FAIL / 1 known-fail. The known-fail is dogfood Test21-trig, a pre-existing DSP-FPU FP64 sinf NaN unrelated to the bank generalization. NUM_BANKS=4 IPC is consistently higher than NUM_BANKS=1 across memory-bound workloads (e.g. vecadd 0.339 vs 0.308, dotproduct 0.367 vs 0.332).
force-pushed from 1c041f0 to 19bfcdf
PR vortexgpgpu#342 review feedback: the per-bank XRT VA offset path landed in 79e50a8 with a hardcoded 4-bank assumption (bank_offset_q[4], LO0..HI3 numbered DCR macros, PLATFORM_MEMORY_OFFSET_0..3 synthesis-time macros, runtime cap at i<4). The RTL should elaborate for any NUM_BANKS without code changes.

Changes:
- VX_types.vh: drop the LO0..HI3/END numbered macros; keep only the parameterized helpers LO(i)/HI(i); add VX_DCR_BASE_BANK_OFFSET_MAX (=32) as the DCR address-space cap and width-clean the helpers with a VX_DCR_ADDR_BITS' size cast on the i-dependent term.
- VX_afu_wrap.sv: the bank_offset_q and platform_memory_offsets arrays size to C_M_AXI_MEM_NUM_BANKS; per-bank reset/snoop becomes a generate-for using localparam DCR_LO/HI for the address compare; the reset value comes from the single PLATFORM_MEMORY_OFFSET macro.
- vortex_afu.vh: drop PLATFORM_MEMORY_OFFSET_0..3 (each bank now resets to PLATFORM_MEMORY_OFFSET).
- runtime/xrt/vortex.cpp: cap relaxed from i<4 to i<VX_DCR_BASE_BANK_OFFSET_MAX.
- gen_config.py: add a translation rule for SV size casts (12'(x) / IDENT'(x)) so the width-clean helper emits plain C arithmetic in VX_types.h.

Bit-identical to the prior 4-bank behaviour: the generate-for with NUM_BANKS=4 produces the same per-bank reset and DCR snoop as the explicit case form.

Verified on the real Alveo U250 platform via parallel gen-xo synthesis for NUM_BANKS={1, 2, 4} (NUM_CORES=1): all three .xo files build cleanly, and kernel.xml emits the right number of m_axi_mem_<i> ports and MEM_<i> args per build. Verilator elaboration: no new warnings introduced (the two pre-existing VX_sp_ram/VX_dp_ram REDEFMACRO are unrelated).

Refs: vortexgpgpu#342
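The per-bank reset/snoop generate-for might look roughly like this. This is a sketch: the helper macro names (VX_DCR_BASE_BANK_OFFSET_LO/HI) and the clk/reset/dcr_* signal names are assumptions inferred from the commit message, not quoted source:

```systemverilog
// Sketch: one DCR-snooping offset register per bank, elaborated for any
// bank count; NUM_BANKS=4 reproduces the old explicit case form exactly.
for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_bank_offset
    // Parameterized helpers replace the numbered LO0..HI3 macros.
    localparam DCR_LO = `VX_DCR_BASE_BANK_OFFSET_LO(i);
    localparam DCR_HI = `VX_DCR_BASE_BANK_OFFSET_HI(i);
    always @(posedge clk) begin
        if (reset) begin
            // Synthesis-time default; the host can override it at runtime.
            bank_offset_q[i] <= `PLATFORM_MEMORY_OFFSET;
        end else if (dcr_wr_valid && (dcr_wr_addr == DCR_LO)) begin
            bank_offset_q[i][31:0] <= dcr_wr_data[31:0];
        end else if (dcr_wr_valid && (dcr_wr_addr == DCR_HI)) begin
            bank_offset_q[i][ADDR_WIDTH-1:32] <= dcr_wr_data[ADDR_WIDTH-33:0];
        end
    end
end
```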
When Vortex is built for an XRT-based Xilinx platform that allocates each memory bank as a separate xrt::bo, XRT picks a different virtual address per bank (e.g. on U250, bank 0 lands at 0x40_00000000) — far above Vortex's compile-time absolute addresses STARTUP_ADDR = 0x180000000 and STACK_BASE_ADDR = 0x1FFFF0000. AXI requests from Vortex therefore fail to decode at the slave, vx_busy stays high forever, and the kernel never starts. This is the underlying cause of the long-running U250 hang reports (vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278) — Vortex 2.3 added the U250 build path but did not address this BAR mismatch, so the produced xclbin has never actually booted on real silicon.

Fix:
- vortex_afu.vh: introduce per-bank PLATFORM_MEMORY_OFFSET_<i> (i = 0..3) macros, each defaulting to the legacy global PLATFORM_MEMORY_OFFSET, so HBM platforms (U280/U55C/U50) and the VCK5000 single channel are byte-for-byte unchanged.
- VX_afu_wrap.sv: build a 4-entry platform_memory_offsets array from those macros and add the bank-i offset to each outgoing m_axi_mem_<i> AW/AR address.
- platforms.mk (U250): switch to single-bank (NUM_BANKS=1, DDR[0]) and set PLATFORM_MEMORY_OFFSET_0=40'h4000000000 so the build works end-to-end out of the box. Multi-bank (full 64 GB) deployment needs a runtime mechanism to push each bo's actual XRT VA into the AFU and will follow as a separate PR.

Verified end-to-end on real Alveo U250 (XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1) at 200 MHz with the default DSP FPU. Without this patch the kernel hangs immediately at ap_start (vx_busy stuck high, CTL register reads back 0x1 indefinitely).
With the patch the kernel boots and:
- regression `vecadd` (n=16..16384), `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge` — all pass
- `dogfood` Test0..Test20 — pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf` — pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct` — pass

Final WNS = +0.057 ns at 200 MHz; the patch is purely combinational addressing logic (3 extra adds in the AFU), so area and timing impact are negligible.

Refs: vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278.
force-pushed from 19bfcdf to 07b165b
Why this matters: the U250 path on master cannot boot
On Vortex master, building for Alveo U250 appears to succeed (Vitis produces an xclbin), but the kernel never starts on real silicon. `ap_start` is asserted, `vx_busy` goes high, and stays high forever. The `CTL` register reads back `0x1` indefinitely; the host eventually times out. This is the failure pattern reported in #262, #263, #278.

Vortex 2.3 added the U250 build infrastructure (platform recognition, `xrt::bo` allocation, `connectivity.sp` plumbing) but did not address the BAR mismatch between Vortex's compile-time absolute addresses and where XRT actually places memory. The xclbin is produced; it just doesn't execute. This PR is the smallest patch that makes a U250 boot. It is a logical prerequisite for any other U250 work.
Root cause
Vortex hard-codes some compile-time absolute byte addresses:

| Macro | Value | Role |
| --- | --- | --- |
| `STARTUP_ADDR` | `0x1_8000_0000` | where execution begins at `ap_start` |
| `STACK_BASE_ADDR` | `0x1_FFFF_0000` | base of the software stack |

XRT, on the other hand, allocates each `xrt::bo` at a virtual address chosen by the platform. On U250, bank 0 lands at `0x40_0000_0000` (256 GiB into the device address space). All AXI traffic from Vortex therefore goes to a region the platform's AXI fabric does not decode → `vx_busy = 1` forever.

The pre-existing `PLATFORM_MEMORY_OFFSET` macro is conceptually the right escape hatch — it adds a constant offset to every outgoing AXI address — but on master it is a single global offset for all banks, while real Xilinx XRT places each bank at a different virtual address. So a single offset cannot cover U200/U250 four-channel deployments either.

Fix
Three small files:
- `hw/rtl/afu/xrt/vortex_afu.vh` (+18 / −0): introduce per-bank macros. Each defaults to the legacy global `PLATFORM_MEMORY_OFFSET`, so existing single-channel platforms are unchanged.
- `hw/rtl/afu/xrt/VX_afu_wrap.sv` (+14 / −2): build a 4-entry array from those macros and add the bank-i offset to each `m_axi_mem_<i>` outgoing AW/AR address (replaces the single global add).
- `hw/syn/xilinx/xrt/platforms.mk` (+6 / −2): switch the U250 entry to single-channel with the offset hard-coded, so the build is deployable out of the box.
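The defaulting pattern described for `vortex_afu.vh` can be sketched as follows; this assumes standard `ifndef` guards, since the actual file text is not quoted in this thread:

```systemverilog
// Each per-bank macro falls back to the legacy global offset, so any
// platform that defines nothing per-bank behaves exactly as before.
`ifndef PLATFORM_MEMORY_OFFSET
`define PLATFORM_MEMORY_OFFSET 0
`endif

`ifndef PLATFORM_MEMORY_OFFSET_0
`define PLATFORM_MEMORY_OFFSET_0 `PLATFORM_MEMORY_OFFSET
`endif
`ifndef PLATFORM_MEMORY_OFFSET_1
`define PLATFORM_MEMORY_OFFSET_1 `PLATFORM_MEMORY_OFFSET
`endif
// ... likewise for PLATFORM_MEMORY_OFFSET_2 and _3.
```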
(`0x40_0000_0000` is the empirical XRT base for `DDR[0]` on the `xilinx_u250_gen3x16_xdma_4_1_202210_1` shell. Multi-bank — full 64 GB across 4 DDR4 channels — needs the AFU to learn each bo's actual XRT VA at runtime, since each `xrt::bo` is placed by the runtime allocator. That follow-up will be submitted as a separate PR with a DCR-based runtime path; this PR is intentionally minimal and out-of-the-box deployable.)

Verification on real Alveo U250
Built with this patch (default DSP FPU, NUM_CORES=2 NUM_WARPS=4 XLEN=64, 200 MHz) on real U250 silicon — XRT 2.19.194, shell `xilinx_u250_gen3x16_xdma_4_1_202210_1`. The PLP load (`xdma_shell_4_1`) is required once per cold boot as documented in the README.

- Without the patch: kernel hangs at `ap_start`, `vx_busy=1` forever.
- regression: `vecadd` (n = 16…16384), `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge` — pass
- `dogfood` Test0..Test20 (iadd, imul, idiv, fadd, fsub, fmul, fmadd, fnmadd, fdiv, fsqrt, ftoi, itof, fclamp, iclamp, …) — pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf` — pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct` — pass

That spans integer arithmetic, single-precision FP, vector reductions, matrix multiply, convolution, ML-style ops (dropout), and the full OpenCL/MPI execution stacks. Across all of those there is no observed regression introduced by the patch.
(`dogfood` Test21-trig produces FP64 NaNs because the default DSP FPU is FP32-only — that's a pre-existing FPU-implementation issue, not addressed here.)

Build numbers: total xclbin link 2 h 30 m, WNS = +0.057 ns @ 200 MHz (positive, no impl-strategy tuning); area added by this patch ≈ 4 × 36-bit adders. Negligible.
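The adders counted above are the per-bank address rebase on the outgoing AW/AR channels. A rough sketch of that combinational path, with signal names assumed rather than quoted from VX_afu_wrap.sv:

```systemverilog
// Sketch: rebase each outgoing AXI address into the bank's XRT-assigned
// region; everything else on the channel passes through unchanged.
assign m_axi_mem_awaddr[i] = vx_mem_awaddr[i] + platform_memory_offsets[i];
assign m_axi_mem_araddr[i] = vx_mem_araddr[i] + platform_memory_offsets[i];
```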
Backward compatibility
`PLATFORM_MEMORY_OFFSET_<i>` defaults to `PLATFORM_MEMORY_OFFSET`, so any platform that didn't define per-bank offsets sees the same effective offset on every bank as before. HBM platforms (U280/U55C/U50, all `PLATFORM_MERGED_MEMORY_INTERFACE`), VCK5000 single-channel, and Zynq UltraScale+ are byte-for-byte identical.

The U250 entry changes from `NUM_BANKS=4` (no connectivity) to `NUM_BANKS=1` (`DDR[0]` + offset). The previous configuration produced an xclbin that didn't actually run, so this is a strict improvement; any user wishing to revisit four-channel deployment needs the runtime VA plumbing of the follow-up PR anyway.

Files touched
- `hw/rtl/afu/xrt/vortex_afu.vh` — +18 / −0
- `hw/rtl/afu/xrt/VX_afu_wrap.sv` — +14 / −2
- `hw/syn/xilinx/xrt/platforms.mk` — +6 / −2

References: closes (or substantially mitigates) #262, #263, #278.