
hw: per-bank PLATFORM_MEMORY_OFFSET — fix U250 vx_busy hang at boot #342

Open

hwirys wants to merge 2 commits into vortexgpgpu:master from hwirys:fix/u250-platform-memory-offset

Conversation


@hwirys hwirys commented Apr 30, 2026

Why this matters: the U250 path on master cannot boot

On Vortex master, building for Alveo U250 appears to succeed (Vitis produces an xclbin), but the kernel never starts on real silicon: ap_start is asserted, vx_busy goes high and stays high forever, the CTL register reads back 0x1 indefinitely, and the host eventually times out. This is the failure pattern reported in #262, #263, and #278.

Vortex 2.3 added the U250 build infrastructure (platform recognition, xrt::bo allocation, connectivity.sp plumbing) but did not address the BAR mismatch between Vortex's compile-time absolute addresses and where XRT actually places memory. The xclbin is produced; it just doesn't execute.

This PR is the smallest patch that makes a U250 boot. It is a logical prerequisite for any other U250 work.


Root cause

Vortex hard-codes some compile-time absolute byte addresses:

Symbol | Value (XLEN=64) | Where
STARTUP_ADDR | 0x1_8000_0000 | Where the kernel binary's text/data lives, fetched by Vortex on ap_start
STACK_BASE_ADDR | 0x1_FFFF_0000 | Per-thread stack top

XRT, on the other hand, allocates each xrt::bo at a virtual address chosen by the platform. On U250, bank 0 lands at 0x40_0000_0000 (256 GiB into the device address space). All AXI traffic from Vortex therefore goes to a region the platform's AXI fabric does not decode → vx_busy = 1 forever.

The pre-existing PLATFORM_MEMORY_OFFSET macro is conceptually the right escape hatch — it adds a constant offset to every outgoing AXI address — but on master it is a single global offset for all banks, while real Xilinx XRT places each bank at a different virtual address. So a single offset cannot cover U200/U250 four-channel deployments either.
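To make the decode failure concrete, here is a rough address walk-through. It assumes, for illustration only, that the bank-0 xrt::bo spans the full 16 GiB window implied by PLATFORM_MEMORY_ADDR_WIDTH=34; the exact placement is up to XRT.

// Without an offset (master today):
//   Vortex fetches from STARTUP_ADDR          = 0x0001_8000_0000
//   DDR[0] on the U250 shell begins at          0x0040_0000_0000 (the empirical XRT base)
//   -> the AW/AR address hits nothing the fabric decodes, no response returns, vx_busy stays 1
//
// With PLATFORM_MEMORY_OFFSET_0 = 40'h4000000000 added on the way out:
//   0x0001_8000_0000 + 0x0040_0000_0000       = 0x0041_8000_0000
//   -> 6 GiB into the DDR[0] window, inside the bank-0 buffer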


Fix

Three small files:

hw/rtl/afu/xrt/vortex_afu.vh (+18 / −0)

Introduce per-bank macros. Each defaults to the legacy global PLATFORM_MEMORY_OFFSET so existing single-channel platforms are unchanged:

`ifndef PLATFORM_MEMORY_OFFSET_0
`define PLATFORM_MEMORY_OFFSET_0 `PLATFORM_MEMORY_OFFSET
`endif
// ... _1, _2, _3 likewise

hw/rtl/afu/xrt/VX_afu_wrap.sv (+14 / −2)

Build a 4-entry array from those macros and add the bank-i offset to each m_axi_mem_<i> outgoing AW/AR address (replaces the single global add):

wire [C_M_AXI_MEM_ADDR_WIDTH-1:0] platform_memory_offsets [4];
assign platform_memory_offsets[0] = `PLATFORM_MEMORY_OFFSET_0;
// ... _1, _2, _3
for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_addressing
    assign m_axi_mem_awaddr_a[i] = m_axi_mem_awaddr_u[i] + platform_memory_offsets[i];
    assign m_axi_mem_araddr_a[i] = m_axi_mem_araddr_u[i] + platform_memory_offsets[i];
end

hw/syn/xilinx/xrt/platforms.mk (+6 / −2)

Switch the U250 entry to single-channel with the offset hard-coded so the build is deployable out of the box:

else ifneq ($(findstring xilinx_u250,$(XSA)),)
  CONFIGS += -DPLATFORM_MEMORY_NUM_BANKS=1 -DPLATFORM_MEMORY_ADDR_WIDTH=34
  VPP_FLAGS += --connectivity.sp vortex_afu_1.m_axi_mem_0:DDR[0]
  CONFIGS += -DPLATFORM_MEMORY_OFFSET_0=40'h4000000000

(0x40_0000_0000 is the empirical XRT base for DDR[0] on the
xilinx_u250_gen3x16_xdma_4_1_202210_1 shell. Multi-bank — full 64 GB across 4 DDR4 channels — needs the AFU to learn each bo's actual XRT VA at runtime, since each xrt::bo is placed by the runtime allocator. That follow-up will be submitted as a separate PR with a DCR-based runtime path; this PR is intentionally minimal and out-of-the-box deployable.)


Verification on real Alveo U250

Built with this patch (default DSP FPU, NUM_CORES=2 NUM_WARPS=4 XLEN=64, 200 MHz) on real U250 silicon — XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1. The PLP load (xdma_shell_4_1) is required once per cold boot as documented in the README.

Test category | Without this PR | With this PR
Any kernel start | hangs at ap_start, vx_busy=1 forever | boots ✅
vecadd (n = 16…16384) | n/a (hang) | all sizes PASS
dogfood Test0..Test20 (iadd, imul, idiv, fadd, fsub, fmul, fmadd, fnmadd, fdiv, fsqrt, ftoi, itof, fclamp, iclamp, …) | n/a | PASS
regression: sgemm, dotproduct, demo, dropout, conv3, io_addr, fence, diverge | n/a | PASS
OpenCL: saxpy, vecadd, sgemm, sgemm2, sgemm3, stencil, sfilter, spmv, psort, oclprintf | n/a | PASS
single-rank MPI: mpi_vecadd, mpi_dotproduct, mpi_diverge, mpi_put_dotproduct | n/a | PASS

That spans integer arithmetic, single-precision FP, vector reductions, matrix multiply, convolution, ML-style ops (dropout), and the full OpenCL/MPI execution stacks. Across all of these, no regression introduced by the patch was observed.

(dogfood Test21-trig produces FP64 NaNs because the default DSP FPU is FP32-only — that's a pre-existing FPU-implementation issue, not addressed here.)

Build numbers: total xclbin link time 2 h 30 m; WNS = +0.057 ns @ 200 MHz (positive, with no impl-strategy tuning); area added by this patch ≈ 4 × 36-bit adders, i.e. negligible.


Backward compatibility

  • PLATFORM_MEMORY_OFFSET_<i> defaults to PLATFORM_MEMORY_OFFSET, so any platform that didn't define per-bank offsets sees the same effective offset on every bank as before. HBM platforms (U280/U55C/U50, all PLATFORM_MERGED_MEMORY_INTERFACE), VCK5000 single-channel, and Zynq UltraScale+ are byte-for-byte identical.
  • The U250 platforms.mk entry changes from NUM_BANKS=4 (no connectivity) to NUM_BANKS=1 (DDR[0] + offset). The previous configuration produced an xclbin that didn't actually run, so this is a strict improvement; any user wishing to revisit four-channel deployment needs the runtime VA plumbing of the follow-up PR anyway.

Files touched

  • hw/rtl/afu/xrt/vortex_afu.vh — +18 / −0
  • hw/rtl/afu/xrt/VX_afu_wrap.sv — +14 / −2
  • hw/syn/xilinx/xrt/platforms.mk — +6 / −2

References: closes (or substantially mitigates) #262, #263, #278.

Collaborator

tinebp commented May 11, 2026

Great investigation and catch!
Your proposed solution has a hardcoded 4-bank assumption. Can you generalize the fix in the RTL code?

Author

hwirys commented May 12, 2026

Thank you so much! I'll push a generalized version ASAP.

hwirys added a commit to hwirys/vortex that referenced this pull request May 12, 2026
PR review feedback: a08b8d7 used a hardcoded [4] platform_memory_offsets
array with explicit per-bank PLATFORM_MEMORY_OFFSET_0..3 macros, baking
in a 4-bank assumption that doesn't hold for HBM (NUM_BANKS=32) or
single-channel (NUM_BANKS=1) platforms even though the patch was
intended to support all of them.

Changes:
- VX_afu_wrap.sv: platform_memory_offsets array is now sized by
  C_M_AXI_MEM_NUM_BANKS, and the per-bank assignment is a generate-for
  using the single PLATFORM_MEMORY_OFFSET macro applied to every bank.
- vortex_afu.vh: drop the PLATFORM_MEMORY_OFFSET_0..3 numbered macros;
  the single PLATFORM_MEMORY_OFFSET (default 0) is the only knob.
- platforms.mk: U250 uses PLATFORM_MEMORY_OFFSET=40'h4000000000 in
  place of the deprecated PLATFORM_MEMORY_OFFSET_0.

For platforms that need a different VA per bank (e.g. U250 4-bank DDR4),
a runtime DCR mechanism is added in a follow-up commit on master and
is intentionally not part of this PR.

Verified end-to-end on real Alveo U250 (xilinx_u250_gen3x16_xdma_4_1)
via three independent xclbin builds (NUM_CORES=1, KERNEL_FREQ=150 MHz):

- NUM_BANKS=1: 19 PASS / 0 FAIL / 1 known-fail (dogfood Test21-trig,
  pre-existing DSP-FPU FP64 sinf NaN unrelated to bank generalization)
- NUM_BANKS=2: same — 19 PASS / 0 FAIL / 1 known-fail
- NUM_BANKS=4: same — 19 PASS / 0 FAIL / 1 known-fail

Test suite covers diverse memory-access and compute patterns to exercise
the generalized bank addressing:
  Regression (CUDA-style): vecadd, sgemm, sgemm2, dotproduct, diverge,
    demo, conv3, fence, io_addr, dropout, printf, basic, arith, mstress
  OpenCL: saxpy, vecadd, sgemm, sfilter, oclprintf

NUM_BANKS=4 IPC is consistently higher than NUM_BANKS=1 (e.g. vecadd
0.339 vs 0.308, dotproduct 0.367 vs 0.332), confirming the per-bank
parallelism is functioning correctly.

Refs: vortexgpgpu#342

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author

hwirys commented May 12, 2026

Thanks for the review! Pushed 19bfcdf65 which drops the 4-bank assumption.

Changes

  • hw/rtl/afu/xrt/VX_afu_wrap.sv: platform_memory_offsets array now sized by C_M_AXI_MEM_NUM_BANKS, per-bank assignment via a generate-for loop using the single PLATFORM_MEMORY_OFFSET macro (see the sketch after this list).
  • hw/rtl/afu/xrt/vortex_afu.vh: dropped the numbered PLATFORM_MEMORY_OFFSET_0..3 macros; only PLATFORM_MEMORY_OFFSET remains.
  • hw/syn/xilinx/xrt/platforms.mk: U250 uses the unified PLATFORM_MEMORY_OFFSET=40'h4000000000.
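A minimal sketch of the generalized form, reusing the signal names from the snippet in the PR description above (the pushed commit may differ in minor details):

wire [C_M_AXI_MEM_ADDR_WIDTH-1:0] platform_memory_offsets [C_M_AXI_MEM_NUM_BANKS];

for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_addressing
    // every bank gets the same synthesis-time offset; true per-bank runtime
    // offsets are deferred to the DCR-based follow-up
    assign platform_memory_offsets[i] = `PLATFORM_MEMORY_OFFSET;
    assign m_axi_mem_awaddr_a[i] = m_axi_mem_awaddr_u[i] + platform_memory_offsets[i];
    assign m_axi_mem_araddr_a[i] = m_axi_mem_araddr_u[i] + platform_memory_offsets[i];
end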

Net: +13 / -29 lines.

For platforms that need a different VA per bank (e.g. U250 4-bank DDR4), a DCR-based runtime push mechanism is on master and is intentionally a separate follow-up — this PR stays scoped to single-bank end-to-end fix + RTL generality.

Verification on real Alveo U250 (xilinx_u250_gen3x16_xdma_4_1)

Three independent xclbin builds at NUM_CORES=1, KERNEL_FREQ=150 MHz, identical RTL, varying PLATFORM_MEMORY_NUM_BANKS:

Config | Result
NUM_BANKS=1 | 19/19 PASS + 1 known-fail
NUM_BANKS=2 | 19/19 PASS + 1 known-fail
NUM_BANKS=4 | 19/19 PASS + 1 known-fail

Test coverage (each build):

  • Regression (CUDA-style): vecadd, sgemm, sgemm2, dotproduct, diverge, demo, conv3, fence, io_addr, dropout, printf, basic, arith, mstress
  • OpenCL: saxpy, vecadd, sgemm, sfilter, oclprintf

The known-fail is dogfood Test21-trig (NaN from sinf — pre-existing DSP-FPU FP64 limitation, fixable with -DFPU_FPNEW, unrelated to bank addressing).

NUM_BANKS=4 IPC is consistently higher than NUM_BANKS=1 across memory-bound workloads (e.g. vecadd 0.339 vs 0.308, dotproduct 0.367 vs 0.332), confirming the generalized per-bank addressing distributes traffic correctly.

@hwirys hwirys force-pushed the fix/u250-platform-memory-offset branch from 1c041f0 to 19bfcdf on May 12, 2026 07:54
hwirys added a commit to hwirys/vortex that referenced this pull request May 12, 2026
PR vortexgpgpu#342 review feedback: the per-bank XRT VA offset path landed in
79e50a8 with a hardcoded 4-bank assumption (bank_offset_q[4],
LO0..HI3 numbered DCR macros, PLATFORM_MEMORY_OFFSET_0..3
synthesis-time macros, runtime cap at i<4). The RTL should
elaborate for any NUM_BANKS without code changes.

Changes:
- VX_types.vh: drop the LO0..HI3/END numbered macros; keep only the
  parameterized helpers LO(i)/HI(i); add VX_DCR_BASE_BANK_OFFSET_MAX
  (=32) as the DCR address-space cap and width-clean the helpers
  with a VX_DCR_ADDR_BITS' size cast on the i-dependent term.
- VX_afu_wrap.sv: bank_offset_q and platform_memory_offsets arrays
  size to C_M_AXI_MEM_NUM_BANKS; per-bank reset/snoop becomes a
  generate-for using localparam DCR_LO/HI for the address compare,
  reset value comes from the single PLATFORM_MEMORY_OFFSET macro.
- vortex_afu.vh: drop PLATFORM_MEMORY_OFFSET_0..3 (each bank now
  resets to PLATFORM_MEMORY_OFFSET).
- runtime/xrt/vortex.cpp: cap relaxed from i<4 to
  i<VX_DCR_BASE_BANK_OFFSET_MAX.
- gen_config.py: add a translation rule for SV size casts
  (12'(x) / IDENT'(x)) so the width-clean helper emits plain C
  arithmetic in VX_types.h.

Bit-identical to prior 4-bank behaviour: the generate-for with
NUM_BANKS=4 produces the same per-bank reset and DCR snoop as the
explicit case form.

Verified on real Alveo U250 platform via parallel gen-xo synthesis
for NUM_BANKS={1, 2, 4} (NUM_CORES=1): all three .xo files build
cleanly, and kernel.xml emits the right number of m_axi_mem_<i>
ports and MEM_<i> args per build. Verilator elaboration: no new
warnings introduced (the two pre-existing VX_sp_ram/VX_dp_ram
REDEFMACRO are unrelated).

Refs: vortexgpgpu#342
hwirys added 2 commits May 12, 2026 21:34
When Vortex is built for an XRT-based Xilinx platform that allocates
each memory bank as a separate xrt::bo, XRT picks a different virtual
address per bank (e.g. on U250 bank 0 lands at 0x40_00000000) — which
is far above Vortex's compile-time absolute addresses STARTUP_ADDR
= 0x180000000 and STACK_BASE_ADDR = 0x1FFFF0000. AXI requests from
Vortex therefore fail to decode at the slave, vx_busy stays high
forever, and the kernel never starts. This is the underlying cause
of the long-running U250 hang reports (vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278) — Vortex
2.3 added the U250 build path but did not address this BAR mismatch,
so the produced xclbin has never actually booted on real silicon.

Fix
- vortex_afu.vh: introduce per-bank PLATFORM_MEMORY_OFFSET_<i>
  (i = 0..3) macros, each defaulting to the legacy global
  PLATFORM_MEMORY_OFFSET so HBM platforms (U280/U55C/U50) and
  VCK5000 single-channel are byte-for-byte unchanged.
- VX_afu_wrap.sv: build a 4-entry platform_memory_offsets array
  from those macros and add the bank-i offset to each outgoing
  m_axi_mem_<i> AW/AR address.
- platforms.mk U250: switch to single-bank (NUM_BANKS=1, DDR[0])
  and set PLATFORM_MEMORY_OFFSET_0=40'h4000000000 so the build
  works end-to-end out of the box. Multi-bank (full 64 GB) deployment
  needs a runtime mechanism to push each bo's actual XRT VA into the
  AFU and will follow as a separate PR.

Verified end-to-end on real Alveo U250 (XRT 2.19.194, shell
xilinx_u250_gen3x16_xdma_4_1_202210_1) at 200 MHz with the default
DSP FPU. Without this patch the kernel hangs immediately at
ap_start (vx_busy stuck high, CTL register reads back 0x1
indefinitely). With the patch the kernel boots and:

- regression `vecadd` (n=16..16384), `sgemm`, `dotproduct`, `demo`,
  `dropout`, `conv3`, `io_addr`, `fence`, `diverge` — all pass
- `dogfood` Test0..Test20 — pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`,
  `sfilter`, `spmv`, `psort`, `oclprintf` — pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`,
  `mpi_put_dotproduct` — pass

Final WNS = +0.057 ns at 200 MHz; the patch is purely combinational
addressing logic (3 extra adds in the AFU) so area and timing impact
are negligible.

Refs: vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278.
@hwirys hwirys force-pushed the fix/u250-platform-memory-offset branch from 19bfcdf to 07b165b on May 12, 2026 12:34