
Initial RDNA Windows bring-up for CK FMHA #86

Open

0xDELUXA wants to merge 2 commits into ROCm:develop from 0xDELUXA:rdna-win-bringup

Conversation


0xDELUXA commented Apr 27, 2026

#84 reopened in develop after test_whole_k_prefetch_n0loop was merged and deleted. See #83 (comment) for context.

What does this PR do?

Progress on #83.

Enables the CK-tile FMHA kernel family on RDNA3 and RDNA4 to be built and imported on Windows with TheRock ROCm. No changes to Linux/CDNA behavior are introduced.

Test coverage

What was tested: #84 (comment)

What was not tested:

  • RDNA3. Only RDNA4 (gfx1200) was tested. The code path for RDNA3 is symmetric but unvalidated.
  • Linux RDNA. Only Windows + TheRock was tested.
  • All existing CDNA build paths. Changes are either CDNA-irrelevant (Windows-gated, RDNA-gated) or source-compatible (hipcc discovery, ninja cache fix, and a torch-based arch fallback that only triggers when rocm_agent_enumerator is absent; see the sketch after this list).
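For illustration, a minimal sketch of the kind of torch-based arch fallback meant above; the function name and exact structure are illustrative, not the literal setup.py code:

import shutil
import subprocess

def detect_hip_archs():
    # Preferred path: ask rocm_agent_enumerator when it is on PATH.
    enumerator = shutil.which("rocm_agent_enumerator")
    if enumerator is not None:
        out = subprocess.check_output([enumerator], text=True)
        archs = sorted({line.strip() for line in out.splitlines()
                        if line.strip() and line.strip() != "gfx000"})
        if archs:
            return archs
    # Fallback: ask the torch runtime for the visible device's architecture.
    import torch
    if torch.cuda.is_available():
        # On ROCm builds of PyTorch, gcnArchName looks like "gfx1200:sramecc-:xnack-".
        arch = torch.cuda.get_device_properties(0).gcnArchName
        return [arch.split(":")[0]]
    raise RuntimeError("Could not determine a HIP architecture to build for")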

Reproducing the tested configuration

:: Inside a TheRock venv on Windows 11 with MSVC 2022 + gfx1200

call "...\VC\Auxiliary\Build\vcvars64.bat"
call <venv>\Scripts\activate.bat

for /f "delims=" %i in ('rocm-sdk path --root') do set ROCM_ROOT=%i
set ROCM_HOME=%ROCM_ROOT%
set ROCM_PATH=%ROCM_ROOT%
set HIP_PATH=%ROCM_ROOT%
set PATH=%ROCM_ROOT%\lib\llvm\bin;%ROCM_ROOT%\bin;%PATH%

set CC=clang-cl
set CXX=clang-cl
set DISTUTILS_USE_SDK=1

:: optional:
:: `set XFORMERS_DISABLE_BACKWARD=1` - forward-only, saves a lot of time
:: use XFORMERS_SELECTIVE_BUILD to compile only a single instance:
:: `set XFORMERS_SELECTIVE_BUILD=fmha_batched_infer_fp16_no_mask_no_bias_no_dropout_maxk_64`
:: `set HIP_ARCHITECTURE=gfx1200` - build for a specific architecture

python setup.py build_ext --inplace -j 12
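After a successful build, a quick smoke test confirms the extension imports and the attention op runs. A minimal sketch using xformers' public memory_efficient_attention API (exact supported shapes and dtypes on RDNA may vary; on ROCm builds of PyTorch, HIP devices are exposed through the torch.cuda namespace):

import torch
import xformers.ops as xops

# [batch, seqlen, heads, head_dim] layout expected by memory_efficient_attention.
q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = xops.memory_efficient_attention(q, k, v)
print(out.shape)  # expected: torch.Size([2, 128, 8, 64])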

Additional changes required for a successful RDNA build on Windows:

cc @schung-amd @qianfengz @sstamenk

Based on @schung-amd’s “any help is welcome” statement in #83 (comment), I opened this PR to demonstrate the changes needed to build xFormers CK on RDNA/Windows. Would appreciate any review or feedback.


jammm commented Apr 30, 2026

@schung-amd PTAL :)

jammm requested review from qianfengz and schung-amd on April 30, 2026

jammm commented Apr 30, 2026

@0xDELUXA thanks for the PR! I've requested reviews.
BTW, I noticed the testing you did was focused on performance. Can you do the same for numerical correctness as well? We want to make sure xformers doesn't produce noisy output given the new WMMA tile sizes introduced.

EDIT: Ah nvm, I just noticed the 15/15 correctness tests passing. Great work!


qianfengz left a comment


@jammm
How do you verify your implementation? I usually run the following tests to ensure the implementation is acceptable:

  • #> pytest tests/test_mem_eff_attention.py::test_forward
  • #> pytest tests/test_mem_eff_attention.py::test_backward
  • #> pytest tests/test_mem_eff_attention.py::test_dropout_ck

All test cases that are not skipped should pass.


jammm commented Apr 30, 2026

@jammm How do you verify your implementation?

This isn't my implementation, it's @0xDELUXA's

I usually run the following tests to ensure the implementation is acceptable:

* #> pytest tests/test_mem_eff_attention.py::test_forward

* #> pytest tests/test_mem_eff_attention.py::test_backward

* #> pytest tests/test_mem_eff_attention.py::test_dropout_ck

All test cases that are not skipped should pass.

I agree. @0xDELUXA have you run these tests and verified they pass?


0xDELUXA commented Apr 30, 2026

@jammm How do you verify your implementation?

This isn't my implementation, it's @0xDELUXA's

I usually run the following tests to ensure the implementation is acceptable:

* #> pytest tests/test_mem_eff_attention.py::test_forward

* #> pytest tests/test_mem_eff_attention.py::test_backward

* #> pytest tests/test_mem_eff_attention.py::test_dropout_ck

All test cases that are not skipped should pass.

I agree. @0xDELUXA have you run these tests and verified they pass?

I'll run these, but I'm currently AFK.

@0xDELUXA

pytest tests/test_mem_eff_attention.py::test_forward
=========== 1058 failed, 266 passed, 3572 skipped, 140 warnings in 243.93s (0:04:03) ===========

pytest tests/test_mem_eff_attention.py::test_backward
=========== 1518 failed, 842 passed, 1672 skipped, 87 warnings in 1436.15s (0:23:56) ===========

pytest tests/test_mem_eff_attention.py::test_dropout_ck
=========== 160 failed, 12 warnings in 31.97s ===========

The full output was n+1 lines; I can't realistically paste everything here.

In my opinion, these tests aren’t meant to be run on RDNA/Windows.

@qianfengz, are you sure everything is expected to pass?

@0xDELUXA

@jammm Thank you for the review comments.
I would really appreciate it if someone from AMD could assist with moving this PR forward, similar to @micmelesse’s involvement in ROCm/aiter#2433 and Dao-AILab/flash-attention#2385.
On my own PC, I couldn’t even build CK - it would’ve taken forever.
This PR serves as a demonstration of a potential approach to building xformers CK on RDNA/Windows. The changes may be incomplete or not entirely correct.

@qianfengz

pytest tests/test_mem_eff_attention.py::test_forward
=========== 1058 failed, 266 passed, 3572 skipped, 140 warnings in 243.93s (0:04:03) ===========

pytest tests/test_mem_eff_attention.py::test_backward
=========== 1518 failed, 842 passed, 1672 skipped, 87 warnings in 1436.15s (0:23:56) ===========

pytest tests/test_mem_eff_attention.py::test_dropout_ck
=========== 160 failed, 12 warnings in 31.97s ===========

The full output was n+1 lines; I can't realistically paste everything here.

In my opinion, these tests aren’t meant to be run on RDNA/Windows.

@qianfengz, are you sure everything is expected to pass?

Yes, on both gfx942 and gfx950, all three of these test runs have no failed cases.


0xDELUXA commented May 1, 2026

Yes, on both gfx942 and gfx950, all three of these test runs have no failed cases.

I see. However, I think that xformers CK cannot work exactly the same on RDNA as it does on CDNA. For example, I encountered several issues like 'not implemented on FA2, use FA3 instead', and I don’t think these are relevant to my PR.

Would you be able to review my complete test output and point out what this PR might be missing or doing wrong?


jammm commented May 1, 2026

Yes, on both gfx942 and gfx950, all three of these test runs have no failed cases.

I see. However, I think that xformers CK cannot work exactly the same on RDNA as it does on CDNA. For example, I encountered several issues like 'not implemented on FA2, use FA3 instead', and I don’t think these are relevant to my PR.

Would you be able to review my complete test output and point out what this PR might be missing or doing wrong?

Can you share the full test output?


0xDELUXA commented May 1, 2026

Can you share the full test output?

I’ll rerun the tests and save the output to text files.

However, there’s one thing I don’t quite understand: I have FA 2.8.4 (with the aiter triton backend) installed, but these tests fail right away, due to this.
In my opinion, it doesn’t really seem right to require a downgrade to FA 2.8.0 or earlier for xformers' FA2 backend to work.
I know this isn't intended, but I temporarily changed that line to FLASH_VER_LAST = (2, 8, 4) before running the test scripts to prevent the version error and see what would happen.
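For context, the gate in question is an upper-bound version check of roughly this shape (an illustrative sketch, not the actual xformers code; only the FLASH_VER_LAST name comes from the line I edited):

# Illustrative sketch of an upper-bound version gate.
FLASH_VER_LAST = (2, 8, 4)  # raised locally from the shipped upper bound for this experiment

def check_flash_version(flash_version: str) -> None:
    parsed = tuple(int(part) for part in flash_version.split(".")[:3])
    if parsed > FLASH_VER_LAST:
        raise ImportError(
            f"flash-attn {flash_version} is newer than the last supported "
            f"version {'.'.join(map(str, FLASH_VER_LAST))}"
        )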

I also have FA3 installed, after realizing it also works on RDNA, as noted in Dao-AILab/flash-attention#2178 (comment). I then contributed to enabling FP8 support for FA3 on RDNA4 in ROCm/aiter#2621.

All in all, some failures may be related to my local system setup.
I'll share the full output shortly - but my machine is somewhat slow, so it may take a moment (around half an hour to run all three tests).


0xDELUXA commented May 1, 2026

I'm not quite sure what happened, but after upgrading to PyTorch version 2.13.0a0+rocm7.13.0a20260429, I'm getting different results than before: https://gist.github.com/0xDELUXA/540caf952bf2c2bd01552beb3664238f.
Now every failure is FA-related.
One question is: should test_dropout_ck be entirely skipped on RDNA/Windows?

If needed, I can provide the full output (~4.5M+ characters).
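If skipping turns out to be the right call, a decorator along these lines would express it (a sketch; the _is_rdna helper is hypothetical, not an existing xformers test utility):

import sys
import pytest
import torch

def _is_rdna() -> bool:
    # Hypothetical helper: treat gfx11xx/gfx12xx devices as RDNA.
    if not torch.cuda.is_available():
        return False
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return arch.split(":")[0].startswith(("gfx11", "gfx12"))

@pytest.mark.skipif(
    sys.platform == "win32" and _is_rdna(),
    reason="CK dropout kernels not validated on RDNA/Windows yet",
)
def test_dropout_ck():
    ...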


jammm commented May 2, 2026

I've had gpt5.5 fix the test failures. Will push it to another branch on top of your changes soon.


jammm commented May 2, 2026

@0xDELUXA can you cherry-pick this commit? fb718c2, from this branch: https://github.com/ROCm/xformers/tree/users/jam/pr-86-rdna-fmha

@qianfengz we need this CK PR to be merged, then a CK bump, for all tests to pass: ROCm/rocm-libraries#7016

0xDELUXA force-pushed the rdna-win-bringup branch from 223c765 to f0f26e6 on May 2, 2026 at 11:11

0xDELUXA commented May 2, 2026

@0xDELUXA can you cherry-pick this commit? fb718c2, from this branch: https://github.com/ROCm/xformers/tree/users/jam/pr-86-rdna-fmha

Sure, I’ve cherry-picked it. Please take a look in case anything unexpected slipped in.

0xDELUXA force-pushed the rdna-win-bringup branch from f0f26e6 to 66881bc on May 2, 2026 at 11:52

0xDELUXA commented May 2, 2026

Rebased to latest develop.


jammm commented May 3, 2026

A couple of things are left before we can land this PR:

  1. [CK] Fix RDNA3/RDNA4 FMHA tile-load paths (rocm-libraries#7016) needs to be merged
  2. This/another PR needs to switch the CK submodule to the new CK develop branch after assistant-librarian syncs from rocm-libraries to https://github.com/ROCm/composable_kernel.
  3. All relevant CI passing


0xDELUXA commented May 3, 2026

A couple of things are left before we can land this PR:

  1. Fix gfx12 async buffer load fallback (rocm-libraries#7016) needs to be merged
  2. This/another PR needs to switch the CK submodule to the new CK develop branch after assistant-librarian syncs from rocm-libraries to https://github.com/ROCm/composable_kernel.
  3. All relevant CI passing

Sounds good! Thanks again for your involvement here, @jammm.

What do you think about this plan, @qianfengz, @schung-amd?
Enabling xformers CK/FMHA support on RDNA/Windows would be great, and it would also close #83.


qianfengz commented May 6, 2026

A couple of things are left before we can land this PR:

  1. Fix gfx12 async buffer load fallback (rocm-libraries#7016) needs to be merged
  2. This/another PR needs to switch the CK submodule to the new CK develop branch after assistant-librarian syncs from rocm-libraries to https://github.com/ROCm/composable_kernel.
  3. All relevant CI passing

Sounds good! Thanks again for your involvement here, @jammm.

What do you think about this plan, @qianfengz, @schung-amd? Enabling xformers CK/FMHA support on RDNA/Windows would be great, and it would also close #83.

For the async_load issue on gfx12, I can add a workaround on the xformers side, like the following patch:

diff --git a/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_batched_infer_dispatch.h b/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_batched_infer_dispatch.h
index ba600008..9e317489 100644
--- a/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_batched_infer_dispatch.h
+++ b/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_batched_infer_dispatch.h
@@ -34,6 +34,12 @@ struct batched_infer_mask_bias_dropout_dispatch {
   static constexpr bool kTrLoadAvailable = false;
 #endif

+#if defined(FMHA_BUILD_ON_GFX12)
+  static constexpr bool async_pipeline_not_used_by_gfx12 = true;
+#else
+  static constexpr bool async_pipeline_not_used_by_gfx12 = false;
+#endif
+
 #if defined(FMHA_BUILD_ON_GFX950)
   // seq_len runtime threshold for switching fmha_fwd_v3 and qr_async_tr_load
   // pipeline on gfx950.
@@ -173,7 +179,8 @@ struct batched_infer_mask_bias_dropout_dispatch {
     // buffer_load_dwordxx/buffer_store_dwordxx can handle oob access
     constexpr bool kPadSeqLenQ = false;

-    if (!(use_async_pipeline && enable_async_pipeline)) {
+    if (!(use_async_pipeline && enable_async_pipeline) ||
+        async_pipeline_not_used_by_gfx12) {
       using FmhaShape = typename std::conditional_t<
           kUseWholeKPrefetchPipeline,
           FmhaFwdWholeKPrefetchShape<MaxK, MTile>,
@@ -273,7 +280,8 @@ struct batched_infer_mask_bias_dropout_dispatch {
       const bool pad_seqlen_k = !(param.N % FmhaShape::kN0 == 0);

       BOOL_SWITCH(pad_seqlen_k, kPadSeqLenK, [&] {
-        if constexpr (MaxK <= 128 && MTile <= 128) {
+        if constexpr (
+            !async_pipeline_not_used_by_gfx12 && MaxK <= 128 && MTile <= 128) {
           using FmhaTraits = ck_tile::TileFmhaTraits<
               true, // kPadSeqLenQ,
               kPadSeqLenK,
diff --git a/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_grouped_infer_dispatch.h b/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_grouped_infer_dispatch.h
index 9cc2ce6b..876a8ac6 100644
--- a/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_grouped_infer_dispatch.h
+++ b/xformers/csrc/attention/hip_fmha/ck_tiled_fmha_grouped_infer_dispatch.h
@@ -34,6 +34,12 @@ struct grouped_infer_mask_bias_dropout_dispatch {
   static constexpr bool kTrLoadAvailable = false;
 #endif

+#if defined(FMHA_BUILD_ON_GFX12)
+  static constexpr bool async_pipeline_not_used_by_gfx12 = true;
+#else
+  static constexpr bool async_pipeline_not_used_by_gfx12 = false;
+#endif
+
 #if defined(FMHA_BUILD_ON_GFX950)
   // seq_len runtime threshold for switching fmha_fwd_v3 and qr_async_tr_load
   // pipeline on gfx950.
@@ -174,7 +180,8 @@ struct grouped_infer_mask_bias_dropout_dispatch {
         (!kHasBias && (param.K % 8 == 0) && (param.Kv % 8 == 0) &&
          (MaxK <= 128 && MTile <= 128));

-    if (!(use_async_pipeline && enable_async_pipeline)) {
+    if (!(use_async_pipeline && enable_async_pipeline) ||
+        async_pipeline_not_used_by_gfx12) {
       using FmhaShape = typename std::conditional_t<
           kUseWholeKPrefetchPipeline,
           FmhaFwdWholeKPrefetchShape<MaxK, MTile>,
@@ -262,7 +269,8 @@ struct grouped_infer_mask_bias_dropout_dispatch {
             }
           });
     } else {
-      if constexpr (MaxK <= 128 && MTile <= 128) {
+      if constexpr (
+          !async_pipeline_not_used_by_gfx12 && MaxK <= 128 && MTile <= 128) {
         using FmhaShape = typename FmhaFwdCommonShape<MaxK, MTile>::Type;

         using FmhaTraits = ck_tile::TileFmhaTraits<
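(For clarity on the intent: the compile-time flag makes gfx12 builds statically take the non-async branch, and the added !async_pipeline_not_used_by_gfx12 term in the if constexpr guards keeps the async-pipeline templates from being instantiated at all on gfx12, rather than only skipping them at runtime.)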


qianfengz commented May 6, 2026

Also, supporting gfx11 should be more challenging due to the issue addressed by PermuteWarpGemmCToA, since not all pipelines take this issue into account.


jammm commented May 6, 2026

@0xDELUXA can you include these new commits as well? 0xDELUXA/xformers_win-rocm@rdna-win-bringup...ROCm:xformers:users/jam/pr-86-rdna-fmha

Thanks to @brockhargreaves-amd - we can confirm it works fine on Windows with the above branch + the CK PR at ROCm/rocm-libraries#7016


0xDELUXA commented May 6, 2026

@jammm Could you please tell me the exact steps I should follow? I keep getting errors when I try to cherry-pick your commits from https://github.com/ROCm/xformers/tree/users/jam/pr-86-rdna-fmha.

It would probably be easier for me if you opened a PR against https://github.com/0xDELUXA/xformers_win-rocm/tree/rdna-win-bringup - then I could just merge it, and it would be included in this PR as well.


jammm commented May 6, 2026

@jammm Could you please tell me the exact steps I should follow? I keep getting errors when I try to cherry-pick your commits from https://github.com/ROCm/xformers/tree/users/jam/pr-86-rdna-fmha.

It would probably be easier for me if you opened a PR against https://github.com/0xDELUXA/xformers_win-rocm/tree/rdna-win-bringup - then I could just merge it, and it would be included in this PR as well.

Perhaps it'd be better if I make a new PR and we close this one? Your original commits will be retained in the new PR. Would that be okay with you?

Alternatively, you can simply check out my branch directly and force-push it to your branch.


0xDELUXA commented May 6, 2026

Perhaps it'd be better if I make a new PR and we close this one? Your original commits will be retained in the new PR. Would that be okay with you?

Sure!

I think the issue I ran into is that the submodule commits referenced in your branch aren't publicly available yet, since ROCm/rocm-libraries#7016 is still open.
