Adding support for fused_moe_gmm#3627

Open
NicoGrande wants to merge 1 commit into main from nicogrande/fused-moe-gmm

Conversation

Collaborator

@NicoGrande NicoGrande commented Apr 9, 2026

Description

This PR adds support for the tpu-inference fused_moe_gmm kernel in the MaxText MoE inference codepath. Initial results using this kernel show up to a ~4x increase in generation throughput when testing with qwen3-30b-a3b.

Additionally, this PR introduces a second optimization to MaxText which pre-fuses the MoE weight kernels such that they can be efficiently passed into the fused_moe_gmm kernel. We show the impact of these optimizations below with autoregressive generation step times:

Baseline (MaxText sparse_matmul MoE): 28.353 ms

Fused MoE (prefuse_moe_weights=False): 20.432 ms

Fused MoE (prefuse_moe_weights=True): 6.114 ms
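
To illustrate what the pre-fusing optimization amounts to, here is a standalone NumPy sketch (not MaxText code; the layout, shapes, and the choice of stacking gate and up projections along the feature axis are assumptions for illustration):

```python
import numpy as np

num_experts, d_model, d_ff = 4, 8, 32
rng = np.random.default_rng(0)
w0 = rng.normal(size=(num_experts, d_model, d_ff))  # gate projection, per expert
w1 = rng.normal(size=(num_experts, d_model, d_ff))  # up projection, per expert

# Pre-fuse: stack the gate and up projections along the feature axis so the
# grouped matmul reads a single [E, D, 2*F] operand instead of two [E, D, F]
# operands. Done once at weight-load time, this removes per-step weight
# reshuffling from the autoregressive generation loop.
w_fused = np.concatenate([w0, w1], axis=-1)
assert w_fused.shape == (num_experts, d_model, 2 * d_ff)
```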

Tests

This PR adds new tests to tests/unit/moe_test.py. Additionally this PR was tested e2e with both qwen3-30b-a3b and qwen3-235b-a22b.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 45.07042% with 39 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/layers/moe.py | 23.33% | 20 Missing and 3 partials ⚠️ |
| src/maxtext/utils/model_creation_utils.py | 60.97% | 13 Missing and 3 partials ⚠️ |


@NicoGrande NicoGrande force-pushed the nicogrande/fused-moe-gmm branch 2 times, most recently from 4a22680 to 814348f Compare April 9, 2026 22:32
Collaborator

@RissyRan RissyRan left a comment


Overall LGTM. Could you add your tests to showcase the functionality? Thanks!

Comment thread tests/unit/fused_moe_test.py Outdated
from tests.utils.test_helpers import get_test_config_path, get_decoupled_parallelism_overrides


def make_moe(cfg, mesh):
Collaborator Author


Done :)

LMK what you think!

@NicoGrande NicoGrande force-pushed the nicogrande/fused-moe-gmm branch 4 times, most recently from e2a24f7 to b5d61be Compare April 10, 2026 17:56
['moe_mlp', ['model', 'attn_dp']],
['vocab', ['model', 'attn_dp']],
['heads', ['model']],
['heads', ['model', 'expert']],
Collaborator


I would add a comment here that expert is intended to act like TP for attention

Collaborator


Can even explicitly say we target two all reduces, one at end of attention out_proj, one at end of mlp

Collaborator Author


Done!

['activation_length_no_exp_moe', 'data'],
['activation_q_length', ['expert', 'attn_dp_expert']],
['activation_attn_embed', 'model'],
['activation_embed', ['model', 'attn_dp']],
Collaborator

@gobbleturk gobbleturk Apr 10, 2026


I would add a note here for activation_embed that expert is explicitly missing despite using TP, because we are going for a replicate-AR style of TP as opposed to our typical AG-RS style of TP — the replicate-AR is sort of forced by the output sharding of the vllm kernel
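
To make the replicate-AR vs. AG-RS distinction concrete, here is a standalone NumPy sketch (not MaxText code; shapes and the nonlinearity-free MLP are invented for illustration). In replicate-AR tensor parallelism, every device holds the full activations and computes a full-shape partial output; one all-reduce (sum) at the end recovers the result, rather than all-gathering inputs and reduce-scattering outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, shards = 8, 16, 4
x = rng.normal(size=(2, D))        # activations, replicated on every shard
w1 = rng.normal(size=(D, H))       # first MLP weight, split over columns
w2 = rng.normal(size=(H, D))       # second MLP weight, split over rows

# Reference: unsharded MLP (no nonlinearity, for simplicity).
ref = x @ w1 @ w2

# Replicate-AR style: each shard multiplies through its column/row slice,
# producing a full-shape [2, D] partial result.
chunk = H // shards
partials = [
    (x @ w1[:, i * chunk:(i + 1) * chunk]) @ w2[i * chunk:(i + 1) * chunk, :]
    for i in range(shards)
]

# The "AR" step: one all-reduce (sum) over shards recovers the answer.
out = np.sum(partials, axis=0)
assert np.allclose(out, ref)
```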

Collaborator Author


Done!

Comment thread src/maxtext/layers/moe.py
kernel_axes=self.kernel_axes,
use_bias=self.config.routed_bias,
score_func=self.config.routed_score_func,
score_func="" if self.config.attention == "vllm_rpa" else self.config.routed_score_func,
Collaborator


I would include a comment or maybe a link to short documentation/bug/vllm code that says that the vllm mega kernel we are calling does the score_func for us, so we don't want to apply it ourselves
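
The guard in the diff above can be restated as a small helper (a sketch, not MaxText code; the function name is hypothetical, and the rationale in the docstring is the reviewer's point that the vllm kernel applies the score function internally):

```python
def effective_score_func(attention: str, routed_score_func: str) -> str:
    """Return the score function MaxText itself should apply to router logits.

    When attention == "vllm_rpa", the fused vllm MoE kernel applies the
    routing score function (e.g. softmax) internally, so MaxText must pass
    an empty string to avoid transforming the router logits twice.
    """
    if attention == "vllm_rpa":
        return ""  # kernel handles scoring; disable MaxText-side scoring
    return routed_score_func

# e.g. effective_score_func('vllm_rpa', 'softmax') == ''
#      effective_score_func('flash', 'softmax') == 'softmax'
```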

Comment thread src/maxtext/layers/moe.py Outdated
w0_bias, w1_bias, wo_bias = None, None, None

if cfg.sparse_matmul:
# vllm_rpa uses fused_moe_func from tpu_inference (highest priority)
Collaborator


what does "highest priority" mean here?

Collaborator Author


Was just trying to say it will use this by default - updating the comment for clarity.

Comment thread tests/unit/moe_test.py
self.assertIsNone(lb_loss)
self.assertIsNone(bias_updates)

def test_fused_vs_sparse_softmax(self):
Collaborator


wow this is a great test!

Collaborator

@gobbleturk gobbleturk left a comment


Awesome!

@NicoGrande NicoGrande force-pushed the nicogrande/fused-moe-gmm branch from b5d61be to a6cfd99 Compare April 11, 2026 22:21
Comment thread src/maxtext/configs/inference/vllm.yml
@NicoGrande NicoGrande force-pushed the nicogrande/fused-moe-gmm branch 2 times, most recently from 14e5854 to 57d1a60 Compare April 14, 2026 00:27
@NicoGrande NicoGrande force-pushed the nicogrande/fused-moe-gmm branch from 57d1a60 to be95bcf Compare April 14, 2026 00:30
Comment thread src/maxtext/layers/moe.py
)

# Reshape output 2D [T, D] -> 3D [B, S, D]
output = jnp.reshape(output_2d, (batch_size, seq_len, emb_dim))
Collaborator


nit: I would add a sharding hint here for output, e.g.

output = nn.with_logical_constraint(output, ("activation_batch", "activation_length", "activation_embed"))

but it is optional.

Comment thread tests/unit/moe_test.py
# fused_moe_func requires num_tokens * topk % 16 == 0.
# B=1, S=16, topk=2 -> T*topk = 32, divisible by 16.
_B = 1
_S = 16
Collaborator


is it possible to move these two values inside FusedMoeTPUTest?
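
On the divisibility constraint quoted in the test above (num_tokens * topk % 16 == 0), here is a sketch of how a caller might pad the token count to satisfy it (helper name and the padding strategy are hypothetical, not part of this PR):

```python
def pad_tokens_for_fused_moe(num_tokens: int, topk: int, multiple: int = 16) -> int:
    """Return the smallest padded token count T' >= num_tokens such that
    T' * topk is divisible by `multiple`, matching the fused_moe_func
    requirement noted in the test."""
    t = num_tokens
    while (t * topk) % multiple != 0:
        t += 1
    return t

# With the test's values, B=1, S=16, topk=2: 16 * 2 = 32 is already
# divisible by 16, so no padding is needed.
```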

Collaborator

@NuojCheng NuojCheng left a comment


LGTM
