docs/reference/core_concepts/moe_configuration.md
Lines changed: 0 additions & 5 deletions
@@ -96,11 +96,6 @@ Dropping:
## 2. Sharding

`expert_shard_attention_option`: Determines how the "expert" axis is interpreted when sharding attention layers (see the sketch after the list). Options include:

- `fsdp`: Treats the expert axis as an FSDP axis.
- `context`: Treats the expert axis as a context parallelism axis, useful for long context.
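For intuition, here is a minimal JAX sketch of how the two options might translate into an attention activation sharding spec. The mesh axis names (`data`, `fsdp`, `tensor`, `expert`, `context`) and the `[batch, sequence, heads, head_dim]` activation layout are illustrative assumptions, not the library's actual logical-axis rules.

```python
# Illustrative only: axis names and activation layout are assumptions.
from jax.sharding import PartitionSpec as P

def attention_activation_spec(expert_shard_attention_option: str) -> P:
    """PartitionSpec for activations shaped [batch, sequence, heads, head_dim]."""
    if expert_shard_attention_option == "fsdp":
        # The expert axis joins the data/FSDP group, so it contributes to
        # sharding the batch dimension.
        return P(("data", "fsdp", "expert"), None, "tensor", None)
    if expert_shard_attention_option == "context":
        # The expert axis behaves like context parallelism, so it contributes
        # to sharding the sequence dimension (useful for long context).
        return P(("data", "fsdp"), ("context", "expert"), "tensor", None)
    raise ValueError(f"unknown expert_shard_attention_option: {expert_shard_attention_option!r}")
```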
`use_ring_of_experts` (experimental): This feature requires expert parallelism. If enabled, it replaces the standard two All-to-All communications with an All-Gather in dispatch and a Reduce-Scatter in collect. By gathering inputs across all shards, it allows for local routing and Top-K calculations, followed by result aggregation via Reduce-Scatter. This approach is particularly effective for models with a large Top-K, since activations are gathered before they would be replicated k times, which reduces communication volume.
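As a rough illustration of that dispatch/collect pattern, the sketch below assumes the layer runs under `shard_map` (or `pmap`) with a mesh axis named `expert`, and that `local_expert_fn` is a hypothetical callback applying only the experts hosted on the current shard; it is not the actual implementation.

```python
# Conceptual sketch: run under shard_map/pmap with a mesh axis named "expert".
import jax

def ring_of_experts_layer(x, router_logits, top_k, local_expert_fn):
    """x: [tokens_per_shard, d_model]; router_logits: [tokens_per_shard, n_experts].
    local_expert_fn is a hypothetical callback applying only this shard's experts."""
    # Gather every shard's tokens once, *before* they would be replicated
    # top_k times by a Top-K dispatch, so routing can be done locally.
    x_all = jax.lax.all_gather(x, "expert", tiled=True)
    logits_all = jax.lax.all_gather(router_logits, "expert", tiled=True)

    # Local routing and Top-K over the gathered tokens.
    gate_vals, expert_ids = jax.lax.top_k(logits_all, top_k)
    gate_vals = jax.nn.softmax(gate_vals, axis=-1)

    # Each shard evaluates only the experts it hosts; tokens assigned to
    # experts living on other shards contribute zeros here.
    y_partial = local_expert_fn(x_all, expert_ids, gate_vals)

    # Sum partial results across shards and return each shard its own tokens,
    # replacing the second All-to-All with a Reduce-Scatter.
    return jax.lax.psum_scatter(y_partial, "expert", tiled=True)
```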
`moe_fsdp_use_two_stage_all_gather`: If enabled, splits the All-Gather operation for MoE weights into two separate stages when using FSDP/FSDP-transpose sharding. This is preferred when 3D All-Gather support is unavailable.
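A minimal sketch of the two-stage gather, assuming the expert weights are sharded over mesh axes named `fsdp` and `fsdp_transpose` (names and layout are assumptions, not the actual code); the alternative is a single All-Gather over both axes at once, which is what requires 3D All-Gather support.

```python
# Conceptual sketch: axis names "fsdp" and "fsdp_transpose" are assumptions.
import jax

def two_stage_gather_moe_weights(w_shard):
    """Reassemble MoE expert weights with two smaller All-Gathers instead of
    one combined gather over both sharding axes."""
    # Stage 1: gather the weight shards along the FSDP axis.
    w = jax.lax.all_gather(w_shard, "fsdp", tiled=True)
    # Stage 2: gather the intermediate result along the FSDP-transpose axis.
    return jax.lax.all_gather(w, "fsdp_transpose", tiled=True)
```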