docs/source/en/training/distributed_inference.md (64 additions & 24 deletions)
@@ -237,6 +237,9 @@ By selectively loading and unloading the models you need at a given stage and sh
Use [`~ModelMixin.set_attention_backend`] to switch to a more optimized attention backend. Refer to this [table](../optimization/attention_backends#available-backends) for a complete list of available backends.
+> [!NOTE]
+> Most attention backends are compatible with context parallelism. If one is not compatible with context parallelism, please [file a feature request](https://github.com/huggingface/diffusers/issues/new).
+
### Ring Attention
Key (K) and value (V) representations communicate between devices using [Ring Attention](https://huggingface.co/papers/2310.01889). This ensures each split sees every other token's K/V. Each GPU computes attention for its local K/V and passes it to the next GPU in the ring. No single GPU holds the full sequence, which reduces communication latency.
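The ring exchange described above can be illustrated with a small single-process simulation. This is only a sketch, not the diffusers implementation: `full_attention` and `ring_attention` are hypothetical helper names, the "devices" are plain Python lists, the usual `1/sqrt(d)` score scaling is omitted, and the ring hops run sequentially rather than in parallel. It assumes a running-max softmax accumulator so that each device can fold in one K/V block at a time.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax-weighted sum over ALL keys for one query vector."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def ring_attention(q_shards, k_shards, v_shards):
    """Each simulated device keeps its own queries and sees one K/V block per ring hop."""
    n = len(q_shards)
    dim = len(v_shards[0][0])
    outputs = []
    for rank in range(n):
        shard_out = []
        for q in q_shards[rank]:
            running_max, denom, acc = -math.inf, 0.0, [0.0] * dim
            for step in range(n):
                src = (rank - step) % n  # K/V block held after `step` hops around the ring
                for k, v in zip(k_shards[src], v_shards[src]):
                    score = dot(q, k)
                    new_max = max(running_max, score)
                    rescale = math.exp(running_max - new_max)  # 0.0 on the first key
                    p = math.exp(score - new_max)
                    denom = denom * rescale + p
                    acc = [a * rescale + p * x for a, x in zip(acc, v)]
                    running_max = new_max
            shard_out.append([a / denom for a in acc])
        outputs.append(shard_out)
    return outputs


# Toy data: 4 tokens with 2-dimensional Q/K/V, split across 2 simulated devices.
tokens_q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
tokens_k = [[0.5, 0.1], [-0.3, 0.2], [0.4, 0.4], [0.0, -0.6]]
tokens_v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]


def shard(xs):
    return [xs[:2], xs[2:]]


ring_out = ring_attention(shard(tokens_q), shard(tokens_k), shard(tokens_v))
reference = [full_attention(q, tokens_k, tokens_v) for q in tokens_q]
```

After all hops, every query has attended to every key even though each simulated device only ever holds one K/V block at a time, so `ring_out` (flattened across devices) should match `reference` up to floating-point error.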
@@ -245,40 +248,56 @@ Pass a [`ContextParallelConfig`] to the `parallel_config` argument of the transf
```py
import torch
-from diffusers import AutoModel, QwenImagePipeline, ContextParallelConfig
```
The script above needs to be run with a distributed launcher that is compatible with PyTorch. You can use `torchrun` for this: `torchrun --nproc-per-node 2 above_script.py`. `--nproc-per-node` depends on the number of GPUs available.
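As a rough illustration of what the launcher provides, `torchrun` starts one process per GPU and exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each of them. The helper below is a hypothetical sketch (the name `read_launcher_env` is not part of PyTorch or diffusers) showing how a script could read those variables, falling back to single-process defaults when it is run without a launcher.

```python
import os


def read_launcher_env(env=None):
    """Read the process layout exported by torchrun (hypothetical helper).

    Falls back to a single-process layout when the variables are absent.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),  # global process index
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index on this node, usually the GPU id
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# What the second of two processes would see under `--nproc-per-node 2`.
layout = read_launcher_env({"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"})
```

With `--nproc-per-node 2` on a single node, `world_size` is 2 and `local_rank` is typically used to pick the CUDA device for that process.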
+
### Ulysses Attention
[Ulysses Attention](https://huggingface.co/papers/2309.14509) splits a sequence across GPUs and performs an *all-to-all* communication (every device sends/receives data to every other device). Each GPU ends up with all tokens for only a subset of attention heads. Each GPU computes attention locally on all tokens for its head, then performs another all-to-all to regroup results by tokens for the next layer.
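The two all-to-all steps can be pictured as a re-sharding of the same data: first from "some tokens, all heads" to "all tokens, some heads", then back again. The sketch below simulates that layout change in a single process with nested lists. The function names are hypothetical, no real communication or attention happens, and it assumes the number of heads and the number of tokens are both evenly divisible by the number of devices.

```python
def all_to_all_seq_to_heads(seq_shards):
    """seq_shards[dev][local_token][head] -> head_shards[dev][token][local_head]."""
    n = len(seq_shards)
    num_heads = len(seq_shards[0][0])
    heads_per_dev = num_heads // n
    head_shards = []
    for dev in range(n):
        lo, hi = dev * heads_per_dev, (dev + 1) * heads_per_dev
        # Gather this device's head slice from every token on every source device.
        head_shards.append([tok[lo:hi] for src in seq_shards for tok in src])
    return head_shards


def all_to_all_heads_to_seq(head_shards):
    """Inverse re-shard: regroup by tokens so each device again holds all heads."""
    n = len(head_shards)
    tokens_per_dev = len(head_shards[0]) // n
    seq_shards = []
    for dev in range(n):
        rows = []
        for t in range(dev * tokens_per_dev, (dev + 1) * tokens_per_dev):
            # Concatenate the head slices for token t from every device.
            rows.append([h for src in head_shards for h in src[t]])
        seq_shards.append(rows)
    return seq_shards


# 4 tokens x 4 heads on 2 simulated devices; each entry is just a label.
seq_shards = [
    [[f"t{t}h{h}" for h in range(4)] for t in range(0, 2)],
    [[f"t{t}h{h}" for h in range(4)] for t in range(2, 4)],
]
head_shards = all_to_all_seq_to_heads(seq_shards)
round_trip = all_to_all_heads_to_seq(head_shards)
```

In this toy layout, device 0 ends up with heads 0 and 1 for all four tokens, which is what lets it compute attention for those heads without further communication; the second all-to-all restores the original token split.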
@@ -288,5 +307,26 @@ finally:
Pass the [`ContextParallelConfig`] to [`~ModelMixin.enable_parallelism`].