Conversation
sayakpaul
left a comment
Thanks a lot!
My main comments are about reusing the existing documentation as much as we can and removing duplication, which might be confusing the agent.
It'd then be nice to include some examples where this skill was useful so that we can build trust among users.
| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |
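The decision logic in the table can be sketched as a small helper. This is purely illustrative; `pick_attention_backend` and its capability flags are hypothetical, not a diffusers API:

```python
def pick_attention_backend(torch_version, has_flash_attn=False, has_xformers=False,
                           has_sageattention=False, ampere_or_newer=False,
                           quantized_inference=False):
    """Mirror the table above: prefer specialized kernels when their
    requirements are met, otherwise fall back to torch_sdpa."""
    if quantized_inference and has_sageattention:
        return "sage_attention"  # INT8 attention for inference speed
    if has_flash_attn and ampere_or_newer:
        return "flash_attention_2"  # best raw throughput on Ampere+ GPUs
    if has_xformers:
        return "xformers"  # memory-efficient attention on older GPUs
    if torch_version >= (2, 0):
        return "torch_sdpa"  # default; auto-selects the best available kernel
    raise RuntimeError("No supported attention backend available")
```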
We should take this information from https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends.
Prefer using flash_hub instead of flash_attention_2, etc.
Should also include _flash_3_hub.
We could also just refer Claude to ./docs/source/en/optimization/attention_backends.md so that it always has the latest info. I think this would be better. WDYT?
oh yeah, for sure, the attention backends section was mostly added by Claude itself, since it wasn't my priority for this PR and I also didn't test this part. It can definitely be a link to the docs.
oh wait, if we just install the diffusers wheel, we don't have the documentation, so where are you expecting the model to get the docs? Also, fetching online will prevent the skill from being used offline.
I also thought that this makes sense, but only if you have cloned the repo; it doesn't work for just a wheel install.
Then we should have users clone a diffusers copy, as that's better and far simpler than duplicating content.
So, if it's offline, prompt the user to clone a diffusers copy. If it's not, then a fetch operation should suffice.
But for offline agents, how would they access the skill, given that skills aren't packaged either?
I now see what you meant. IMO this is more something that we should publish in this repo than here. But yeah, if we assume they will use the skill here, we can just link to the .md local docs.
Yeah, for now I think it's valuable to keep the skills repo-specific. This way, they are more easily discoverable.

> But yeah, if we assume they will use the skill here, we can just link to the .md local docs.

That's a fair assumption I guess? How is the skill accessed otherwise then? Can we install it or something?
yeah, skills can be installed; hf also has a CLI installer. You can install skills for a project, for your user, or for your enterprise. The project-level ones are the rarest and are usually for working on that project, not for using it.
Maybe we can ship without having to install first and based on the feedback we can iterate? I don't think things will change too much. WDYT?
```python
# Global default
from diffusers import set_attention_backend
set_attention_backend("flash_attention_2")

# Per-model
pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```

## Debugging attention issues

- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool`, others require float masks with `-inf` for masked positions.
- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.
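To make the "memory spike" point concrete, here is a rough back-of-the-envelope comparison of attention memory. This is illustrative arithmetic only, not how the real kernels account memory:

```python
def naive_attention_score_bytes(seq_len, num_heads, batch=1, dtype_bytes=2):
    # The math-kernel SDPA path materializes the full
    # (batch, heads, seq, seq) attention score matrix.
    return batch * num_heads * seq_len * seq_len * dtype_bytes

def flash_attention_extra_bytes(seq_len, num_heads, batch=1, dtype_bytes=2):
    # FlashAttention only keeps O(seq) softmax statistics
    # (a running max and sum per query row).
    return batch * num_heads * seq_len * 2 * dtype_bytes

# At 16k tokens the score matrix dominates; at short lengths it is negligible.
long_naive = naive_attention_score_bytes(16_384, num_heads=24)  # exactly 12 GiB
long_flash = flash_attention_extra_bytes(16_384, num_heads=24)  # 1.5 MiB
```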
I think we won't need this if we link to the attention backends documentation?
@@ -0,0 +1,68 @@
# Layerwise Casting

## Overview
Same I guess?
We could refer Claude to https://huggingface.co/docs/diffusers/main/en/optimization/memory? And briefly discuss a few?
do you think layerwise will change? The static parts IMO should live in the skill to avoid fetching everything, no? If not, this will just be a simple skill that tells the LLMs to read the docs.
Also, the reason I did it separately is that Claude never suggests using it otherwise, especially because we don't have the "when to use" text in the docs; we could also add that to the docs as a solution.
I mean layerwise is already in the docs https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting.
So, we could introduce Claude to that skill by referring to the docs and perhaps provide a simple intro. WDYT?
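For intuition, the win from layerwise casting is easy to estimate: weights stored in fp8 take half the space of bf16, while each layer is upcast to the compute dtype only while it runs. Illustrative arithmetic with a hypothetical parameter count; see the memory docs for the actual API:

```python
def weight_storage_gb(num_params, bits_per_param):
    # Bytes occupied by the resident weights, in GiB.
    return num_params * bits_per_param / 8 / 1024**3

params = 12_000_000_000  # hypothetical ~12B-parameter video transformer

bf16_gb = weight_storage_gb(params, 16)  # ~22.35 GiB resident in bf16
fp8_gb = weight_storage_gb(params, 8)    # ~11.18 GiB with fp8 storage
```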
@@ -0,0 +1,298 @@
# Memory Calculator
How about using https://github.com/alvarobartt/hf-mem from our very own @alvarobartt?
I'm fine with it, but this would add an external dependency/install, and it also doesn't take into account RAM usage and other factors like CUDA streams. I'm fine either way. As a rule I always follow "if we can do it without AI, don't use AI", but in this case we also need to calculate other factors that the tool doesn't provide, so we can just do it all at the same time.
Okay, then let's not use the tool. Do you think we should do our own util instead and have it under utils? That is better, I guess?
- **Edit/inpainting models**: `A` includes the reference image(s) in addition to the generation activations, so budget extra.
- When in doubt, estimate conservatively: `A ≈ 5-8 GB` for typical video workloads, `A ≈ 2-4 GB` for typical image workloads. For high-resolution or long video, increase accordingly.

## Step 2: Compute VRAM and RAM per strategy
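A minimal sketch of what this step could compute. The per-strategy resident fractions below are placeholder assumptions for illustration, not measured values:

```python
def estimate_vram_gb(weights_gb, activations_gb, strategy="none"):
    # Fraction of weights resident on the GPU at peak, per offloading strategy.
    resident_weight_fraction = {
        "none": 1.0,                # everything stays on the GPU
        "model_offload": 0.5,       # roughly one component at a time (placeholder)
        "sequential_offload": 0.05, # only the executing layer (placeholder)
    }[strategy]
    return weights_gb * resident_weight_fraction + activations_gb

def estimate_ram_gb(weights_gb, strategy="none"):
    # Offloaded weights must live in CPU RAM instead.
    offloaded_fraction = {"none": 0.0, "model_offload": 0.5, "sequential_offload": 0.95}[strategy]
    return weights_gb * offloaded_fraction
```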
This should probably go to a separate offloading.md file?
@@ -0,0 +1,180 @@
# Quantization
Same as above. A lot of it could be offloaded to https://huggingface.co/docs/diffusers/main/en/quantization/overview
```python
quantization_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8_weight_only"},
    components_to_quantize=["transformer"],
)
```
Let's switch to using something like:

```python
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128, version=2))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```

@@ -0,0 +1,213 @@
# Reduce Memory
We already have thorough usage in https://huggingface.co/docs/diffusers/main/en/optimization/memory. If something needs to be updated, let's update it there. This way, we serve general users of the library as well as the AI agents.
@@ -0,0 +1,72 @@
# torch.compile
Same. However, most of this content should be documented formally actually. Cc: @stevhliu could you help here?
Additionally, we could refer it to follow https://huggingface.co/docs/diffusers/main/en/optimization/speed-memory-optims for squeezing the last bit of performance.
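As a side note for the compile skill, one pattern worth documenting is guarding compilation behind a flag so the same code path can run without a GPU. A sketch; `maybe_compile` and the `DIFFUSERS_ENABLE_COMPILE` env var are hypothetical names, while the `torch.compile(..., mode="max-autotune", fullgraph=True)` call matches the compile usage shown in the diffusers docs:

```python
import os

def maybe_compile(module, enabled=None):
    """Compile the module once, up front, so graph breaks surface early."""
    if enabled is None:
        # Hypothetical opt-out switch for CPU-only/offline test runs.
        enabled = os.getenv("DIFFUSERS_ENABLE_COMPILE", "1") == "1"
    if not enabled:
        return module
    import torch
    return torch.compile(module, mode="max-autotune", fullgraph=True)
```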
@@ -0,0 +1,113 @@
---
name: optimizations
Nice. It might be worth including some of the edge cases we have encountered in the past, so that it doesn't run into the same context rot?
Don't want to complicate it further, but I don't think we sell our cool support for https://huggingface.co/docs/diffusers/main/en/optimization/cache enough.
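For what it's worth, the core idea behind those cache techniques (e.g. first-block-style caching) fits in a few lines: reuse the previous step's downstream computation when an early block's output barely changes between denoising steps. Illustrative only, not the diffusers implementation:

```python
def should_reuse_cache(prev_block_out, curr_block_out, threshold=0.1):
    """Reuse cached downstream outputs when the relative change in an
    early block's output between denoising steps is below a threshold."""
    diff = sum((a - b) ** 2 for a, b in zip(curr_block_out, prev_block_out)) ** 0.5
    norm = sum(a ** 2 for a in prev_block_out) ** 0.5
    return (diff / norm) < threshold
```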
thanks @sayakpaul, I think you're right: if we are going to use the skill here, we can mostly link all the info the model needs to the docs. I'll refactor it and do some tests again. I was thinking more of an installable skill in my mind.
Installable skills also sound good. In that case, we can always refer to the links and assume online access? |
What does this PR do?
Add a skill for all the optimizations that can be done with diffusers.
At the moment it has a lot of information, and even with this the LLM (Opus 4.6) is still hit or miss with the optimizations, especially within windows that don't give OOMs.
It does make the pipelines run though, which is the purpose, just not with the best combination IMO.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul