[Agents] Optimizations skill #13381

Open

asomoza wants to merge 2 commits into main from optimizations-skill

Conversation

@asomoza
Member

@asomoza asomoza commented Apr 1, 2026

What does this PR do?

Add a skill for all the optimizations that can be done with diffusers.

At the moment it has a lot of information, and even with that, the LLM (Opus 4.6) is still hit or miss with the optimizations, especially on Windows, which doesn't raise OOMs.

It does make the pipelines run, though, which is the purpose, just not with the best combination IMO.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul

Member

@sayakpaul sayakpaul left a comment


Thanks a lot!

My main comments are about reusing the existing documentation as much as we can and removing duplication, which might be causing confusion for the agent.

It'd then be nice to include some examples where this skill was useful so that we can build trust amongst users.

Comment on lines +9 to +15
| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |

Member

We should take this information from https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends.

Prefer using flash_hub instead of flash_attention_2, etc.

Should also include _flash_3_hub.

We could also just refer Claude to ./docs/source/en/optimization/attention_backends.md so that it always has the latest info. I think this would be better. WDYT?

Member Author

oh yeah, for sure, the attention backends section was mostly added by Claude itself, since it wasn't my priority for this PR and I also didn't test this part. It can for sure be a link to the docs.

Member Author

@asomoza asomoza Apr 14, 2026

oh wait, if we just install the diffusers wheel, we don't have the documentation, so where are you expecting the model to get the docs? Also, fetching online will prevent the skill from being used offline.

I also thought that this makes sense, but only if you have cloned the repo; it doesn't work for just a wheel install.

Member

Then we should have the users clone a diffusers copy, as that's better and far simpler than duplicating content.

So, if it's offline, prompt the users to clone a diffusers copy. If it's not, then a fetch operation should suffice.

Member

But for offline agents, how would they access the skill, as these are also not packaged, right?

Member Author

I now see what you meant. IMO this is more something that we should publish in this repo than here. But yeah, if we assume they will use the skill here, we can just link to the local .md docs.

Member

Yeah, for now I think it's valuable to keep the skills repo-specific. This way, they are more easily discoverable.

> But yeah, if we assume they will use the skill here, we can just link to the .md local docs.

That's a fair assumption, I guess? How is the skill accessed otherwise, then? Can we install it or something?

Member Author

yeah, skills can be installed; hf also has a CLI installer. You can install skills for a project, for your user, or for an enterprise. The project ones are the rarest and are usually for working on that project, not for using it.

Member

Maybe we can ship without having to install first, and based on the feedback we can iterate? I don't think things will change too much. WDYT?

Member Author

yeah, let's go with that

Comment on lines +19 to +40
```python
# Global default
from diffusers import set_attention_backend
set_attention_backend("flash_attention_2")

# Per-model
from diffusers.models.attention_processor import AttnProcessor2_0
pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```

## Debugging attention issues

- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool`, others require float masks with `-inf` for masked positions.
- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.
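To illustrate why routing through `dispatch_attention_fn` enables backend switching, here is a toy registry sketch (purely illustrative; the names and the 1-D attention are hypothetical, not diffusers' implementation):

```python
import math

# Hypothetical registry: backend name -> attention implementation.
_BACKENDS = {}
_ACTIVE = "math"

def register_backend(name):
    def wrap(fn):
        _BACKENDS[name] = fn
        return fn
    return wrap

def set_backend(name):
    global _ACTIVE
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    _ACTIVE = name

@register_backend("math")
def _math_attention(q, k, v):
    # Plain softmax attention over lists of floats (toy 1-D case).
    scores = [qi * ki / math.sqrt(len(q)) for qi, ki in zip(q, k)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(w / z * vi for w, vi in zip(exps, v))

def dispatch_attention_fn(q, k, v):
    # Models call this instead of a concrete kernel, so swapping the
    # active backend changes behavior everywhere at once.
    return _BACKENDS[_ACTIVE](q, k, v)

print(round(dispatch_attention_fn([1.0, 2.0], [1.0, 0.5], [10.0, 20.0]), 3))  # → 15.0
```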

Member

I think we won't need this if we link to the attention backends documentation?
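On the profiling tip from the excerpt above, a minimal sketch of checking which attention op actually ran (standard `torch.profiler` usage; on GPU you would add `ProfilerActivity.CUDA` and look for flash or memory-efficient kernel names):

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

q = k = v = torch.randn(1, 8, 128, 64)

# Record which operators run during SDPA; on CPU this shows the aten op,
# on GPU the kernel names reveal whether flash/mem-efficient was dispatched.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    out = F.scaled_dot_product_attention(q, k, v)

names = [evt.key for evt in prof.key_averages()]
print(any("scaled_dot_product" in name for name in names))
```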

@@ -0,0 +1,68 @@
# Layerwise Casting

## Overview

Member

Same I guess?

We could refer Claude to https://huggingface.co/docs/diffusers/main/en/optimization/memory? And briefly discuss a few?

Member Author

do you think layerwise will change? The static parts IMO should be in the skill to prevent fetching everything, no? If not, this will just be a simple skill that tells the LLMs to read the docs.

Also, the reason I did it separately is that at least Claude never suggests using it otherwise, especially because we don't have the "when to use" text in the docs; we could also add that to the docs as a solution.

Member

I mean layerwise is already in the docs https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting.

So, we could introduce Claude to that skill by referring to the docs and perhaps provide a simple intro. WDYT?
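As a quick intuition for what layerwise casting buys, here is a toy module that stores weights in float16 but computes in float32 (illustrative only; diffusers' actual `enable_layerwise_casting` hooks are more general and support fp8 storage):

```python
import torch
import torch.nn as nn

class CastLinear(nn.Module):
    """Stores weights in float16 to halve memory; computes in float32."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features).half(), requires_grad=False
        )

    def forward(self, x):
        # Upcast only this layer's weights for the matmul; the float32
        # copy is freed afterwards, so only float16 storage persists.
        return x @ self.weight.float().t()

layer = CastLinear(8, 4)
out = layer(torch.randn(2, 8))
print(out.dtype, layer.weight.dtype)
```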

@@ -0,0 +1,298 @@
# Memory Calculator

Member

How about using https://github.com/alvarobartt/hf-mem from our very own @alvarobartt?

Member Author

I'm fine with it, but this will add an external dependency/install, and it also doesn't take into account RAM usage and other factors like CUDA streams, etc. I'm fine either way. As a rule I always follow "if we can do it without AI, don't use AI", but in this case we also need to calculate other factors that the tool doesn't provide, so we can just do it all at the same time.

Member

Okay, then let's not use the tool. Do you think we should write our own util instead and have it under utils? That's better, I guess?

- **Edit/inpainting models**: `A` includes the reference image(s) in addition to the generation activations, so budget extra.
- When in doubt, estimate conservatively: `A ≈ 5-8 GB` for typical video workloads, `A ≈ 2-4 GB` for typical image workloads. For high-resolution or long video, increase accordingly.

## Step 2: Compute VRAM and RAM per strategy

Member

This should probably go to a separate offloading.md file?
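The activation-budget arithmetic from the excerpt above can be captured in a tiny estimator (a hypothetical helper, not part of the PR; the bytes-per-parameter figures are the standard ones for each dtype):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}

def estimate_vram_gb(num_params_billions, dtype, activation_gb):
    """Rough VRAM estimate in GB: weight storage plus activation budget.

    1e9 params at 1 byte each is ~1 GB, so billions * bytes-per-param
    gives the weight footprint directly in GB.
    """
    weights_gb = num_params_billions * BYTES_PER_PARAM[dtype]
    return weights_gb + activation_gb

# Example: a 12B-parameter transformer in bf16 with a 4 GB activation budget.
print(estimate_vram_gb(12, "bf16", 4.0))  # → 28.0
```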

@@ -0,0 +1,180 @@
# Quantization

Member

Same as above. A lot of it could be offloaded to https://huggingface.co/docs/diffusers/main/en/quantization/overview

Comment on lines +73 to +77
```python
quantization_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8_weight_only"},
    components_to_quantize=["transformer"],
)
```
Member

Let's switch to using something like:

```python
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128, version=2))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)
```

@@ -0,0 +1,213 @@
# Reduce Memory

Member

We already have thorough usage in https://huggingface.co/docs/diffusers/main/en/optimization/memory. If something needs to be updated, let's update it there. This way, we serve general users of the library as well as the AI agents.

@@ -0,0 +1,72 @@
# torch.compile

Member

Same. However, most of this content should be documented formally actually. Cc: @stevhliu could you help here?

Additionally, we could refer it to follow https://huggingface.co/docs/diffusers/main/en/optimization/speed-memory-optims for squeezing the last bit of performance.
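A minimal `torch.compile` sketch (shown with the `eager` debugging backend so it runs anywhere; in practice you would compile with the default inductor backend, e.g. `torch.compile(pipe.transformer)`, to get fused kernels and actual speedups):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.GELU(), nn.Linear(16, 16))

# backend="eager" skips codegen, keeping this sketch portable; outputs
# must match the uncompiled model either way.
compiled = torch.compile(model, backend="eager")

x = torch.randn(2, 16)
with torch.no_grad():
    eager_out = model(x)
    compiled_out = compiled(x)

print(torch.allclose(eager_out, compiled_out))
```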

@@ -0,0 +1,113 @@
---
name: optimizations

Member

Nice. It might be worth including some of the edge cases we have encountered in the past, so that it doesn't run into the same context rot?

@github-actions github-actions bot added the size/L PR with diff > 200 LOC label Apr 14, 2026
@sayakpaul
Member

Don't want to complicate it further but I don't think we sell our cool support for https://huggingface.co/docs/diffusers/main/en/optimization/cache enough.
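For intuition, the cache techniques in that doc (e.g. Pyramid Attention Broadcast, FasterCache) amount to reusing a block's output across nearby denoising steps instead of recomputing it every step; a toy sketch with hypothetical names:

```python
class BlockCache:
    """Recompute a block's output only every `interval` steps; reuse otherwise."""

    def __init__(self, interval=2):
        self.interval = interval
        self.cached = None
        self.computes = 0

    def __call__(self, step, compute_fn):
        # Real caches use a change heuristic; this toy uses a fixed interval.
        if self.cached is None or step % self.interval == 0:
            self.cached = compute_fn()
            self.computes += 1
        return self.cached

cache = BlockCache(interval=2)
outputs = [cache(step, lambda s=step: s * 10) for step in range(6)]
print(outputs)         # odd steps reuse the previous even step's output
print(cache.computes)  # → 3
```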

@asomoza
Member Author

asomoza commented Apr 14, 2026

thanks @sayakpaul, I think you're right. If we are going to use the skill here, we can mostly link all the info the model needs to the docs. I'll refactor it and run some tests again.

I was thinking more of an installable skill.

@sayakpaul
Member

Installable skills also sound good. In that case, we can always refer to the links and assume online access?
