Conversation
sayakpaul
left a comment
Thanks a lot!
My main comments are about reusing the existing documentation as much as we can and removing duplication, which might be confusing the agent.
It'd then be nice to include some examples where this skill was useful so that we can build trust among users.
| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |
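The decision logic in the table can be sketched as a small helper. This is purely illustrative; `pick_attention_backend` and its capability flags are hypothetical, not a diffusers API:

```python
def pick_attention_backend(torch_version, has_flash_attn=False, has_xformers=False,
                           has_sageattention=False, ampere_or_newer=False,
                           quantized_inference=False):
    """Mirror the table above: prefer specialized kernels when their
    requirements are met, otherwise fall back to torch_sdpa."""
    if quantized_inference and has_sageattention:
        return "sage_attention"  # INT8 attention for inference speed
    if has_flash_attn and ampere_or_newer:
        return "flash_attention_2"  # best raw throughput on Ampere+ GPUs
    if has_xformers:
        return "xformers"  # memory-efficient attention on older GPUs
    if torch_version >= (2, 0):
        return "torch_sdpa"  # default; auto-selects the best available kernel
    raise RuntimeError("No supported attention backend available")
```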
We should take this information from https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends.
Prefer using flash_hub instead of flash_attention_2, etc.
Should also include _flash_3_hub.
We could also just refer Claude to ./docs/source/en/optimization/attention_backends.md so that it always has the latest info. I think this would be better. WDYT?
oh yeah, for sure, the attention backends section was mostly added by Claude itself, since it wasn't my priority for this PR and I also didn't test this part. It can definitely be a link to the docs.
oh wait, if we just install the diffusers wheel, we don't have the documentation, so where are you expecting the model to get the docs? Also, fetching online will prevent the skill from being used offline.
I also thought that this makes sense, but only if you have cloned the repo; it doesn't work for just a wheel install.
Then we should have users clone a diffusers copy, as that's better and far simpler than duplicating content.
So, if it's offline, prompt the user to clone a diffusers copy. If it's not, then a fetch operation should suffice.
But for offline agents, how would they access the skill, given that skills aren't packaged either?
I now see what you meant. IMO this is more something that we should publish in this repo than here. But yeah, if we assume they will use the skill here, we can just link to the .md local docs.
Yeah, for now I think it's valuable to keep the skills repo-specific. This way, they are more easily discoverable.

> But yeah, if we assume they will use the skill here, we can just link to the .md local docs.

That's a fair assumption I guess? How is the skill accessed otherwise then? Can we install it or something?
yeah, skills can be installed; hf also has a CLI installer. You can install skills for a project, for your user, or for your enterprise. The project-level ones are the rarest and are usually for working on that project, not for using it.
Maybe we can ship without having to install first and based on the feedback we can iterate? I don't think things will change too much. WDYT?
```python
# Global default
from diffusers import set_attention_backend
set_attention_backend("flash_attention_2")

# Per-model
pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```

## Debugging attention issues

- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool`, others require float masks with `-inf` for masked positions.
- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.
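To make the "memory spike" point concrete, here is a rough back-of-the-envelope comparison of attention memory. This is illustrative arithmetic only, not how the real kernels account memory:

```python
def naive_attention_score_bytes(seq_len, num_heads, batch=1, dtype_bytes=2):
    # The math-kernel SDPA path materializes the full
    # (batch, heads, seq, seq) attention score matrix.
    return batch * num_heads * seq_len * seq_len * dtype_bytes

def flash_attention_extra_bytes(seq_len, num_heads, batch=1, dtype_bytes=2):
    # FlashAttention only keeps O(seq) softmax statistics
    # (a running max and sum per query row).
    return batch * num_heads * seq_len * 2 * dtype_bytes

# At 16k tokens the score matrix dominates; at short lengths it is negligible.
long_naive = naive_attention_score_bytes(16_384, num_heads=24)  # exactly 12 GiB
long_flash = flash_attention_extra_bytes(16_384, num_heads=24)  # 1.5 MiB
```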
I think we won't need this if we link to the attention backends documentation?
@@ -0,0 +1,68 @@
# Layerwise Casting

## Overview
Same I guess?
We could refer Claude to https://huggingface.co/docs/diffusers/main/en/optimization/memory? And briefly discuss a few?
do you think layerwise will change? The static parts IMO should live in the skill to avoid fetching everything, no? If not, this will just be a simple skill that tells the LLMs to read the docs.
Also, the reason I did it separately is that Claude never suggests using it otherwise, especially because we don't have the "when to use" text in the docs; we could also add that to the docs as a solution.
I mean layerwise is already in the docs https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting.
So, we could introduce Claude to that skill by referring to the docs and perhaps provide a simple intro. WDYT?
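For intuition, the win from layerwise casting is easy to estimate: weights stored in fp8 take half the space of bf16, while each layer is upcast to the compute dtype only while it runs. Illustrative arithmetic with a hypothetical parameter count; see the memory docs for the actual API:

```python
def weight_storage_gb(num_params, bits_per_param):
    # Bytes occupied by the resident weights, in GiB.
    return num_params * bits_per_param / 8 / 1024**3

params = 12_000_000_000  # hypothetical ~12B-parameter video transformer

bf16_gb = weight_storage_gb(params, 16)  # ~22.35 GiB resident in bf16
fp8_gb = weight_storage_gb(params, 8)    # ~11.18 GiB with fp8 storage
```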
@@ -0,0 +1,298 @@
# Memory Calculator
How about using https://github.com/alvarobartt/hf-mem from our very own @alvarobartt?
I'm fine with it, but this would add an external dependency/install, and it also doesn't take into account RAM usage and other factors like CUDA streams. I'm fine either way. As a rule I always follow "if we can do it without AI, don't use AI", but in this case we also need to calculate other factors that the tool doesn't provide, so we can just do it all at the same time.
Okay, then let's not use the tool. Do you think we should do our own util instead and have it under utils? That is better, I guess?
- **Edit/inpainting models**: `A` includes the reference image(s) in addition to the generation activations, so budget extra.
- When in doubt, estimate conservatively: `A ≈ 5-8 GB` for typical video workloads, `A ≈ 2-4 GB` for typical image workloads. For high-resolution or long video, increase accordingly.

## Step 2: Compute VRAM and RAM per strategy
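A minimal sketch of what this step could compute. The per-strategy resident fractions below are placeholder assumptions for illustration, not measured values:

```python
def estimate_vram_gb(weights_gb, activations_gb, strategy="none"):
    # Fraction of weights resident on the GPU at peak, per offloading strategy.
    resident_weight_fraction = {
        "none": 1.0,                # everything stays on the GPU
        "model_offload": 0.5,       # roughly one component at a time (placeholder)
        "sequential_offload": 0.05, # only the executing layer (placeholder)
    }[strategy]
    return weights_gb * resident_weight_fraction + activations_gb

def estimate_ram_gb(weights_gb, strategy="none"):
    # Offloaded weights must live in CPU RAM instead.
    offloaded_fraction = {"none": 0.0, "model_offload": 0.5, "sequential_offload": 0.95}[strategy]
    return weights_gb * offloaded_fraction
```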
This should probably go to a separate offloading.md file?
@@ -0,0 +1,180 @@
# Quantization
Same as above. A lot of it could be offloaded to https://huggingface.co/docs/diffusers/main/en/quantization/overview
```python
quantization_config = PipelineQuantizationConfig(
    quant_backend="torchao",
    quant_kwargs={"quant_type": "int8_weight_only"},
    components_to_quantize=["transformer"],
)
```
Let's switch to using something like:

```python
import torch
from diffusers import DiffusionPipeline, PipelineQuantizationConfig, TorchAoConfig
from torchao.quantization import Int8WeightOnlyConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig(group_size=128, version=2))}
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
```

@@ -0,0 +1,213 @@
# Reduce Memory
We already have thorough usage in https://huggingface.co/docs/diffusers/main/en/optimization/memory. If something needs to be updated, let's update it there. This way, we serve general users of the library as well as the AI agents.
@@ -0,0 +1,72 @@
# torch.compile
Same. However, most of this content should be documented formally actually. Cc: @stevhliu could you help here?
Additionally, we could refer it to follow https://huggingface.co/docs/diffusers/main/en/optimization/speed-memory-optims for squeezing the last bit of performance.
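As a side note for the compile skill, one pattern worth documenting is guarding compilation behind a flag so the same code path can run without a GPU. A sketch; `maybe_compile` and the `DIFFUSERS_ENABLE_COMPILE` env var are hypothetical names, while the `torch.compile(..., mode="max-autotune", fullgraph=True)` call matches the compile usage shown in the diffusers docs:

```python
import os

def maybe_compile(module, enabled=None):
    """Compile the module once, up front, so graph breaks surface early."""
    if enabled is None:
        # Hypothetical opt-out switch for CPU-only/offline test runs.
        enabled = os.getenv("DIFFUSERS_ENABLE_COMPILE", "1") == "1"
    if not enabled:
        return module
    import torch
    return torch.compile(module, mode="max-autotune", fullgraph=True)
```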
@@ -0,0 +1,113 @@
---
name: optimizations
Nice. It might be worth including some of the edge cases we have encountered in the past, so that it doesn't run into the same context rot?
Don't want to complicate it further, but I don't think we sell our cool support for https://huggingface.co/docs/diffusers/main/en/optimization/cache enough.
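For what it's worth, the core idea behind those cache techniques (e.g. first-block-style caching) fits in a few lines: reuse the previous step's downstream computation when an early block's output barely changes between denoising steps. Illustrative only, not the diffusers implementation:

```python
def should_reuse_cache(prev_block_out, curr_block_out, threshold=0.1):
    """Reuse cached downstream outputs when the relative change in an
    early block's output between denoising steps is below a threshold."""
    diff = sum((a - b) ** 2 for a, b in zip(curr_block_out, prev_block_out)) ** 0.5
    norm = sum(a ** 2 for a in prev_block_out) ** 0.5
    return (diff / norm) < threshold
```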
thanks @sayakpaul, I think you're right: if we are going to use the skill here, we can mostly link all the info the model needs to the docs. I'll refactor it and do some tests again. I was thinking more of an installable skill in my mind.
Installable skills also sound good. In that case, we can always refer to the links and assume online access? |
What does this PR do?
Add a skill for all the optimizations that can be done with diffusers.
At the moment it has a lot of information, and even with this the LLM (Opus 4.6) is still hit or miss with the optimizations, especially within windows that don't give OOMs.
It does make the pipelines run though, which is the purpose, just not with the best combination IMO.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sayakpaul