Commit dc5cd04
Merge branch 'main' into overhaul-release-workflow
2 parents 6194eac + 1fe2125 commit dc5cd04

37 files changed

Lines changed: 3583 additions & 720 deletions

.ai/review-rules.md

Lines changed: 11 additions & 0 deletions
```md
# PR Review Rules

Review-specific rules for Claude. Focus on correctness — style is handled by ruff.

Before reviewing, read and apply the guidelines in:

- [AGENTS.md](AGENTS.md) — coding style, dependencies, copied code, model conventions
- [skills/model-integration/SKILL.md](skills/model-integration/SKILL.md) — attention pattern, pipeline rules, implementation checklist, gotchas
- [skills/parity-testing/SKILL.md](skills/parity-testing/SKILL.md) — testing rules, comparison utilities
- [skills/parity-testing/pitfalls.md](skills/parity-testing/pitfalls.md) — known pitfalls (dtype mismatches, config assumptions, etc.)

## Common mistakes (add new rules below this line)
```
Lines changed: 39 additions & 0 deletions
```yaml
name: Claude PR Review

on:
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]

permissions:
  contents: write
  pull-requests: write
  issues: read
  id-token: write

jobs:
  claude-review:
    if: |
      (
        github.event_name == 'issue_comment' &&
        github.event.issue.pull_request &&
        github.event.issue.state == 'open' &&
        contains(github.event.comment.body, '@claude') &&
        (github.event.comment.author_association == 'MEMBER' ||
         github.event.comment.author_association == 'OWNER' ||
         github.event.comment.author_association == 'COLLABORATOR')
      ) || (
        github.event_name == 'pull_request_review_comment' &&
        contains(github.event.comment.body, '@claude') &&
        (github.event.comment.author_association == 'MEMBER' ||
         github.event.comment.author_association == 'OWNER' ||
         github.event.comment.author_association == 'COLLABORATOR')
      )
    runs-on: ubuntu-latest
    steps:
      - uses: anthropics/claude-code-action@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
          claude_args: |
            --append-system-prompt "Review this PR against the rules in .ai/review-rules.md. Focus on correctness, not style (ruff handles style). Only review changes under src/diffusers/. Do NOT commit changes unless the comment explicitly asks you to using the phrase 'commit this'."
```
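The `if:` expression gates the job so that only `@claude` mentions from trusted commenters can trigger a review. As a plain-language paraphrase of that GitHub Actions expression (the `should_run` helper below is purely illustrative, not part of the workflow):

```python
def should_run(event_name, comment_body, author_association,
               is_pr_comment=False, issue_state=None):
    """Python paraphrase of the workflow's `if:` gate (illustrative only)."""
    trusted = author_association in {"MEMBER", "OWNER", "COLLABORATOR"}
    mentions_claude = "@claude" in comment_body
    if event_name == "issue_comment":
        # Issue comments only count when the issue is an open pull request.
        return is_pr_comment and issue_state == "open" and mentions_claude and trusted
    if event_name == "pull_request_review_comment":
        # Review comments only exist on PRs, so no open-state check is made.
        return mentions_claude and trusted
    return False
```

Note that drive-by comments from non-collaborators never trigger the job, which matters because the workflow grants `contents: write`.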

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions
```diff
@@ -670,6 +670,10 @@
       - local: api/pipelines/z_image
         title: Z-Image
     title: Image
+  - sections:
+      - local: api/pipelines/llada2
+        title: LLaDA2
+    title: Text
   - sections:
       - local: api/pipelines/allegro
         title: Allegro
@@ -718,6 +722,8 @@
   - sections:
       - local: api/schedulers/overview
         title: Overview
+      - local: api/schedulers/block_refinement
+        title: BlockRefinementScheduler
       - local: api/schedulers/cm_stochastic_iterative
         title: CMStochasticIterativeScheduler
       - local: api/schedulers/ddim_cogvideox
```

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 3 additions & 4 deletions
````diff
@@ -41,16 +41,15 @@ The quantized CogVideoX 5B model below requires ~16GB of VRAM.

 ```py
 import torch
-from diffusers import CogVideoXPipeline, AutoModel
+from diffusers import CogVideoXPipeline, AutoModel, TorchAoConfig
 from diffusers.quantizers import PipelineQuantizationConfig
 from diffusers.hooks import apply_group_offloading
 from diffusers.utils import export_to_video
+from torchao.quantization import Int8WeightOnlyConfig

 # quantize weights to int8 with torchao
 pipeline_quant_config = PipelineQuantizationConfig(
-    quant_backend="torchao",
-    quant_kwargs={"quant_type": "int8wo"},
-    components_to_quantize="transformer"
+    quant_mapping={"transformer": TorchAoConfig(Int8WeightOnlyConfig())}
 )

 # fp8 layerwise weight-casting
````
Lines changed: 90 additions & 0 deletions
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LLaDA2

[LLaDA2](https://huggingface.co/collections/inclusionAI/llada21) is a family of discrete diffusion language models that generate text through block-wise iterative refinement. Instead of autoregressive token-by-token generation, LLaDA2 starts with a fully masked sequence and progressively unmasks tokens by confidence over multiple refinement steps.
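The refinement loop can be sketched with a toy mock of the model's per-token confidences. This is illustrative only: `mock_model`, `refine_block`, and the fallback rule are hypothetical stand-ins, and the real `BlockRefinementScheduler` differs in detail.

```python
# Toy sketch of confidence-based unmasking over one block (not the real scheduler).
MASK = -1

def mock_model(block):
    # Stand-in for the transformer: return a (token, confidence) guess per position.
    return [(100 + i, 0.2 + 0.1 * i) for i in range(len(block))]

def refine_block(block, num_steps, threshold):
    for _ in range(num_steps):
        preds = mock_model(block)
        # Commit every masked position whose confidence clears the threshold...
        committed = False
        for i, (tok, conf) in enumerate(preds):
            if block[i] == MASK and conf >= threshold:
                block[i] = tok
                committed = True
        # ...but always commit at least the most confident masked token, so
        # every step makes progress (an assumed fallback for this sketch).
        if not committed:
            masked = [i for i, t in enumerate(block) if t == MASK]
            if not masked:
                break
            best = max(masked, key=lambda i: preds[i][1])
            block[best] = preds[best][0]
    return block

block = [MASK] * 8
out = refine_block(block, num_steps=8, threshold=0.7)
print(out)  # [100, 101, 102, 103, 104, 105, 106, 107]
```

With `threshold=0.7`, the three high-confidence positions commit in the first step and the rest fill in one per step, which is the basic trade-off the `threshold` parameter controls.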
## Usage

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from diffusers import BlockRefinementScheduler, LLaDA2Pipeline

model_id = "inclusionAI/LLaDA2.1-mini"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
scheduler = BlockRefinementScheduler()

pipe = LLaDA2Pipeline(model=model, scheduler=scheduler, tokenizer=tokenizer)
output = pipe(
    prompt="Write a short poem about the ocean.",
    gen_length=256,
    block_length=32,
    num_inference_steps=32,
    threshold=0.7,
    editing_threshold=0.5,
    max_post_steps=16,
    temperature=0.0,
)
print(output.texts[0])
```
## Callbacks

Callbacks run after each refinement step. Pass `callback_on_step_end_tensor_inputs` to select which tensors are included in `callback_kwargs`. In the current implementation, `block_x` (the sequence window being refined) and `transfer_index` (the mask-filling commit mask) are provided; return `{"block_x": ...}` from the callback to replace the window.

```py
def on_step_end(pipe, step, timestep, callback_kwargs):
    block_x = callback_kwargs["block_x"]
    # Inspect or modify `block_x` here.
    return {"block_x": block_x}

out = pipe(
    prompt="Write a short poem.",
    callback_on_step_end=on_step_end,
    callback_on_step_end_tensor_inputs=["block_x"],
)
```
## Recommended parameters

LLaDA2.1 models support two modes:

| Mode | `threshold` | `editing_threshold` | `max_post_steps` |
|------|-------------|---------------------|------------------|
| Quality | 0.7 | 0.5 | 16 |
| Speed | 0.5 | `None` | 16 |

Pass `editing_threshold=None`, `0.0`, or a negative value to turn off post-mask editing. For LLaDA2.0 models, disable editing by passing `editing_threshold=None` or `0.0`.

For all models: `block_length=32`, `temperature=0.0`, `num_inference_steps=32`.
## LLaDA2Pipeline

[[autodoc]] LLaDA2Pipeline
  - all
  - __call__

## LLaDA2PipelineOutput

[[autodoc]] pipelines.LLaDA2PipelineOutput

docs/source/en/api/pipelines/ltx2.md

Lines changed: 151 additions & 15 deletions
```diff
@@ -18,7 +18,7 @@
 <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
 </div>

-LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
+[LTX-2](https://hf.co/papers/2601.03233) is a DiT-based foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.

 You can find all the original LTX-Video checkpoints under the [Lightricks](https://huggingface.co/Lightricks) organization.

@@ -293,6 +293,7 @@ import torch
 from diffusers import LTX2ConditionPipeline
 from diffusers.pipelines.ltx2.pipeline_ltx2_condition import LTX2VideoCondition
 from diffusers.pipelines.ltx2.export_utils import encode_video
+from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
 from diffusers.utils import load_image, load_video

 device = "cuda"
@@ -315,19 +316,6 @@ prompt = (
     "landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the "
     "solitude and beauty of a winter drive through a mountainous region."
 )
-negative_prompt = (
-    "blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
-    "grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
-    "deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
-    "wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
-    "field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
-    "lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
-    "valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
-    "mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
-    "off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
-    "pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
-    "inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
-)

 cond_video = load_video(
     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
@@ -343,7 +331,7 @@ frame_rate = 24.0
 video, audio = pipe(
     conditions=conditions,
     prompt=prompt,
-    negative_prompt=negative_prompt,
+    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
     width=width,
     height=height,
     num_frames=121,
@@ -366,6 +354,154 @@ encode_video(

 Because the conditioning is done via latent frames, the 8 data space frames corresponding to the specified latent frame for an image condition will tend to be static.
```

The rest of the last hunk adds the two documentation sections that follow.
## Multimodal Guidance

LTX-2.X pipelines support multimodal guidance. It is composed of three terms, all using a CFG-style update rule:

1. Classifier-Free Guidance (CFG): standard [CFG](https://huggingface.co/papers/2207.12598), where the perturbed ("weaker") output is generated using the negative prompt.
2. Spatio-Temporal Guidance (STG): [STG](https://huggingface.co/papers/2411.18664) moves away from a perturbed output created by short-cutting the self-attention operations and substituting in the attention values instead. The idea is that this yields sharper videos and better spatiotemporal consistency.
3. Modality Isolation Guidance: moves away from a perturbed output created by disabling cross-modality (audio-to-video and video-to-audio) cross-attention. This guidance is specific to [LTX-2.X](https://huggingface.co/papers/2601.03233) models; the idea is that it produces better consistency between the generated audio and video.

These are controlled by the `guidance_scale`, `stg_scale`, and `modality_scale` arguments and can be set separately for video and audio. Additionally, for STG the transformer block indices where self-attention is skipped need to be specified via the `spatio_temporal_guidance_blocks` argument. The LTX-2.X pipelines also support [guidance rescaling](https://huggingface.co/papers/2305.08891) to help reduce over-exposure, which can be a problem when the guidance scales are set to high values.
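As a rough sketch, the three terms can be folded into a single CFG-style update. This is an illustrative paraphrase under assumptions, not the pipeline's actual implementation: the `guided_pred` helper is hypothetical, and the exact way LTX-2 combines and normalizes the terms may differ. The scale conventions below match the comments in the example that follows (CFG and modality guidance are neutral at 1.0, STG at 0.0).

```python
from statistics import pstdev

def guided_pred(cond, uncond, stg_pert, mod_pert,
                guidance_scale, stg_scale, modality_scale, guidance_rescale=0.0):
    """Combine three CFG-style guidance terms over per-element predictions.

    Illustrative sketch only; argument names mirror the pipeline arguments,
    but the real LTX-2 combination rule is an assumption here.
    """
    pred = [
        c
        + (guidance_scale - 1.0) * (c - u)       # CFG: neutral at 1.0
        + stg_scale * (c - s)                    # STG: neutral at 0.0
        + (modality_scale - 1.0) * (c - m)       # modality isolation: neutral at 1.0
        for c, u, s, m in zip(cond, uncond, stg_pert, mod_pert)
    ]
    if guidance_rescale > 0.0:
        # Pull the guided prediction's standard deviation back toward the
        # conditional prediction's, reducing over-exposure (paper 2305.08891).
        scale = pstdev(cond) / pstdev(pred)
        pred = [guidance_rescale * p * scale + (1.0 - guidance_rescale) * p
                for p in pred]
    return pred
```

With all scales at their neutral values the guided prediction reduces to the conditional one, which is why raising any scale pushes the output away from its corresponding "weaker" branch.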
```py
import torch
from diffusers import LTX2ImageToVideoPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT
from diffusers.utils import load_image

device = "cuda"
width = 768
height = 512
random_seed = 42
frame_rate = 24.0
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "dg845/LTX-2.3-Diffusers"

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)
pipe.vae.enable_tiling()

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg",
)

video, audio = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=30,
    guidance_scale=3.0,  # Recommended LTX-2.3 guidance parameters
    stg_scale=1.0,  # Note that 0.0 (not 1.0) means that STG is disabled (all other guidance is disabled at 1.0)
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,  # Note that a higher CFG guidance scale is recommended for audio
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    use_cross_timestep=True,
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_i2v_stage_1.mp4",
)
```
## Prompt Enhancement

The LTX-2.X models are sensitive to prompting style. Refer to the [official prompting guide](https://ltx.io/model/model-blog/prompting-guide-for-ltx-2) for recommendations on how to write a good prompt. Prompt enhancement, where the supplied prompts are rewritten by the pipeline's text encoder (by default a [Gemma 3](https://huggingface.co/google/gemma-3-12b-it-qat-q4_0-unquantized) model) under a given system prompt, can also improve sample quality. The optional `processor` pipeline component must be present to use prompt enhancement. Enable it by supplying a `system_prompt` argument:
```py
import torch
from transformers import Gemma3Processor
from diffusers import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.pipelines.ltx2.utils import DEFAULT_NEGATIVE_PROMPT, T2V_DEFAULT_SYSTEM_PROMPT

device = "cuda"
width = 768
height = 512
random_seed = 42
frame_rate = 24.0
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "dg845/LTX-2.3-Diffusers"

pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload(device=device)
pipe.vae.enable_tiling()
if getattr(pipe, "processor", None) is None:
    processor = Gemma3Processor.from_pretrained("google/gemma-3-12b-it-qat-q4_0-unquantized")
    pipe.processor = processor

prompt = (
    "An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in "
    "gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs "
    "before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small "
    "fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly "
    "shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a "
    "smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the "
    "distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a "
    "breath-taking, movie-like shot."
)

video, audio = pipe(
    prompt=prompt,
    negative_prompt=DEFAULT_NEGATIVE_PROMPT,
    width=width,
    height=height,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=30,
    guidance_scale=3.0,
    stg_scale=1.0,
    modality_scale=3.0,
    guidance_rescale=0.7,
    audio_guidance_scale=7.0,
    audio_stg_scale=1.0,
    audio_modality_scale=3.0,
    audio_guidance_rescale=0.7,
    spatio_temporal_guidance_blocks=[28],
    use_cross_timestep=True,
    system_prompt=T2V_DEFAULT_SYSTEM_PROMPT,
    generator=generator,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate,
    output_path="ltx2_3_t2v_stage_1.mp4",
)
```
## LTX2Pipeline

[[autodoc]] LTX2Pipeline

docs/source/en/api/pipelines/overview.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -63,6 +63,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
 | [Latent Diffusion](latent_diffusion) | text2image, super-resolution |
 | [Latte](latte) | text2image |
 | [LEDITS++](ledits_pp) | image editing |
+| [LLaDA2](llada2) | text2text |
 | [Lumina-T2X](lumina) | text2image |
 | [Marigold](marigold) | depth-estimation, normals-estimation, intrinsic-decomposition |
 | [MultiDiffusion](panorama) | text2image |
```
