15 changes: 9 additions & 6 deletions docs/source/en/api/models/anyflow_far_transformer3d.md
Expand Up @@ -13,19 +13,22 @@ specific language governing permissions and limitations under the License.
# AnyFlowFARTransformer3DModel

The causal (FAR) 3D Transformer used by [`AnyFlowFARPipeline`](../pipelines/anyflow#anyflowfarpipeline) —
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS
ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724). See the
[`AnyFlowFARPipeline`](../pipelines/anyflow) page for paper, authors, and released checkpoints. It extends
the v0.35.1 Wan2.1 backbone with three additions:

1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
generation as introduced in [FAR (Gu et al., 2025)](https://arxiv.org/abs/2503.19325).
1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting chunk-wise autoregressive
generation as introduced in [FAR](https://huggingface.co/papers/2503.19325).
2. **Compressed-frame patch embedding** (`far_patch_embedding`) for context (already-generated) frames,
warm-started from the full-resolution `patch_embedding` at construction time via trilinear interpolation.
3. **Dual-timestep flow-map embedding** (same as
[`AnyFlowTransformer3DModel`](anyflow_transformer3d)) — every forward call conditions on both the source
timestep ``t`` and the target timestep ``r``.
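
To make the first addition concrete, here is a minimal, frame-level sketch of a chunk-wise causal mask. It is an illustration only, not the model's `flex_attention` implementation: the helper names `chunk_ids` and `block_causal_mask` are hypothetical, and the real mask operates on tokens rather than whole frames. The assumption is that a latent frame may attend to every frame in its own chunk and in all earlier chunks.

```python
def chunk_ids(chunk_partition):
    """Map each latent frame index to the index of the chunk that contains it."""
    ids = []
    for chunk_index, chunk_size in enumerate(chunk_partition):
        ids.extend([chunk_index] * chunk_size)
    return ids


def block_causal_mask(chunk_partition):
    """Return mask[q][kv] == True when query frame q may attend to key frame kv.

    Assumption (hypothetical sketch): attention is allowed within a chunk and
    towards all earlier chunks, never towards later chunks.
    """
    ids = chunk_ids(chunk_partition)
    return [[kv_chunk <= q_chunk for kv_chunk in ids] for q_chunk in ids]


if __name__ == "__main__":
    # Toy partition: 5 latent frames split into chunks of 1, 2 and 2 frames.
    for row in block_causal_mask([1, 2, 2]):
        print("".join("x" if allowed else "." for allowed in row))
```

With the toy partition `[1, 2, 2]`, the first frame sees only itself, the two frames of the second chunk see the first three frames, and the last chunk sees everything.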

The chunk schedule (`chunk_partition`) is **not** baked into the model config. It is a per-call argument to
`forward`, so the same checkpoint handles different `num_frames` configurations without retraining.
The default chunk schedule (`chunk_partition`) is stored in the model config; the released NVIDIA AnyFlow-FAR
checkpoints use `[1, 3, 3, 3, 3, 3, 3, 2]` for the canonical 81-frame setting. `forward` accepts a per-call
`chunk_partition` override, so the same checkpoint also handles other `num_frames` configurations without
retraining.
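
One way to derive a partition for another `num_frames` is to mirror the shape of the released default: a single leading latent frame, then chunks of three, then whatever remains. The helper below is a hypothetical sketch (`make_chunk_partition` is not part of diffusers), and whether the checkpoints behave well with partitions produced this way is an assumption rather than something the docs state.

```python
def make_chunk_partition(num_latent_frames, first_chunk=1, chunk_size=3):
    """Split `num_latent_frames` into [first_chunk, chunk_size, ..., remainder].

    Hypothetical helper: it only mirrors the shape of the released default
    partition and makes no claim about generation quality.
    """
    if num_latent_frames < first_chunk:
        raise ValueError("num_latent_frames must be at least first_chunk")
    remaining = num_latent_frames - first_chunk
    full_chunks, tail = divmod(remaining, chunk_size)
    partition = [first_chunk] + [chunk_size] * full_chunks
    if tail:
        partition.append(tail)
    return partition


if __name__ == "__main__":
    # 21 latent frames is the canonical 81-frame setting.
    print(make_chunk_partition(21))
```

For 21 latent frames this reproduces the released default `[1, 3, 3, 3, 3, 3, 3, 2]`; any other list with the right sum is equally valid as far as the sum constraint goes.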

```python
from diffusers import AnyFlowFARTransformer3DModel
Expand Down
7 changes: 4 additions & 3 deletions docs/source/en/api/models/anyflow_transformer3d.md
Expand Up @@ -16,10 +16,11 @@ The bidirectional 3D Transformer used by [`AnyFlowPipeline`](../pipelines/anyflo
v0.35.1 Wan2.1 backbone with one structural change: the timestep embedder is replaced by
``AnyFlowDualTimestepTextImageEmbedding``, so every forward call conditions on both the source timestep
``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
:math:`\Phi_{r\leftarrow t}` introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS ShowLab × NVIDIA).
$\Phi_{r\leftarrow t}$ introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724). See the [`AnyFlowPipeline`](../pipelines/anyflow) page
for paper, authors, and released checkpoints.

For frame-level autoregressive (FAR causal) generation, use
For chunk-wise autoregressive (FAR causal) generation, use
[`AnyFlowFARTransformer3DModel`](anyflow_far_transformer3d) instead.

```python
Expand Down
124 changes: 49 additions & 75 deletions docs/source/en/api/pipelines/anyflow.md
Expand Up @@ -20,68 +20,28 @@ specific language governing permissions and limitations under the License.

# AnyFlow

[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) from NVIDIA, National University of Singapore, and Massachusetts Institute of Technology, by Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou.

> **TL;DR:** AnyFlow is the first any-step video diffusion framework built on flow maps, which enables a single model (bidirectional or causal) to adapt to arbitrary inference budgets.

*Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*

The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).
The AnyFlow pipelines were contributed by the AnyFlow Team. The original code is available on [GitHub](https://github.com/NVlabs/AnyFlow), the project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow), and pretrained models can be found in the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) collection on Hugging Face.

The following AnyFlow checkpoints are supported:
Available Models:

| Checkpoint | Backbone | Description |
|------------|----------|-------------|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
|---|---|---|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |

All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.

> [!TIP]
> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.

> [!TIP]
> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.

### Optimizing Memory and Inference Speed

<hfoptions id="optimization">
<hfoption id="memory">

```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

</hfoption>
<hfoption id="inference speed">

```py
import torch
from diffusers import AnyFlowPipeline

pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```

</hfoption>
</hfoptions>
> `AnyFlowPipeline` is designed for bidirectional diffusion models in text-to-video (T2V) generation. `AnyFlowFARPipeline` is a chunk-wise causal diffusion model that supports text-to-video (T2V) generation, image-to-video (I2V) generation, and video continuation (V2V).

### Generation with AnyFlow (Bidirectional T2V)

<hfoptions id="anyflow-bidi">
<hfoption id="usage">

```py
import torch
from diffusers import AnyFlowPipeline
Expand All @@ -91,14 +51,16 @@ pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "out.mp4", fps=16)
prompt = (
"An astronaut runs smoothly and appears almost weightless on the lunar surface, "
"as seen from a low-angle shot that highlights the vast, desolate background of the moon. "
"The moon's craters and rocky terrain are clearly visible, creating a stark contrast against "
"the running astronaut who moves with graceful, fluid motions."
)
video = pipe(prompt, num_inference_steps=4, num_frames=81).frames[0]
export_to_video(video, "anyflow_t2v.mp4", fps=16)
```

</hfoption>
</hfoptions>

### Generation with AnyFlow (FAR Causal)

The causal pipeline selects between T2V / I2V / V2V via the ``video`` (or ``video_latents``) argument:
Expand All @@ -108,10 +70,10 @@ clip for V2V continuation. If you already have pre-encoded latents in the model
``video_latents=<tensor>`` to skip VAE encoding. ``video`` and ``video_latents`` are mutually exclusive.

> [!IMPORTANT]
> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
> The released checkpoints bake `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) into the transformer
> config, matched to the canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, pass a matching `chunk_partition` summing to `(num_frames - 1) // 4 + 1`,
> otherwise the pipeline raises a `ValueError`.
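
The arithmetic in the note above can be checked before calling the pipeline. This is a minimal sketch under the stated assumption of a VAE temporal stride of 4; `num_latent_frames` and `check_chunk_partition` are hypothetical names, and the pipeline's own validation may differ in wording and detail.

```python
def num_latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for `num_frames` raw frames: (num_frames - 1) // stride + 1."""
    return (num_frames - 1) // temporal_stride + 1


def check_chunk_partition(chunk_partition, num_frames, temporal_stride=4):
    """Raise ValueError unless the partition sums to the latent frame count."""
    expected = num_latent_frames(num_frames, temporal_stride)
    total = sum(chunk_partition)
    if total != expected:
        raise ValueError(
            f"chunk_partition sums to {total}, but num_frames={num_frames} "
            f"needs {expected} latent frames"
        )
    return expected


if __name__ == "__main__":
    # Canonical setting: 81 raw frames and the released default partition.
    print(check_chunk_partition([1, 3, 3, 3, 3, 3, 3, 2], num_frames=81))
```

Both the default partition and the V2V override `[3, 3, 3, 3, 3, 3, 3]` pass this check for `num_frames=81`, since each sums to 21.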

<hfoptions id="anyflow-far">
<hfoption id="t2v">
Expand All @@ -125,12 +87,12 @@ pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
prompt="A cat surfing a wave, sunset",
num_inference_steps=4,
num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
prompt = (
"An astronaut runs smoothly and appears almost weightless on the lunar surface, "
"as seen from a low-angle shot that highlights the vast, desolate background of the moon."
)
video = pipe(prompt, num_inference_steps=4, num_frames=81).frames[0]
export_to_video(video, "anyflow_far_t2v.mp4", fps=16)
```

</hfoption>
Expand All @@ -146,18 +108,25 @@ pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
# Example conditioning image from the AnyFlow repo.
first_frame = load_image(
"https://raw.githubusercontent.com/NVlabs/AnyFlow/main/assets/evaluation/example/images/1.jpg"
).resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0 # (480, 832, 3)
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda") # (1, 1, 3, 480, 832)

prompt = (
"A towering, battle-scarred humanoid robot, reminiscent of a Transformer with powerful, segmented armor "
"and glowing red optics, walking through the skeletal remains of a city ruin. Twisted metal and shattered "
"concrete crunch under its heavy steps, as the robot scans the desolate, dust-choked skyline under a dark sky."
)
video = pipe(
prompt="a cat walks across a sunlit lawn",
prompt=prompt,
video=context_tensor,
num_inference_steps=4,
num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
export_to_video(video, "anyflow_far_i2v.mp4", fps=16)
```

</hfoption>
Expand All @@ -173,21 +142,26 @@ pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
context_frames = load_video("path/to/context.mp4")[:9]
# Example conditioning clip from the AnyFlow repo — take the first 9 frames (3 latent frames at VAE temporal stride 4).
context_frames = load_video(
"https://raw.githubusercontent.com/NVlabs/AnyFlow/main/assets/evaluation/example/videos/2.mp4"
)[:9]
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda") # (1, 9, 3, 480, 832)

prompt = (
"A focused trail runner's powerful strides through a dense, sun-dappled forest. "
"The camera tracks alongside, highlighting muscular exertion, sweat, and determined facial expression."
)
video = pipe(
prompt="continue the story",
prompt=prompt,
video=context_tensor,
num_inference_steps=4,
num_frames=81,
# Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
chunk_partition=[3, 3, 3, 3, 3, 3, 3],
).frames[0]
export_to_video(video, "out.mp4", fps=16)
export_to_video(video, "anyflow_far_v2v.mp4", fps=16)
```

</hfoption>
Expand Down
41 changes: 7 additions & 34 deletions docs/source/zh/using-diffusers/anyflow.md
Expand Up @@ -22,7 +22,7 @@ increasing NFE often lowers quality instead.
re-noising between sampling steps; the on-policy distillation stage additionally uses **DMD reverse-divergence supervision** + **Flow-Map backward simulation**
(3-segment shortcut) to close the exposure-bias gap left by consistency distillation.

AnyFlow was developed by Yuchao Gu, Guian Fang, and others at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), and the project page is [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow). The 4 released checkpoints are grouped in the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
AnyFlow is a collaboration between NVIDIA, the National University of Singapore (NUS), and MIT, authored by Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), and the project page is [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow). The 4 released checkpoints are grouped in the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.

This document covers the practical essentials: how to choose a pipeline, how to use any-step sampling, and how to embed AnyFlow in T2V / I2V / V2V workflows.

Expand Down Expand Up @@ -100,7 +100,7 @@ prompt = "A red panda eating bamboo in a forest, cinematic lighting"
for nfe in [1, 2, 4, 8, 16, 32]:
    # Recreate the generator on every iteration so that NFE is the only variable when comparing step counts.
    generator = torch.Generator("cuda").manual_seed(0)
    video = pipe(prompt, num_inference_steps=nfe, num_frames=33, generator=generator).frames[0]
    video = pipe(prompt, num_inference_steps=nfe, num_frames=81, generator=generator).frames[0]
    export_to_video(video, f"out_nfe{nfe}.mp4", fps=16)
```

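Expand All @@ -125,11 +125,11 @@ The causal pipeline supports three task modes with the same distilled model, **via `vid
The number of frames in the context tensor must satisfy `T = 4n + 1` so that it aligns with the VAE temporal stride.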
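
A context clip of arbitrary length can be trimmed to the nearest valid `4n + 1` length before it is turned into a tensor. The sketch below assumes a temporal stride of 4 as stated above; `trim_to_aligned_length` and `context_latent_frames` are hypothetical helpers, not diffusers functions.

```python
def trim_to_aligned_length(num_context_frames, temporal_stride=4):
    """Largest length of the form stride * n + 1 that fits in the clip."""
    if num_context_frames < 1:
        raise ValueError("the context clip needs at least one frame")
    return (num_context_frames - 1) // temporal_stride * temporal_stride + 1


def context_latent_frames(num_context_frames, temporal_stride=4):
    """Latent frames covered by the trimmed context clip."""
    aligned = trim_to_aligned_length(num_context_frames, temporal_stride)
    return (aligned - 1) // temporal_stride + 1


if __name__ == "__main__":
    # A 12-frame clip is trimmed to 9 raw frames, which cover 3 latent frames.
    print(trim_to_aligned_length(12), context_latent_frames(12))
```

The number of latent context frames is what the first entry of `chunk_partition` has to cover in the V2V case, for example 3 for a 9-frame clip.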

> [!IMPORTANT]
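> The FAR pipeline is a chunk-wise rollout, so `num_frames` must be consistent with the chunk schedule. The default
> `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) corresponds to the released checkpoints' canonical `num_frames=81`
> (21 = (81 − 1) // 4 + 1). When you change `num_frames`, you **must** explicitly pass a matching `chunk_partition` whose sum equals
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`. For example, `num_frames=33` corresponds to 9 latent
> frames, for which `chunk_partition=[1, 4, 4]` works.
> The FAR pipeline is a chunk-wise rollout, so `num_frames` must be consistent with the chunk schedule. The released checkpoints store
> `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) in the transformer config, corresponding to the canonical
> `num_frames=81` (21 = (81 − 1) // 4 + 1). When you change `num_frames`, you **must** explicitly pass a matching `chunk_partition`
> whose sum equals `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises a `ValueError`. For example, `num_frames=33` corresponds to
> 9 latent frames, for which `chunk_partition=[1, 4, 4]` works.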

```py
import numpy as np
Expand Down Expand Up @@ -183,33 +183,6 @@ export_to_video(video, "v2v.mp4", fps=16)
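If you already have VAE-encoded latents, you can pass `video_latents=<tensor>` directly to skip the `vae_encode` step
(mutually exclusive with `video`).

## Memory and Inference Speed

The 14B AnyFlow model runs on a single 40 GB GPU with group offloading + VAE slicing: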

```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```

For latency, `torch.compile` works well on the transformer (the heaviest module):

```py
pipe = pipe.to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```

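The compilation overhead is amortized after a few runs; combined with AnyFlow's low NFE (4-8 steps), `torch.compile` gives a
noticeable speedup over eager mode on the 14B model.

## LoRA Fine-Tuning

Both pipelines reuse [`WanLoraLoaderMixin`](../api/loaders/lora), so LoRAs trained for the corresponding Wan2.1 backbone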
Expand Down
12 changes: 10 additions & 2 deletions scripts/convert_anyflow_to_diffusers.py
Expand Up @@ -57,13 +57,21 @@
"AnyFlow-FAR-Wan2.1-1.3B-Diffusers": {
"base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"transformer_cls": AnyFlowFARTransformer3DModel,
"transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
"transformer_kwargs": {
"full_chunk_limit": 3,
"compressed_patch_size": [1, 4, 4],
"chunk_partition": [1, 3, 3, 3, 3, 3, 3, 2],
},
"pipeline_cls": AnyFlowFARPipeline,
},
"AnyFlow-FAR-Wan2.1-14B-Diffusers": {
"base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
"transformer_cls": AnyFlowFARTransformer3DModel,
"transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
"transformer_kwargs": {
"full_chunk_limit": 3,
"compressed_patch_size": [1, 4, 4],
"chunk_partition": [1, 3, 3, 3, 3, 3, 3, 2],
},
"pipeline_cls": AnyFlowFARPipeline,
},
"AnyFlow-Wan2.1-T2V-1.3B-Diffusers": {
Expand Down