huggingface · Enderfga · May 22, 2026 · May 23, 2026 · May 23, 2026 · May 23, 2026
diff --git a/docs/source/en/api/models/anyflow_far_transformer3d.md b/docs/source/en/api/models/anyflow_far_transformer3d.md
@@ -13,19 +13,22 @@ specific language governing permissions and limitations under the License.
 # AnyFlowFARTransformer3DModel
 
 The causal (FAR) 3D Transformer used by [`AnyFlowFARPipeline`](../pipelines/anyflow#anyflowfarpipeline) —
-the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS
-ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:
+the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724). See the
+[`AnyFlowFARPipeline`](../pipelines/anyflow) page for the paper, authors, and released checkpoints. It extends
+the v0.35.1 Wan2.1 backbone with three additions:
 
-1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
-   generation as introduced in [FAR (Gu et al., 2025)](https://arxiv.org/abs/2503.19325).
+1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting chunk-wise autoregressive
+   generation as introduced in [FAR](https://huggingface.co/papers/2503.19325).
 2. **Compressed-frame patch embedding** (`far_patch_embedding`) for context (already-generated) frames,
    warm-started from the full-resolution `patch_embedding` at construction time via trilinear interpolation.
 3. **Dual-timestep flow-map embedding** (same as
    [`AnyFlowTransformer3DModel`](anyflow_transformer3d)) — every forward call conditions on both the source
    timestep ``t`` and the target timestep ``r``.
 
-The chunk schedule (`chunk_partition`) is **not** baked into the model config. It is a per-call argument to
-`forward`, so the same checkpoint handles different `num_frames` configurations without retraining.
+The default chunk schedule (`chunk_partition`) is stored in the model config; the released NVIDIA AnyFlow-FAR
+checkpoints use `[1, 3, 3, 3, 3, 3, 3, 2]` for the canonical 81-frame setting. `forward` accepts a per-call
+`chunk_partition` override, so the same checkpoint also handles other `num_frames` configurations without
+retraining.
 
 ```python
 from diffusers import AnyFlowFARTransformer3DModel
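
The chunk-wise causal masking described in this file can be pictured with a small frame-level sketch. It assumes that attention is bidirectional inside a chunk and causal across chunks; the hunk above does not spell out the exact rule. The helper names `chunk_ids` and `block_causal_mask` are hypothetical and not part of diffusers. The real model builds its mask over tokens with `torch.nn.attention.flex_attention`, where a predicate of this shape could plausibly act as the `mask_mod` once token indices are mapped to frames.

```python
def chunk_ids(chunk_partition):
    """Map each latent frame index to the index of the chunk that contains it."""
    ids = []
    for chunk_index, size in enumerate(chunk_partition):
        ids.extend([chunk_index] * size)
    return ids


def block_causal_mask(chunk_partition):
    """Return mask[q][k] == True when latent frame q may attend to latent frame k.

    Assumption (not confirmed by the diff): frames attend freely within their own
    chunk and to every earlier chunk, but never to a later chunk.
    """
    ids = chunk_ids(chunk_partition)
    return [[k_chunk <= q_chunk for k_chunk in ids] for q_chunk in ids]


# Default partition of the released checkpoints: 21 latent frames in 8 chunks.
mask = block_causal_mask([1, 3, 3, 3, 3, 3, 3, 2])
print(len(mask))                            # 21
print(mask[0][1])                           # False: frame 0 cannot see the next chunk
print(mask[2][1], mask[2][3], mask[2][4])   # True True False
```

Frame 2 sits in the second chunk (frames 1 to 3), so it sees frames 1 and 3 but not frame 4, which opens the third chunk.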

diff --git a/docs/source/en/api/models/anyflow_transformer3d.md b/docs/source/en/api/models/anyflow_transformer3d.md
@@ -16,10 +16,11 @@ The bidirectional 3D Transformer used by [`AnyFlowPipeline`](../pipelines/anyflo
 v0.35.1 Wan2.1 backbone with one structural change: the timestep embedder is replaced by
 ``AnyFlowDualTimestepTextImageEmbedding``, so every forward call conditions on both the source timestep
 ``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
-:math:`\Phi_{r\leftarrow t}` introduced in
-[AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS ShowLab × NVIDIA).
+$\Phi_{r\leftarrow t}$ introduced in
+[AnyFlow](https://huggingface.co/papers/2605.13724). See the [`AnyFlowPipeline`](../pipelines/anyflow) page
+for the paper, authors, and released checkpoints.
 
-For frame-level autoregressive (FAR causal) generation, use
+For chunk-wise autoregressive (FAR causal) generation, use
 [`AnyFlowFARTransformer3DModel`](anyflow_far_transformer3d) instead.
 
 ```python

diff --git a/docs/source/en/api/pipelines/anyflow.md b/docs/source/en/api/pipelines/anyflow.md
@@ -20,68 +20,28 @@ specific language governing permissions and limitations under the License.
 
 # AnyFlow
 
-[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.
+[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) from NVIDIA, National University of Singapore, and Massachusetts Institute of Technology, by Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou.
+
+> **TL;DR:** AnyFlow is the first any-step video diffusion framework built on flow maps, which enables a single model (bidirectional or causal) to adapt to arbitrary inference budgets.
 
 *Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*
 
-The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).
+The AnyFlow pipelines were contributed by the AnyFlow Team. The original code is available on [GitHub](https://github.com/NVlabs/AnyFlow), the project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow), and pretrained models can be found in the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow) collection on Hugging Face.
 
-The following AnyFlow checkpoints are supported:
+Available Models:
 
 | Checkpoint | Backbone | Description |
-|------------|----------|-------------|
-| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
-| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
+|---|---|---|
+| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V |
+| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V |
 | [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
 | [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |
 
-All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
-
 > [!TIP]
-> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.
-
-> [!TIP]
-> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.
-
-### Optimizing Memory and Inference Speed
-
-<hfoptions id="optimization">
-<hfoption id="memory">
-
-```py
-import torch
-from diffusers import AnyFlowPipeline
-from diffusers.hooks import apply_group_offloading
-
-pipe = AnyFlowPipeline.from_pretrained(
-    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
-)
-apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
-pipe.vae.enable_slicing()
-pipe.vae.enable_tiling()
-```
-
-</hfoption>
-<hfoption id="inference speed">
-
-```py
-import torch
-from diffusers import AnyFlowPipeline
-
-pipe = AnyFlowPipeline.from_pretrained(
-    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
-).to("cuda")
-pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
-```
-
-</hfoption>
-</hfoptions>
+> `AnyFlowPipeline` runs the bidirectional models for text-to-video (T2V) generation. `AnyFlowFARPipeline` runs the chunk-wise causal (FAR) models and supports text-to-video (T2V) generation, image-to-video (I2V) generation, and video continuation (V2V).
 
 ### Generation with AnyFlow (Bidirectional T2V)
 
-<hfoptions id="anyflow-bidi">
-<hfoption id="usage">
-
 ```py
 import torch
 from diffusers import AnyFlowPipeline
@@ -91,14 +51,16 @@ pipe = AnyFlowPipeline.from_pretrained(
     "nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
 ).to("cuda")
 
-prompt = "A red panda eating bamboo in a forest, cinematic lighting"
-video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
-export_to_video(video, "out.mp4", fps=16)
+prompt = (
+    "An astronaut runs smoothly and appears almost weightless on the lunar surface, "
+    "as seen from a low-angle shot that highlights the vast, desolate background of the moon. "
+    "The moon's craters and rocky terrain are clearly visible, creating a stark contrast against "
+    "the running astronaut who moves with graceful, fluid motions."
+)
+video = pipe(prompt, num_inference_steps=4, num_frames=81).frames[0]
+export_to_video(video, "anyflow_t2v.mp4", fps=16)
 ```
 
-</hfoption>
-</hfoptions>
-
 ### Generation with AnyFlow (FAR Causal)
 
 The causal pipeline selects between T2V / I2V / V2V via the ``video`` (or ``video_latents``) argument:
@@ -108,10 +70,10 @@ clip for V2V continuation. If you already have pre-encoded latents in the model
 ``video_latents=<tensor>`` to skip VAE encoding. ``video`` and ``video_latents`` are mutually exclusive.
 
 > [!IMPORTANT]
-> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
-> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
-> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
-> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
+> The released checkpoints bake `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) into the transformer
+> config, matched to the canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
+> you change `num_frames`, also pass a matching `chunk_partition` that sums to `(num_frames - 1) // 4 + 1`;
+> otherwise the pipeline raises a `ValueError`.
 
 <hfoptions id="anyflow-far">
 <hfoption id="t2v">
@@ -125,12 +87,12 @@ pipe = AnyFlowFARPipeline.from_pretrained(
     "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
 ).to("cuda")
 
-video = pipe(
-    prompt="A cat surfing a wave, sunset",
-    num_inference_steps=4,
-    num_frames=81,
-).frames[0]
-export_to_video(video, "out.mp4", fps=16)
+prompt = (
+    "An astronaut runs smoothly and appears almost weightless on the lunar surface, "
+    "as seen from a low-angle shot that highlights the vast, desolate background of the moon."
+)
+video = pipe(prompt, num_inference_steps=4, num_frames=81).frames[0]
+export_to_video(video, "anyflow_far_t2v.mp4", fps=16)
 ```
 
 </hfoption>
@@ -146,18 +108,25 @@ pipe = AnyFlowFARPipeline.from_pretrained(
     "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
 ).to("cuda")
 
-# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
-first_frame = load_image("path/to/first_frame.png").resize((832, 480))
+# Example conditioning image from the AnyFlow repo.
+first_frame = load_image(
+    "https://raw.githubusercontent.com/NVlabs/AnyFlow/main/assets/evaluation/example/images/1.jpg"
+).resize((832, 480))
 arr = np.asarray(first_frame).astype("float32") / 255.0  # (480, 832, 3)
-context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
+context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")  # (1, 1, 3, 480, 832)
 
+prompt = (
+    "A towering, battle-scarred humanoid robot, reminiscent of a Transformer with powerful, segmented armor "
+    "and glowing red optics, walking through the skeletal remains of a city ruin. Twisted metal and shattered "
+    "concrete crunch under its heavy steps, as the robot scans the desolate, dust-choked skyline under a dark sky."
+)
 video = pipe(
-    prompt="a cat walks across a sunlit lawn",
+    prompt=prompt,
     video=context_tensor,
     num_inference_steps=4,
     num_frames=81,
 ).frames[0]
-export_to_video(video, "out.mp4", fps=16)
+export_to_video(video, "anyflow_far_i2v.mp4", fps=16)
 ```
 
 </hfoption>
@@ -173,21 +142,26 @@ pipe = AnyFlowFARPipeline.from_pretrained(
     "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
 ).to("cuda")
 
-# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
-context_frames = load_video("path/to/context.mp4")[:9]
+# Example conditioning clip from the AnyFlow repo — take the first 9 frames (3 latent frames at VAE temporal stride 4).
+context_frames = load_video(
+    "https://raw.githubusercontent.com/NVlabs/AnyFlow/main/assets/evaluation/example/videos/2.mp4"
+)[:9]
 arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
-# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
 context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda")  # (1, 9, 3, 480, 832)
 
+prompt = (
+    "A focused trail runner's powerful strides through a dense, sun-dappled forest. "
+    "The camera tracks alongside, highlighting muscular exertion, sweat, and determined facial expression."
+)
 video = pipe(
-    prompt="continue the story",
+    prompt=prompt,
     video=context_tensor,
     num_inference_steps=4,
     num_frames=81,
     # Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
     chunk_partition=[3, 3, 3, 3, 3, 3, 3],
 ).frames[0]
-export_to_video(video, "out.mp4", fps=16)
+export_to_video(video, "anyflow_far_v2v.mp4", fps=16)
 ```
 
 </hfoption>
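
The `chunk_partition` rule from the `[!IMPORTANT]` note in this file can be checked before calling the pipeline. The sketch below only restates the arithmetic given in the docs: the sum must equal `(num_frames - 1) // 4 + 1`. The helpers `num_latent_frames` and `check_chunk_partition` are hypothetical names for illustration, not diffusers APIs, and the pipeline's own validation may differ in wording or detail.

```python
VAE_TEMPORAL_STRIDE = 4  # temporal stride stated in the docs above


def num_latent_frames(num_frames, stride=VAE_TEMPORAL_STRIDE):
    """Number of latent frames for a clip of `num_frames` raw frames."""
    return (num_frames - 1) // stride + 1


def check_chunk_partition(num_frames, chunk_partition):
    """Raise ValueError unless the chunk sizes sum to the latent frame count."""
    expected = num_latent_frames(num_frames)
    total = sum(chunk_partition)
    if total != expected:
        raise ValueError(
            f"chunk_partition sums to {total}, but num_frames={num_frames} "
            f"needs {expected} latent frames"
        )
    return expected


print(check_chunk_partition(81, [1, 3, 3, 3, 3, 3, 3, 2]))  # 21 (default partition)
print(check_chunk_partition(81, [3, 3, 3, 3, 3, 3, 3]))     # 21 (V2V override above)
print(num_latent_frames(9))                                 # 3 (the 9-frame context clip)
```

Both partitions cover the same 21 latent frames. The V2V override differs only in making the first chunk span the 3 context latent frames.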

diff --git a/docs/source/zh/using-diffusers/anyflow.md b/docs/source/zh/using-diffusers/anyflow.md
@@ -22,7 +22,7 @@ scores often drop as NFE increases.
 re-noising between sampling steps; the on-policy distillation stage additionally uses **DMD reverse-divergence supervision** + **Flow-Map backward simulation**
 (3-segment shortcut) to close the exposure-bias gap left by consistency distillation.
 
-AnyFlow was developed by Yuchao Gu, Guian Fang, and colleagues at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), and the project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow). The 4 released checkpoints are grouped in the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
+AnyFlow is a collaboration between NVIDIA, the National University of Singapore (NUS), and MIT, by Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, and Mike Zheng Shou. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), the project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow), and the 4 released checkpoints are grouped in the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
 
 This guide covers the practical essentials: how to choose a pipeline, how to use any-step sampling, and how to fit AnyFlow into T2V / I2V / V2V workflows.
 
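@@ -100,7 +100,7 @@ prompt = "A red panda eating bamboo in a forest, cinematic lighting"
 for nfe in [1, 2, 4, 8, 16, 32]:
     # Recreate the generator on every iteration so that NFE is the only variable when comparing step counts.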
     generator = torch.Generator("cuda").manual_seed(0)
-    video = pipe(prompt, num_inference_steps=nfe, num_frames=33, generator=generator).frames[0]
+    video = pipe(prompt, num_inference_steps=nfe, num_frames=81, generator=generator).frames[0]
     export_to_video(video, f"out_nfe{nfe}.mp4", fps=16)
 ```
 
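@@ -125,11 +125,11 @@ The causal pipeline supports three task modes with a single distilled model, **via `vid
 The context tensor's frame count must satisfy `T = 4n + 1` to align with the VAE temporal stride.
 
 > [!IMPORTANT]
-> The FAR pipeline rolls out chunk by chunk, so `num_frames` must agree with the chunk schedule. The default
-> `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) corresponds to the released checkpoints' standard `num_frames=81`
-> (21 = (81 − 1) // 4 + 1). When you change `num_frames`, you **must** explicitly pass a matching `chunk_partition` whose sum equals
-> `(num_frames - 1) // 4 + 1`; otherwise the pipeline raises an `AssertionError`. For example, `num_frames=33` corresponds to 9 latent
-> frames, for which `chunk_partition=[1, 4, 4]` works.
+> The FAR pipeline rolls out chunk by chunk, so `num_frames` must agree with the chunk schedule. The released checkpoints
+> store `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) in the transformer config, corresponding to the standard
+> `num_frames=81` (21 = (81 − 1) // 4 + 1). When you change `num_frames`, you **must** explicitly pass a matching `chunk_partition`
+> whose sum equals `(num_frames - 1) // 4 + 1`; otherwise the pipeline raises a `ValueError`. For example, `num_frames=33` corresponds to
+> 9 latent frames, for which `chunk_partition=[1, 4, 4]` works.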
 
 ```py
 import numpy as np
@@ -183,33 +183,6 @@ export_to_video(video, "v2v.mp4", fps=16)
 If you already have VAE-encoded latents, pass `video_latents=<tensor>` directly to skip the `vae_encode` step
 (mutually exclusive with `video`).
 
-## Memory and Inference Speed
-
-The 14B AnyFlow model runs on a single 40 GB GPU with group offloading + VAE slicing:
-
-```py
-import torch
-from diffusers import AnyFlowPipeline
-from diffusers.hooks import apply_group_offloading
-
-pipe = AnyFlowPipeline.from_pretrained(
-    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
-)
-apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
-pipe.vae.enable_slicing()
-pipe.vae.enable_tiling()
-```
-
-For latency, `torch.compile` works well on the transformer (the heaviest module):
-
-```py
-pipe = pipe.to("cuda")
-pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
-```
-
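-The compilation overhead is amortized after a few steps; combined with AnyFlow's low NFE (4-8 steps), `torch.compile` gives a clear speedup
-over eager mode on the 14B model.
-
 ## LoRA Fine-Tuning
 
 Both pipelines reuse [`WanLoraLoaderMixin`](../api/loaders/lora), so LoRAs trained for the corresponding Wan2.1 backbone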

diff --git a/scripts/convert_anyflow_to_diffusers.py b/scripts/convert_anyflow_to_diffusers.py
@@ -57,13 +57,21 @@
     "AnyFlow-FAR-Wan2.1-1.3B-Diffusers": {
         "base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
         "transformer_cls": AnyFlowFARTransformer3DModel,
-        "transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
+        "transformer_kwargs": {
+            "full_chunk_limit": 3,
+            "compressed_patch_size": [1, 4, 4],
+            "chunk_partition": [1, 3, 3, 3, 3, 3, 3, 2],
+        },
         "pipeline_cls": AnyFlowFARPipeline,
     },
     "AnyFlow-FAR-Wan2.1-14B-Diffusers": {
         "base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
         "transformer_cls": AnyFlowFARTransformer3DModel,
-        "transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
+        "transformer_kwargs": {
+            "full_chunk_limit": 3,
+            "compressed_patch_size": [1, 4, 4],
+            "chunk_partition": [1, 3, 3, 3, 3, 3, 3, 2],
+        },
         "pipeline_cls": AnyFlowFARPipeline,
     },
     "AnyFlow-Wan2.1-T2V-1.3B-Diffusers": {