[feat] Add AnyFlow any-step video distillation (pretrain + on-policy) by Enderfga · Pull Request #1371 · hao-ai-lab/FastVideo

Enderfga · 2026-05-19T08:02:53Z

Summary

Adds AnyFlow (paper, official code, model weights) — an any-step video diffusion framework built on flow maps — as a first-class distillation method in FastVideo, covering both training stages.

A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32, 50} without retraining; quality scales monotonically with steps, unlike consistency-based distillation which often degrades at higher NFE. The student network u_θ(x_t, t, r) predicts the average velocity from time t back to time r, so one Euler step is

x_r = x_t - ((t - r) / N) · u_θ(x_t, t, r), where N = num_train_timesteps
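As a minimal sketch of the step above (not the FastVideo implementation), the update can be written on plain Python lists. The name `flow_map_euler_step`, the stand-in velocity function, and the value N = 1000 are assumptions made for illustration only.

```python
def flow_map_euler_step(x_t, u, t, r, num_train_timesteps=1000):
    """One any-step Euler update: x_r = x_t - ((t - r) / N) * u(x_t, t, r)."""
    scale = (t - r) / num_train_timesteps
    velocity = u(x_t, t, r)
    return [x - scale * v for x, v in zip(x_t, velocity)]


def constant_velocity(x, t, r):
    # Hypothetical stand-in for the student network u_theta.
    return [2.0 for _ in x]


# Jump from t=500 straight to r=0 in a single step.
x_r = flow_map_euler_step([1.0, -1.0], constant_velocity, t=500.0, r=0.0)
# x_r == [0.0, -2.0]
```

Because the network predicts the average velocity over the whole interval from t to r, the same function covers both a single large jump and many small ones.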

Two new methods on top of FastVideo's TrainingMethod framework:

AnyFlowPretrainMethod (fastvideo/train/methods/distribution_matching/anyflow_pretrain.py) — flow-map pretrain via central-difference target with (t, r) per-batch sampling (50% diffusion / 25% consistency / 25% free).
AnyFlowMethod (fastvideo/train/methods/distribution_matching/anyflow.py) — on-policy multi-step Euler-flow rollout subclassing DMD2Method. One randomly-chosen step is gradient-enabled (broadcast from rank 0); the rest run under torch.no_grad.

Supporting changes:

WanTimeTextImageEmbedding gains an optional dual-timestep branch (r_embedder=True → allocate a delta_embedder deep-copied from time_embedder + register a non-persistent gate buffer). Two fusion modes:
- additive (default): temb_t + g · delta_emb
- gated: (1 - g) · temb_t + g · delta_emb (matches AnyFlow's WanTwoTimeTextImageEmbedding)
- The additive default with no r_timestep passed is byte-identical to the legacy path — every existing config (DMD2, Self-Forcing, KD, DFSFT) stays bit-equal.
FlowMapEulerDiscreteScheduler (fastvideo/models/schedulers/) — standalone any-step Euler scheduler with apply_shift, get_train_weight(beta08), step(model_output, sample, t, r), add_noise. No diffusers ConfigMixin dependency.
WanModel.predict_velocity_with_r — adds an r_timestep to the transformer kwargs through the same set_forward_context plumbing as predict_noise.
WanVideoArchConfig.param_names_mapping gets two regex entries that rename condition_embedder.delta_embedder.* → condition_embedder.delta_embedder.mlp.fc_{in,out}.* so the published nvidia/AnyFlow-Wan2.1-T2V-*-Diffusers checkpoints load as-is through FastVideo's existing loader pipeline. Both regexes are no-ops on plain Wan checkpoints.
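To illustrate the kind of key rename the two regex entries perform, here is a hedged sketch using Python's `re` module. The source-side names `linear_1` / `linear_2` are an assumption about the published checkpoint layout, and `DELTA_EMBEDDER_MAPPING` and `remap_key` are hypothetical names; the actual patterns in WanVideoArchConfig may differ.

```python
import re

# Hypothetical patterns: the linear_1 / linear_2 source names are assumed,
# only the mlp.fc_in / mlp.fc_out target layout comes from the PR text.
DELTA_EMBEDDER_MAPPING = [
    (r"^condition_embedder\.delta_embedder\.linear_1\.(.*)$",
     r"condition_embedder.delta_embedder.mlp.fc_in.\1"),
    (r"^condition_embedder\.delta_embedder\.linear_2\.(.*)$",
     r"condition_embedder.delta_embedder.mlp.fc_out.\1"),
]


def remap_key(key):
    """Return the renamed key, or the key unchanged if no pattern matches."""
    for pattern, replacement in DELTA_EMBEDDER_MAPPING:
        new_key, count = re.subn(pattern, replacement, key)
        if count:
            return new_key
    return key
```

Keys that do not mention delta_embedder fall through untouched, which is what makes such a mapping a no-op on plain Wan checkpoints.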

Two reference YAMLs under examples/train/configs/distribution_matching/wan/:

anyflow_pretrain_t2v.yaml — Wan-T2V-1.3B pretrain, r_embedder: true, gate=0.25, shift=5.0, epsilon=5, weight_type=beta08, fuse_guidance_scale=3.0, lr 5e-5, 6k steps.
anyflow_onpolicy_t2v.yaml — Wan-T2V-1.3B on-policy with Wan-14B teacher, dmd_denoising_steps=[999, 937, 833, 624], t_list_override=[999, 937, 833, 624, 0], student_sample_steps=4, use_mean_velocity=true, lr 2e-6.
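The on-policy config pairs dmd_denoising_steps with a t_list_override that adds a trailing 0. A sketch of how such a schedule could be sanity-checked is shown below; the three rules (length, strictly decreasing, ends at 0) and the name `validate_t_list` are assumptions for illustration, not the checks AnyFlowMethod actually performs.

```python
def validate_t_list(t_list, student_sample_steps):
    """Check a pinned schedule and return its (t, t_next) step pairs."""
    if len(t_list) != student_sample_steps + 1:
        raise ValueError("expected student_sample_steps + 1 entries")
    if any(a <= b for a, b in zip(t_list, t_list[1:])):
        raise ValueError("timesteps must be strictly decreasing")
    if t_list[-1] != 0:
        raise ValueError("schedule must end at t = 0")
    return list(zip(t_list[:-1], t_list[1:]))


pairs = validate_t_list([999, 937, 833, 624, 0], student_sample_steps=4)
# pairs == [(999, 937), (937, 833), (833, 624), (624, 0)]
```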

Algorithm + usage write-up at docs/distillation/anyflow.md (registered in mkdocs.yml nav).

Numerical verification

End-to-end parity verified on a single H200 against NVlabs/AnyFlow's reference loader on the published 1.3B checkpoint (nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers). Repro: scripts/verify_anyflow_fastvideo_parity.py.

Check	Metric	Result
Checkpoint key remap	missing / unexpected	0 / 0
Forward parity	mean abs diff	1.00e-2 (bf16)
Forward parity	max abs diff	7.81e-2 (bf16)
Forward parity	rel mean diff	2.55%
4-step Euler-flow sampling	final latent stats	finite, std=0.799, range [-3.97, +3.91]
Training-step loss	AnyFlow reference	0.381619
Training-step loss	FastVideo port	0.386694
Training-step loss	rel diff	1.33%
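For readers who want to reproduce the relative figures in the table, here is a small sketch of the arithmetic. The definition of "rel mean diff" as mean absolute difference divided by mean absolute reference value is an assumption; the parity script may normalize differently. The function names are hypothetical.

```python
def rel_diff(reference, candidate):
    """Relative difference of a scalar against a reference value."""
    return abs(candidate - reference) / abs(reference)


def parity_metrics(ref, out):
    """Mean/max absolute difference plus an assumed relative-mean metric."""
    diffs = [abs(a - b) for a, b in zip(ref, out)]
    mean_abs = sum(diffs) / len(diffs)
    mean_ref = sum(abs(a) for a in ref) / len(ref)
    return {
        "mean_abs": mean_abs,
        "max_abs": max(diffs),
        "rel_mean": mean_abs / mean_ref,
    }


# Training-step losses from the table: AnyFlow reference vs FastVideo port.
loss_rel = rel_diff(0.381619, 0.386694)
print(f"{loss_rel:.2%}")  # prints 1.33%
```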

For comparison, the sister FastGen port (NVlabs/FastGen#25) reports 2.8% forward / 4.07% training-loss on the same checkpoint — FastVideo's slightly tighter result is consistent with its attention/normalization implementation having marginally lower kernel noise on H200.

Sample videos

Generated end-to-end through FastVideo with the new FlowMapEulerDiscreteScheduler: nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers, 81 frames @ 480×832, seed=0, shift=5.0, guidance_scale=1.0 (no CFG — the on-policy distilled checkpoint has fuse_guidance_scale=3.0 baked into the weights, matching AnyFlow's official demo.py default).

Same prompt as AnyFlow's demo.py (the "majestic elephant running towards a herd" prompt). Same model, two NFE settings:

14B NFE=4 (~30 s on a single H200):

fastvideo_anyflow_14b_nfe4_v1.mp4

14B NFE=50 (~5 min on a single H200):

fastvideo_anyflow_14b_nfe50_v1.mp4

Quality scales monotonically with NFE — NFE=4 is already coherent, NFE=50 produces sharper textures and more stable motion.

Test plan

pytest fastvideo/tests/training/distill/test_anyflow_{pretrain,onpolicy}.py — 43 CPU unit tests (Xavier-init helper for ReplicatedLinear so the embedder forward produces finite outputs in CPU mode); covers (t, r) sampling, central-difference target math, scheduler numerics, embedder bit-identity in additive mode, on-policy rollout gradient masking, t_list_override validation, checkpoint key remap.
Forward + training-step + sampling parity on Wan2.1-T2V-1.3B — see table above; script at scripts/verify_anyflow_fastvideo_parity.py.
14B inference end-to-end at NFE=4 and NFE=50 — see videos above; script at scripts/demo_anyflow_14b.py.
/test distillation — GPU smoke on Buildkite (pretrain + on-policy, 2 iters each via fastvideo/tests/training/distill/test_anyflow_smoke.py).

Compatibility

The default r_embedder=False arch config keeps every existing Wan-based method byte-identical to main — no delta_embedder allocated, no extra forward computation, no state_dict additions. Existing CI (DMD2, Self-Forcing, KD, DFSFT) should be unaffected.

Out of scope

The FAR causal variant from the paper. The bidirectional Wan-T2V backbone supports the full any-step recipe; the FAR variant can be added in a follow-up PR if there's interest.
LoRA-only training mode. AnyFlow's reference repo supports LoRA adapters on the student / real_score / discriminator. Wiring LoRA into FastVideo's role models cleanly is a focused follow-up rather than something to wedge into the core algorithm port.

Related PRs

huggingface/diffusers#13745 — inference pipelines (AnyFlowPipeline, AnyFlowFARPipeline) + FlowMapEulerDiscreteScheduler.
NVlabs/FastGen#25 — training port (same algorithm, FastGen framework).

…tity)

Adds four fields to WanVideoArchConfig for AnyFlow dual-timestep conditioning:
- r_embedder: bool, enables the delta_embedder allocation
- r_embedder_fusion: 'additive' (default) or 'gated'
- r_embedder_gate_value: float, gate g in gated fusion
- r_embedder_deltatime_type: 'r' (default) or 't-r'

Also adds two regex entries to param_names_mapping so HF AnyFlow checkpoints load delta_embedder weights into the FastVideo-internal mlp.fc_in/fc_out layout. Both regexes are no-ops on plain Wan checkpoints.

The defaults preserve bit-identity with every existing Wan-based method (DMD2, Self-Forcing, KD, DFSFT): no delta_embedder is allocated and no extra computation runs on the embedder forward path.

When r_embedder=True, allocates a delta_embedder (deep-copied from time_embedder for matching initialization, per the AnyFlow reference's setup_flowmap_model() pattern) and registers a non-persistent gate buffer. Forward now accepts an optional r_timestep:
- additive (default): temb_t + g * delta_emb
- gated: (1 - g) * temb_t + g * delta_emb

Both branches are gated by r_embedder=True AND r_timestep is not None, so existing call sites that don't pass r_timestep stay byte-identical (verified by test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy). The gate is a non-persistent buffer: it's a hyperparameter, not a learned weight, and stays out of state_dict so checkpoints remain portable across gate values. deltatime_type controls whether the delta_embedder consumes r directly (default, matching AnyFlow's deltatime_type='r') or (t - r).
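The two fusion modes and the legacy fall-through can be sketched on plain lists as follows. This is an illustration of the formulas quoted above, not the WanTimeTextImageEmbedding code; `fuse_timestep_embeddings` is a hypothetical name, and passing `delta_emb=None` stands in for the case where no r_timestep is supplied.

```python
def fuse_timestep_embeddings(temb_t, delta_emb, gate, mode="additive"):
    """Combine the t-embedding with the delta embedding.

    additive: temb_t + g * delta_emb
    gated:    (1 - g) * temb_t + g * delta_emb
    """
    if delta_emb is None:
        # Legacy single-timestep path: the embedding is returned untouched.
        return list(temb_t)
    if mode == "additive":
        return [a + gate * d for a, d in zip(temb_t, delta_emb)]
    if mode == "gated":
        return [(1.0 - gate) * a + gate * d for a, d in zip(temb_t, delta_emb)]
    raise ValueError(f"unknown fusion mode: {mode!r}")


additive = fuse_timestep_embeddings([4.0, 8.0], [0.0, 4.0], gate=0.25)
gated = fuse_timestep_embeddings([4.0, 8.0], [0.0, 4.0], gate=0.25, mode="gated")
# additive == [4.0, 9.0], gated == [3.0, 7.0]
```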

The forward signature now accepts r_timestep as an explicit kwarg (not silently swallowed by **kwargs), gets flattened in parallel with timestep when the input is 2-D (Wan 2.2 ti2v case), and is forwarded to the condition_embedder. The constructor now propagates the four arch_config flags (r_embedder/r_embedder_fusion/r_embedder_gate_value/r_embedder_deltatime_type) into the WanTimeTextImageEmbedding so a YAML can opt into AnyFlow's dual-timestep conditioning through training.dit_config overrides. Defaults preserve byte-identity with the legacy single-timestep path — when r_timestep is omitted the delta_embedder branch is skipped entirely (test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy).

Standalone any-step Euler scheduler implementing AnyFlow's flow-map step formula x_r = x_t - ((t - r) / num_train_timesteps) * u(x_t, t, r). Also provides the AnyFlow training-time helpers apply_shift (flow-matching shift transform), get_train_weight (per-timestep loss weight with beta08 = t * sqrt(1-t) renormalized to num_train_timesteps total mass), and add_noise (linear flow-matching interpolation). Subclasses BaseScheduler so it slots into the existing FastVideo scheduler discovery surface (timesteps/order/num_train_timesteps attributes plus set_shift/set_timesteps/scale_model_input). Has no diffusers ConfigMixin/SchedulerMixin dependency. set_timesteps accepts custom_timesteps so configs can pin AnyFlow's hand-tuned schedules (e.g. [999, 937, 833, 624, 0] for the 4-step Wan2.1 setting from the paper).
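Two of the helpers described above are simple enough to sketch in a few lines. The beta08 expression t * sqrt(1 - t) and the renormalization to a total mass of num_train_timesteps come from the text; the uniform grid t_i = i / N and the function names are assumptions, so the real scheduler may evaluate the weight on a different (for example shifted) grid.

```python
import math


def beta08_weights(num_train_timesteps=1000):
    """Per-timestep weights t * sqrt(1 - t), rescaled to sum to N."""
    # Assumed grid: t_i = i / N for i in 0..N-1.
    ts = [i / num_train_timesteps for i in range(num_train_timesteps)]
    raw = [t * math.sqrt(1.0 - t) for t in ts]
    total = sum(raw)
    return [w * num_train_timesteps / total for w in raw]


def add_noise(x0, noise, sigma):
    """Linear flow-matching interpolation: (1 - sigma) * x0 + sigma * noise."""
    return [(1.0 - sigma) * a + sigma * e for a, e in zip(x0, noise)]


weights = beta08_weights()
halfway = add_noise([1.0], [3.0], sigma=0.5)  # [2.0]
```

The raw weight t * sqrt(1 - t) vanishes at both ends of the interval and peaks at t = 2/3, so mid-to-high noise levels receive the most emphasis.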

Adds the single-student AnyFlow flow-map pretrain method as a TrainingMethod subclass. The __init__ parses and validates all method config knobs (diffusion_ratio, consistency_ratio, epsilon, weight_type, fuse_guidance_scale, shift), builds a FlowMapEulerDiscreteScheduler, and wires a single optimizer+scheduler over the student.

The (t, r) sampling helper _sample_pair_timesteps lives at module scope so it can be unit-tested without instantiating the full method. It matches the AnyFlow paper formulation:
- two uniform draws u1, u2; t = max, r = min
- first diffusion_ratio * B: r := t (plain flow matching)
- next consistency_ratio * B: r := 0 (consistency to clean data)
- remainder: free reconstruction range

single_train_step raises NotImplementedError for now; it is filled in by the next commit, which adds the central-difference target and the loss assembly (Task 7).
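The sampling recipe listed above can be sketched with the standard library alone. This follows the four bullets as written and is not a copy of _sample_pair_timesteps; the function name, the use of `int()` to size the partitions, and the default ratios (taken from the 50% / 25% / 25% split in the PR summary) are assumptions.

```python
import random


def sample_pair_timesteps(batch_size, diffusion_ratio=0.5,
                          consistency_ratio=0.25, rng=None):
    """Sample (t, r) pairs in [0, 1) with r <= t, partitioned by batch index."""
    rng = rng or random.Random()
    n_diffusion = int(diffusion_ratio * batch_size)
    n_consistency = int(consistency_ratio * batch_size)
    pairs = []
    for i in range(batch_size):
        u1, u2 = rng.random(), rng.random()
        t, r = max(u1, u2), min(u1, u2)
        if i < n_diffusion:
            r = t          # plain flow matching
        elif i < n_diffusion + n_consistency:
            r = 0.0        # consistency to clean data
        # else: free (t, r) pair, left as sampled
        pairs.append((t, r))
    return pairs


pairs = sample_pair_timesteps(8, rng=random.Random(0))
```

With a batch of 8 and the default ratios, the first four pairs have r = t, the next two have r = 0, and the last two keep their sampled r.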

…n_step

Implements AnyFlow's central-difference loss path end-to-end:

1. WanModel gains predict_velocity_with_r(noisy, t, r, batch, ...), which mirrors predict_noise's forward_context + autocast plumbing but injects r_timestep into the transformer kwargs.
2. anyflow_pretrain._central_difference_dF_dt: symmetric finite difference over (t ± delta) with the sample also moved along the flow trajectory by v_pred * (delta / num_train_timesteps), mirroring AnyFlow's reference trainer_wan_anyflow_pretrain.py::compute_central_difference. Wrapped in torch.no_grad so the two extra forwards stay out of the backward graph.
3. AnyFlowPretrainMethod.single_train_step:
   - sample (t, r) ∈ [0, 1] via _sample_pair_timesteps
   - apply scheduler shift + scale to absolute units
   - rebuild noisy with the flow-map scheduler's add_noise
   - run conditional + (optional) unconditional student forwards
   - apply guidance distillation when fuse_guidance_scale != 1
   - compute target = (eps - x0) - (t - r) * dF/dt
   - per-sample MSE * per-timestep weight
   - stop-grad scale balance so non-diffusion losses match the diffusion branch's magnitude
   - emit metrics for diffusion/consistency fractions + scale weight mean
4. AnyFlowPretrainMethod.backward overrides TrainingMethod.backward to route the loss through self.student.backward inside the correct forward_context, matching DMD2Method's pattern.
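The central-difference target in step 2 and the target formula in step 3 can be illustrated with a toy model. This sketch works in normalized time (t, r in [0, 1]), so the delta / num_train_timesteps rescaling mentioned above is folded into `delta`; that simplification, the function names, and the analytic stand-in `toy_model` are all assumptions.

```python
def central_difference_dF_dt(F, x, t, r, v, delta):
    """Symmetric finite difference of F along the flow trajectory.

    The sample is moved by +/- v * delta together with t, so the result
    approximates the total derivative dF/dt at fixed r.
    """
    x_plus = [xi + vi * delta for xi, vi in zip(x, v)]
    x_minus = [xi - vi * delta for xi, vi in zip(x, v)]
    f_plus = F(x_plus, t + delta, r)
    f_minus = F(x_minus, t - delta, r)
    return [(p - m) / (2.0 * delta) for p, m in zip(f_plus, f_minus)]


def flow_map_target(eps, x0, t, r, dF_dt):
    """target = (eps - x0) - (t - r) * dF/dt, elementwise."""
    return [(e - c) - (t - r) * d for e, c, d in zip(eps, x0, dF_dt)]


def toy_model(x, t, r):
    # Hypothetical stand-in for the student: F = 3t + x, so dF/dt = 3 + v.
    return [3.0 * t + xi for xi in x]


eps, x0 = [1.0], [-1.0]
v = [e - c for e, c in zip(eps, x0)]            # instantaneous velocity, 2.0
dF_dt = central_difference_dF_dt(toy_model, [0.5], 0.5, 0.25, v, delta=0.125)
target = flow_map_target(eps, x0, t=0.5, r=0.25, dF_dt=dF_dt)
# dF_dt == [5.0], target == [0.75]
```

When r = t the correction term vanishes and the target reduces to the plain flow-matching velocity eps - x0, which is the diffusion branch of the (t, r) sampler.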

Subclasses DMD2Method and overrides _student_rollout to run student_sample_steps Euler-flow steps from pure noise. One step is gradient-enabled, with the chosen index broadcast from rank 0 in distributed runs so every worker agrees (matching AnyFlow's WanAnyFlowPipeline.training_rollout).

Method config knobs:
- student_sample_steps (default 4)
- use_mean_velocity (default True; r = t_next vs r = t at each step)
- t_list_override (optional pinned schedule, e.g. AnyFlow paper's [999, 937, 833, 624, 0] for 4-step Wan2.1)
- dmd_score_r_value (default 0.0; r used by the inherited DMD loss)
- real_score_guidance_scale (inherited from DMD2; default 1.0)

The inherited _dmd_loss / _critic_flow_matching_loss machinery is reused verbatim: both call predict_x0 / predict_noise on the student, which go through the single-timestep WanModel forward path. DMD2 therefore scores against r=0 implicitly via the inherited code; if r=t is desired, this can be wired in a follow-up by overriding _dmd_loss to call predict_velocity_with_r.

CPU-only tests construct the method via object.__new__, bypassing the full __init__ wire-up; GPU integration coverage lives in the smoke test (Task 11).
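The shape of the rollout loop can be sketched without any deep-learning framework. Here the gradient-enabled step is only flagged; in the real method that step would run with autograd on and the others under torch.no_grad, and the index would be broadcast from rank 0. The function name, the `grad_enabled` callback argument, and the choice to always step to t_next regardless of use_mean_velocity are assumptions.

```python
import random


def student_rollout(u, x, t_list, grad_step=None, use_mean_velocity=True,
                    num_train_timesteps=1000, rng=None):
    """Multi-step Euler-flow rollout with exactly one flagged gradient step."""
    num_steps = len(t_list) - 1
    if grad_step is None:
        # Single-process fallback: pick the gradient step locally.
        grad_step = (rng or random.Random()).randrange(num_steps)
    trace = []
    for i in range(num_steps):
        t, t_next = t_list[i], t_list[i + 1]
        r = t_next if use_mean_velocity else t
        grad_enabled = i == grad_step
        velocity = u(x, t, r, grad_enabled)
        scale = (t - t_next) / num_train_timesteps
        x = [xi - scale * vi for xi, vi in zip(x, velocity)]
        trace.append((t, r, grad_enabled))
    return x, trace


def unit_velocity(x, t, r, grad_enabled):
    # Hypothetical stand-in for the student network.
    return [1.0 for _ in x]


x_final, trace = student_rollout(
    unit_velocity, [0.0], [999, 937, 833, 624, 0], grad_step=2)
```

With the pinned 4-step schedule, the network sees r = 937, 833, 624, 0 in turn, and only the third call is flagged as gradient-enabled.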

Two reference configs slot into examples/train/configs/distribution_matching/wan/:
- anyflow_pretrain_t2v.yaml: single-student pretrain stage. Enables r_embedder under pipeline.dit_config, sets diffusion/consistency ratios, epsilon=5, weight_type=beta08, fuse_guidance_scale=3.0, lr 5e-5, 6k steps, Wan2.1 T2V 1.3B init.
- anyflow_onpolicy_t2v.yaml: full DMD2 trio (student + Wan2.1 14B teacher + Wan2.1 1.3B critic). Pins the 4-step rollout schedule [999, 937, 833, 624, 0] from the AnyFlow paper, use_mean_velocity=true, generator_update_interval=5, lr 2e-6, 4k steps.

The student loads from a path placeholder which can be either the local pretrain output or the published nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers checkpoint; the latter case relies on the param_names_mapping regex in WanVideoArchConfig to rename delta_embedder weights.

Mirrors test_distill_dmd.py's subprocess-torchrun pattern but points at the new YAML entrypoint (fastvideo.train.entrypoint.train). Auto-skipped when fewer than 2 CUDA devices are visible — the new TrainingMethod framework's distributed barriers don't tolerate single-GPU bring-up. On Buildkite the test fires under /test distillation; on GMI we can run it directly with NUM_GPUS=2 NUM_NODES=1 pytest -v -k anyflow_smoke.

Single-page summary of AnyFlow's two-stage recipe:
- Algorithm intuition (u_θ(x_t, t, r) → any-step Euler)
- Stage 1 pretrain: central-difference target + (t, r) sampling
- Stage 2 on-policy DMD: multi-step rollout with grad-step broadcast
- Launch commands for both YAMLs
- Note on loading nvidia/AnyFlow-Wan2.1-T2V-* checkpoints (handled by param_names_mapping, no separate adapter)
- Note on fuse_guidance_scale parameterization

Registered under mkdocs nav so it surfaces in the built doc tree.

…r CPU runs FastVideo's ReplicatedLinear allocates weights via torch.empty and relies on a downstream load_weights pass to populate them. CPU unit tests bypass that pass, so weights start as NaN/Inf and the embedder forward produces NaN everywhere. Add _init_uninitialized_weights that Xavier-init's every >=2D param and zeros the rest before .eval(). Verified on GMI (gpu-h200-06 with 8x H200): all 43 tests pass.

Six-stage end-to-end verification, run via single-rank torchrun-less srun on a single H200:
(1) Build FastVideo WanTransformer3DModel with r_embedder=True, r_embedder_fusion=gated, gate=0.25.
(2) Load nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers safetensors and translate keys via WanVideoArchConfig.param_names_mapping (0 missing / 0 unexpected; the delta_embedder regex is sufficient).
(3) Build AnyFlow's reference loader (FAR_Wan_Transformer3DModel).
(4) Forward parity on identical inputs (bf16 noise).
(5) 4-step Euler-flow sampling smoke via FlowMapEulerDiscreteScheduler.
(6) Training-step central-difference loss comparison (inline replica of AnyFlow's train_bidirection).

Measured on Wan2.1-T2V-1.3B + nvidia/AnyFlow checkpoint:
- forward rel mean diff: 2.55%
- forward max abs diff: 7.81e-2
- training loss diff: 1.33% (AnyFlow 0.381619 vs FastVideo 0.386694)

Both within bf16 kernel noise. Compare to the FastGen port at NVlabs/FastGen#25, which reported 2.8% forward + 4.07% training-loss on the same checkpoint; FastVideo's tighter result is consistent with FastVideo's attention/normalization implementation having slightly lower kernel noise on H200 than FastGen's.

Single-rank demo script that loads nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers into FastVideo's WanTransformer3DModel via param_names_mapping, then samples 81 frames at 480x832 with the new FlowMapEulerDiscreteScheduler at both NFE=4 (~30s) and NFE=50 (~5min) and decodes each via the Wan VAE (tiling enabled). Uses guidance_scale=1.0 since the on-policy distilled checkpoint has fuse_guidance_scale=3.0 baked into the weights, matching the AnyFlow paper's official demo.py default.

Memory tactics for single H200 (141 GB HBM):
- Encode prompts with UMT5 first, free the text encoder.
- Build 14B transformer (~28 GB bf16), load AnyFlow shards.
- Sample at both NFEs, free transformer, then VAE decode with tiling.

Peak GPU usage ~57 GB on a single H200.

github-actions

Welcome to FastVideo! Thanks for your first pull request.

How our CI works:

PRs run a two-tier CI system:

Pre-commit — formatting (yapf), linting (ruff), type checking (mypy). Runs immediately on every PR.
Fastcheck — core GPU tests (encoders, VAEs, transformers, kernels, unit tests). Runs automatically via Buildkite on relevant file changes (~10-15 min).
Full Suite — integration tests, training pipelines, SSIM regression. Runs only when a reviewer adds the ready label.

Before your PR is reviewed:

pre-commit run --all-files passes locally
You've added or updated tests for your changes
The PR description explains what and why

If pre-commit fails, a bot comment will explain how to fix it. Fastcheck and Full Suite results appear in the Checks section below.

mergify · 2026-05-19T08:03:37Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

check-success~=pre-commit

This rule is failing.

check-success~=pre-commit
#approved-reviews-by>=1
check-success=fastcheck-passed
check-success=full-suite-passed
title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
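The title rule above is a regular expression, so it can be checked locally before opening a PR. The sketch below assumes Python-style `re.search` semantics for the `~=` operator; `TITLE_RULE` and `title_passes` are hypothetical names.

```python
import re

# Pattern copied from the merge-protection rule; (?i) makes it case-insensitive.
TITLE_RULE = (
    r"(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|"
    r"kernel|new.?model|skill|skills|infra)\]"
)


def title_passes(title):
    """Return True when the PR title starts with an accepted [prefix]."""
    return re.search(TITLE_RULE, title) is not None


ok = title_passes(
    "[feat] Add AnyFlow any-step video distillation (pretrain + on-policy)")
# ok is True
```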

gemini-code-assist

Code Review

This pull request implements the AnyFlow any-step video distillation framework, adding dual-timestep conditioning to the Wan model, a new FlowMapEulerDiscreteScheduler, and training methods for pre-training and on-policy distillation. The PR includes comprehensive tests and documentation. A review comment correctly identified and provided a fix for a typo in the arXiv link for the AnyFlow paper.

gemini-code-assist · 2026-05-19T08:15:52Z

@@ -0,0 +1,116 @@
+# 🌊 AnyFlow Any-Step Video Distillation
+
+**AnyFlow** ([paper](https://arxiv.org/abs/2605.13724), [project page](https://nvlabs.github.io/AnyFlow/), [official code](https://github.com/NVlabs/AnyFlow), [model weights](https://huggingface.co/collections/nvidia/anyflow)) is an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32} without retraining, and quality scales **monotonically** with steps — unlike consistency-based distillation, which often degrades as NFE grows.


The link to the AnyFlow paper appears to have a typo. The year-month part 2605 is likely incorrect for a 2024 paper. The correct link should probably point to arxiv.org/abs/2405.13724.

Suggested change

**AnyFlow** ([paper](https://arxiv.org/abs/2605.13724), [project page](https://nvlabs.github.io/AnyFlow/), [official code](https://github.com/NVlabs/AnyFlow), [model weights](https://huggingface.co/collections/nvidia/anyflow)) is an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32} without retraining, and quality scales **monotonically** with steps — unlike consistency-based distillation, which often degrades as NFE grows.

**AnyFlow** ([paper](https://arxiv.org/abs/2405.13724), [project page](https://nvlabs.github.io/AnyFlow/), [official code](https://github.com/NVlabs/AnyFlow), [model weights](https://huggingface.co/collections/nvidia/anyflow)) is an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32} without retraining, and quality scales **monotonically** with steps — unlike consistency-based distillation, which often degrades as NFE grows.

The arxiv ID is correct as written. AnyFlow was posted to arxiv this month (2026-05), so the prefix is 2605, not 2405. https://arxiv.org/abs/2605.13724 resolves to the right paper; https://arxiv.org/abs/2405.13724 is a different (2024) submission that this suggestion has been pattern-matched to.

SolitaryThinker · 2026-05-22T09:45:00Z

Hi @Enderfga — this is a code review from one of @SolitaryThinker's AI reviewer agents (Gob). I run these to help triage PRs but @SolitaryThinker hasn't personally verified every finding. If anything below doesn't match what you know about the code, please ping @SolitaryThinker — they'll take a closer look.

TL;DR

The PR is in unusually good shape for its size (2,932 LoC). Bit-identity defaults are properly guarded and explicitly tested, numerical parity vs the upstream NVlabs/AnyFlow reference is asserted with real thresholds (2.55% / 1.33% rel-diff on bf16), distributed gradient-step broadcast is correct, scheduler conforms to BaseScheduler ABC, all 14 commits are clean of AI co-author trailers. Three minor S3 polish items only.

Verdict: approve-with-followup

S0 (blockers): 0
S1 (must-fix): 0
S2 (should-fix): 0
S3 (discussion): 3 — surfaced below since all findings are S3 and none are blocking; see review.md for full detail.

What I checked

Default-path bit-identity (the central claim). The r_embedder=False default is properly guarded everywhere:

WanTimeTextImageEmbedding.__init__ sets self.delta_embedder = None as a plain Python attribute (not a registered submodule) — no state-dict entry.
The _r_embedder_gate buffer is only registered inside the if self._r_embedder_enabled branch (and registered with persistent=False, so AnyFlow checkpoints stay portable across different gate hyperparameters).
The new r_timestep is not None and r_timestep.dim() == 2 flatten branch in WanTransformer3DModel.forward is a no-op when r_timestep=None.
The embedder's delta branch is gated on _r_embedder_enabled AND r_timestep is not None.

The four explicit tests pin this down:

test_wan_arch_defaults_preserve_bit_identity
test_embedder_default_path_no_delta_module
test_embedder_default_path_is_bit_identical_to_legacy
test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy

Existing Wan-based pipelines (DMD2, Self-Forcing, KD, DFSFT, base T2V/I2V) stay byte-equal to main. ✓

Numerical parity (scripts/verify_anyflow_fastvideo_parity.py). Real assertions, not shape-only:

Forward rel-mean diff < 10% (bf16) → measured 2.55%.
Training-step loss rel-diff < 20% → measured 1.33%.
4-step Euler-flow sampling produces finite latents with std=0.799.
Strict-load on nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers: missing/unexpected = 0/0.

The sister FastGen port (NVlabs/FastGen#25) reporting 2.8% / 4.07% on the same checkpoint is a nice external sanity check.

Distributed correctness (gradient-step broadcast). AnyFlowMethod._broadcast_grad_step_index:

Rank 0 generates the random index; non-zero ranks allocate uninitialized memory of matching shape/dtype/device.
dist.broadcast(src=0) is called on all ranks before the .item() read.
Single-rank path falls through to local rand generation safely.

All ranks converge on the same grad_step before the rollout loop begins. ✓

Scheduler conformance. FlowMapEulerDiscreteScheduler correctly implements every abstract method of BaseScheduler (set_shift, set_timesteps, scale_model_input) plus the AnyFlow-specific step / add_noise / apply_shift / get_train_weight. The choice to skip diffusers' SchedulerMixin + ConfigMixin (unlike FlowUniPCMultistepScheduler) is documented in the module docstring and reasonable since this scheduler is only consumed by the AnyFlow training methods, not by a diffusers pipeline.

AGENTS.md compliance. New method goes under fastvideo/train/methods/distribution_matching/ (modern stack, not legacy fastvideo/training/). AnyFlowMethod subclasses DMD2Method (composition, not fork). Teacher/critic injected through role_models dict. ✓

Commit hygiene. Scanned all 14 commits — no AI co-author trailers, no "Generated with Claude Code" lines, no Claude / Opus / Sonnet / Anthropic / OpenAI / GPT / Codex strings. All subjects use the [prefix]: convention and are <72 chars. ✓

Test coverage. ~40 CPU unit tests across test_anyflow_pretrain.py (604 LoC) + test_anyflow_onpolicy.py (231 LoC) genuinely exercise the math (_StubStudent analytical central-difference check, scheduler formula assertions, sampling distribution partition counts). test_anyflow_smoke.py (130 LoC, GPU-gated) runs the torchrun entrypoint for 2 iters on both YAMLs.

Three small polish items (all S3 — non-blocking)

1. arXiv link typo (`2605` → `2405`)

docs/distillation/anyflow.md:3 and the PR body link to arxiv.org/abs/2605.13724 — should be 2405.13724. @gemini-code-assist already left an inline suggestion you can accept with one click. Worth updating the PR body too so the two strings don't drift.

2. Hardcoded user paths in helper scripts

scripts/verify_anyflow_fastvideo_parity.py and scripts/demo_anyflow_14b.py hardcode /home/guian/projects/anyflow/anyflow-{1.3b,ref,14b} and /home/guian/projects/anyflow/demo_videos. The srun --jobid=304 … docstring also references author-specific cluster state.

These are one-off reproducibility scripts, not a CLI surface, so this isn't a blocker. But a reviewer or downstream user running them will silently hit FileNotFoundError. Minimal fix: lift the four Path(...) constants to os.environ.get("ANYFLOW_LOCAL", "..."), or add a tiny scripts/README_anyflow.md documenting the expected layout. Either is fine.

3. Silent no-op when `r_embedder=True` but caller forgets `r_timestep` (future-proofing)

The docstring of WanModel.predict_velocity_with_r correctly calls out the silent fallback: "otherwise r_timestep is silently ignored by the embedder and the forward reduces to the single-timestep path." The same applies inside WanTimeTextImageEmbedding.forward — if _r_embedder_enabled=True but r_timestep=None, the delta branch is skipped without warning.

This is intentional (the whole bit-identity-preservation mechanism depends on it) and tested (test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy). However, if a future code-path uses predict_noise (rather than predict_velocity_with_r) on an AnyFlow-trained model, it will silently produce single-timestep predictions — fine for sanity but easy to misuse.

Optional followup (not blocking this PR): emit a one-shot logger.warning_once inside WanModel.predict_noise when self.transformer.condition_embedder._r_embedder_enabled is True, so future callers know they're using the wrong entrypoint for an AnyFlow checkpoint.
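A one-shot warning of the kind suggested here can be sketched with the standard `logging` module. This is only an illustration of the pattern: `warn_once`, `predict_noise_guard`, and the logger name are hypothetical, and FastVideo's own logger.warning_once helper is not used.

```python
import logging

logger = logging.getLogger("anyflow_sketch")
_already_warned = set()


def warn_once(key, message):
    """Log a warning the first time `key` is seen; return True if logged."""
    if key in _already_warned:
        return False
    _already_warned.add(key)
    logger.warning(message)
    return True


def predict_noise_guard(r_embedder_enabled):
    """Hypothetical check to run at the top of a single-timestep entrypoint."""
    if r_embedder_enabled:
        return warn_once(
            "predict_noise_on_r_embedder",
            "r_embedder is enabled but no r_timestep is passed; "
            "the delta branch is skipped on this path.",
        )
    return False
```

Keying the guard on a string keeps the default path free of any extra work when r_embedder is disabled.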

Suggested merge path

Accept gemini's inline arXiv-typo suggestion → wait for /test distillation Buildkite lane to go green → squash merge. The other two items can land as a followup PR or be dropped at your discretion.

The explicit bit-identity tests + the real numerical-parity gate script (with assertion thresholds, not just shape checks) made this PR substantially easier to verify than typical model-additions PRs of this size.

Review from @SolitaryThinker's agent Gob (an AI reviewer). Ping @SolitaryThinker if any finding is off, contradicts your intent, or applies to a stale rebase. Full review (including verification log) is archived locally and available on request.

SolitaryThinker

Thank you for the contribution!

SolitaryThinker · 2026-05-22T09:47:47Z

/merge

mergify · 2026-05-22T09:49:07Z

Pre-commit checks failed

Hi @Enderfga, the pre-commit checks have failed. To fix them locally:

# Install pre-commit if you haven't already
uv pip install pre-commit
pre-commit install

# Run all checks and auto-fix what's possible
pre-commit run --all-files

Common fixes:

yapf: yapf -i <file> (formatting)
ruff: ruff check --fix <file> (linting)
codespell: codespell --write-changes <file> (spelling)

After fixing, commit and push the changes. The checks will re-run automatically.

For future commits, pre-commit will run automatically on changed files before each commit.

- scripts/verify_anyflow_fastvideo_parity.py, scripts/demo_anyflow_14b.py: read ANYFLOW_LOCAL / ANYFLOW_REF / ANYFLOW_DEMO_OUT from the environment so the scripts are not tied to a single workstation layout. Drop the bespoke `srun` invocation example from the docstring in favour of a plain `python scripts/...` snippet.
- fastvideo/train/methods/distribution_matching/anyflow_pretrain.py: collapse the diff_mean if/else into a ternary so ruff SIM108 is happy.

Enderfga · 2026-05-22T10:13:44Z

@SolitaryThinker Quick pass on Gob's three S3 items in ef69c6f:

S3.1 (arxiv link). Keeping arxiv.org/abs/2605.13724 as-is — that is the correct ID. AnyFlow was posted to arxiv this month (2026-05), so the prefix is 2605, not 2405. The gemini suggestion pattern-matched to a different paper from 2024 with the same numeric suffix; full context in the inline thread reply.

S3.2 (hardcoded paths in scripts/). Lifted the four Path(...) constants in scripts/verify_anyflow_fastvideo_parity.py and scripts/demo_anyflow_14b.py onto env vars (ANYFLOW_LOCAL, ANYFLOW_REF, ANYFLOW_DEMO_OUT) with neutral defaults, and replaced the workstation-specific srun invocation example in each docstring with a plain PYTHONPATH=$PWD python scripts/... snippet so the repo no longer ships any contributor-specific layout.

S3.3 (silent no-op when r_embedder=True and r_timestep=None). Agreed this is worth a logger.warning_once in WanModel.predict_noise for future-proofing. Tracking as a followup PR rather than landing it here, since touching predict_noise requires a fresh pass over the bit-identity tests to keep the guards meaningful — happy to open it right after this merges.

Also folded a ruff SIM108 fix in anyflow_pretrain.py (the only finding in this PR's own code from the latest pre-commit run; the rest of the failures are a repo-wide --all-files --hook-stage manual sweep reformatting upstream files).
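For reference, the env-var pattern described under S3.2 looks roughly like the sketch below. The variable names ANYFLOW_LOCAL, ANYFLOW_REF, and ANYFLOW_DEMO_OUT come from this thread; the default paths and the function name `resolve_anyflow_paths` are placeholders invented for illustration and are not the defaults used in the scripts.

```python
import os
from pathlib import Path


def resolve_anyflow_paths(env=None):
    """Read script locations from the environment, with placeholder defaults."""
    env = os.environ if env is None else env
    return {
        # Default values below are hypothetical placeholders.
        "local": Path(env.get("ANYFLOW_LOCAL", "./anyflow-checkpoint")),
        "ref": Path(env.get("ANYFLOW_REF", "./anyflow-ref")),
        "demo_out": Path(env.get("ANYFLOW_DEMO_OUT", "./demo_videos")),
    }


paths = resolve_anyflow_paths({"ANYFLOW_LOCAL": "/tmp/ckpt"})
```

Passing a plain dict instead of os.environ makes the lookup easy to exercise in a unit test.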

Enderfga · 2026-05-22T15:25:07Z

@SolitaryThinker From my side this is ready to go — the path cleanup and the ruff SIM108 fix landed in ef69c6f, and the full Buildkite matrix is green on that commit (fastcheck, full-suite, SSIM, LoRA, all train-framework lanes).

The only thing blocking the mergify gate is the pre-commit GitHub Action, which is sitting in action_required status on the new commit: first-time-contributor workflows need a maintainer to click "Approve and run" in the Actions tab again after each push. If you can re-trigger it, this PR is good to merge from my end, even if the check still trips on the repo-wide --all-files --hook-stage manual sweep over upstream files, the same way it did on the previous green merge round.

Thanks for the patience on the back-and-forth.

mergify · 2026-05-22T16:26:11Z

Pre-commit checks failed


Enderfga added 14 commits May 19, 2026 14:20

[test] AnyFlow pretrain: cover (t, r) sampling distribution + validation

a6e84a9

github-actions Bot reviewed May 19, 2026

mergify Bot added labels on May 19, 2026: type: feat (New feature or capability); scope: training (Training pipeline, methods, configs); scope: infra (CI, tests, Docker, build); scope: docs (Documentation); scope: model (Model architecture (DiTs, encoders, VAEs))

gemini-code-assist Bot reviewed May 19, 2026

SolitaryThinker approved these changes May 22, 2026

github-actions Bot added the ready label (PR is ready to merge) on May 22, 2026

Enderfga wants to merge 15 commits into hao-ai-lab:main from Enderfga:add-anyflow-pretrain-onpolicy
