
[feat] Add AnyFlow any-step video distillation (pretrain + on-policy)#1371

Open
Enderfga wants to merge 15 commits into
hao-ai-lab:mainfrom
Enderfga:add-anyflow-pretrain-onpolicy

Conversation

@Enderfga

@Enderfga Enderfga commented May 19, 2026

Summary

Adds AnyFlow (paper, official code, model weights) — an any-step video diffusion framework built on flow maps — as a first-class distillation method in FastVideo, covering both training stages.

A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32, 50} without retraining; quality scales monotonically with steps, unlike consistency-based distillation which often degrades at higher NFE. The student network u_θ(x_t, t, r) predicts the average velocity from time t back to time r, so one Euler step is

x_r = x_t - ((t - r) / N) · u_θ(x_t, t, r), where N = num_train_timesteps.
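
The step above can be sketched in plain Python. This is a minimal illustration, not the scheduler shipped in this PR: `euler_flow_step`, `sample`, and `toy_velocity` are hypothetical names, and N = 1000 is an assumed value for `num_train_timesteps`.

```python
from typing import Callable, List, Sequence

Velocity = Callable[[List[float], float, float], List[float]]

NUM_TRAIN_TIMESTEPS = 1000  # assumed value of N; not stated in the PR text


def euler_flow_step(x_t: List[float], t: float, r: float, u: Velocity,
                    n: int = NUM_TRAIN_TIMESTEPS) -> List[float]:
    """One flow-map step: x_r = x_t - ((t - r) / N) * u(x_t, t, r)."""
    velocity = u(x_t, t, r)
    scale = (t - r) / n
    return [x - scale * v for x, v in zip(x_t, velocity)]


def sample(x_start: Sequence[float], timesteps: Sequence[float],
           u: Velocity) -> List[float]:
    """Walk a decreasing timestep list, stepping from each t to the next r."""
    x = list(x_start)
    for t, r in zip(timesteps[:-1], timesteps[1:]):
        x = euler_flow_step(x, t, r, u)
    return x


def toy_velocity(x_t: List[float], t: float, r: float) -> List[float]:
    # Stand-in for the student network: a constant average velocity.
    return [2.0 for _ in x_t]


# With a constant average velocity, 1 step and 4 steps land on the same point.
one_step = sample([1.0, -1.0], [1000, 0], toy_velocity)
four_step = sample([1.0, -1.0], [1000, 750, 500, 250, 0], toy_velocity)
```

The only thing that changes between NFE settings is the timestep list, which is why one checkpoint can serve every step count.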

Two new methods on top of FastVideo's TrainingMethod framework:

  • AnyFlowPretrainMethod (fastvideo/train/methods/distribution_matching/anyflow_pretrain.py) — flow-map pretrain via central-difference target with (t, r) per-batch sampling (50% diffusion / 25% consistency / 25% free).
  • AnyFlowMethod (fastvideo/train/methods/distribution_matching/anyflow.py) — on-policy multi-step Euler-flow rollout subclassing DMD2Method. One randomly-chosen step is gradient-enabled (broadcast from rank 0); the rest run under torch.no_grad.

Supporting changes:

  • WanTimeTextImageEmbedding gains an optional dual-timestep branch (r_embedder=True → allocate a delta_embedder deep-copied from time_embedder + register a non-persistent gate buffer). Two fusion modes:
    • additive (default): temb_t + g · delta_emb
    • gated: (1 - g) · temb_t + g · delta_emb (matches AnyFlow's WanTwoTimeTextImageEmbedding)
    • The additive default with no r_timestep passed is byte-identical to the legacy path — every existing config (DMD2, Self-Forcing, KD, DFSFT) stays bit-equal.
  • FlowMapEulerDiscreteScheduler (fastvideo/models/schedulers/) — standalone any-step Euler scheduler with apply_shift, get_train_weight(beta08), step(model_output, sample, t, r), add_noise. No diffusers ConfigMixin dependency.
  • WanModel.predict_velocity_with_r — adds an r_timestep to the transformer kwargs through the same set_forward_context plumbing as predict_noise.
  • WanVideoArchConfig.param_names_mapping gets two regex entries that rename condition_embedder.delta_embedder.* → condition_embedder.delta_embedder.mlp.fc_{in,out}.* so the published nvidia/AnyFlow-Wan2.1-T2V-*-Diffusers checkpoints load as-is through FastVideo's existing loader pipeline. Both regexes are no-ops on plain Wan checkpoints.
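
A rough sketch of how such a regex-based key remap could behave. The exact patterns in `WanVideoArchConfig.param_names_mapping` are not shown in this description, so the `linear_1` / `linear_2` source names below are an assumption (they follow the usual diffusers timestep-embedding layout), and `remap_key` is a hypothetical helper.

```python
import re

# Hypothetical patterns: the source-side names (linear_1 / linear_2) are assumed.
PARAM_NAMES_MAPPING = {
    r"^condition_embedder\.delta_embedder\.linear_1\.(.*)$":
        r"condition_embedder.delta_embedder.mlp.fc_in.\1",
    r"^condition_embedder\.delta_embedder\.linear_2\.(.*)$":
        r"condition_embedder.delta_embedder.mlp.fc_out.\1",
}


def remap_key(key: str) -> str:
    """Return the renamed key, or the key unchanged if no pattern matches."""
    for pattern, replacement in PARAM_NAMES_MAPPING.items():
        new_key, count = re.subn(pattern, replacement, key)
        if count:
            return new_key
    return key


anyflow_key = remap_key("condition_embedder.delta_embedder.linear_1.weight")
plain_wan_key = remap_key("condition_embedder.time_embedder.linear_1.weight")
```

Because the patterns are anchored on `delta_embedder`, keys from a plain Wan checkpoint pass through untouched, which is the no-op behaviour described above.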

Two reference YAMLs under examples/train/configs/distribution_matching/wan/:

  • anyflow_pretrain_t2v.yaml — Wan-T2V-1.3B pretrain, r_embedder: true, gate=0.25, shift=5.0, epsilon=5, weight_type=beta08, fuse_guidance_scale=3.0, lr 5e-5, 6k steps.
  • anyflow_onpolicy_t2v.yaml — Wan-T2V-1.3B on-policy with Wan-14B teacher, dmd_denoising_steps=[999, 937, 833, 624], t_list_override=[999, 937, 833, 624, 0], student_sample_steps=4, use_mean_velocity=true, lr 2e-6.

Algorithm + usage write-up at docs/distillation/anyflow.md (registered in mkdocs.yml nav).

Numerical verification

End-to-end parity verified on a single H200 against NVlabs/AnyFlow's reference loader on the published 1.3B checkpoint (nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers). Repro: scripts/verify_anyflow_fastvideo_parity.py.

| Check | Metric | Result |
|---|---|---|
| Checkpoint key remap | missing / unexpected | 0 / 0 |
| Forward parity | mean abs diff | 1.00e-2 (bf16) |
| Forward parity | max abs diff | 7.81e-2 (bf16) |
| Forward parity | rel mean diff | 2.55% |
| 4-step Euler-flow sampling | final latent stats | finite, std=0.799, range [-3.97, +3.91] |
| Training-step loss | AnyFlow reference | 0.381619 |
| Training-step loss | FastVideo port | 0.386694 |
| Training-step loss | rel diff | 1.33% |
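
For readers checking the numbers, here is a small sketch of how parity metrics of this kind can be computed. The helper names are hypothetical, and the definition of "rel mean diff" (mean absolute difference divided by mean absolute reference value) is an assumption, since the parity script itself is not reproduced here. The training-loss figure follows directly from the two losses reported above.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def rel_mean_diff(a: Sequence[float], ref: Sequence[float]) -> float:
    # Assumed definition: mean |a - ref| normalised by mean |ref|.
    return mean_abs_diff(a, ref) / (sum(abs(y) for y in ref) / len(ref))


def rel_diff(value: float, ref: float) -> float:
    return abs(value - ref) / abs(ref)


# Training-step losses reported in the parity results above.
loss_rel = rel_diff(0.386694, 0.381619)
loss_rel_percent = round(loss_rel * 100, 2)

# Tiny made-up vectors, only to exercise the helpers.
toy_port, toy_ref = [1.0, 2.0], [1.0, 2.5]
toy_stats = (mean_abs_diff(toy_port, toy_ref),
             max_abs_diff(toy_port, toy_ref),
             rel_mean_diff(toy_port, toy_ref))
```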

For comparison, the sister FastGen port (NVlabs/FastGen#25) reports 2.8% forward / 4.07% training-loss on the same checkpoint — FastVideo's slightly tighter result is consistent with its attention/normalization implementation having marginally lower kernel noise on H200.

Sample videos

Generated end-to-end through FastVideo with the new FlowMapEulerDiscreteScheduler: nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers, 81 frames @ 480×832, seed=0, shift=5.0, guidance_scale=1.0 (no CFG — the on-policy distilled checkpoint has fuse_guidance_scale=3.0 baked into the weights, matching AnyFlow's official demo.py default).

Same prompt as AnyFlow's demo.py (the "majestic elephant running towards a herd" prompt). Same model, two NFE settings:

  • 14B NFE=4 (~30 s on a single H200):
fastvideo_anyflow_14b_nfe4_v1.mp4
  • 14B NFE=50 (~5 min on a single H200):
fastvideo_anyflow_14b_nfe50_v1.mp4

Quality scales monotonically with NFE — NFE=4 is already coherent, NFE=50 produces sharper textures and more stable motion.

Test plan

  • pytest fastvideo/tests/training/distill/test_anyflow_{pretrain,onpolicy}.py — 43 CPU unit tests (Xavier-init helper for ReplicatedLinear so the embedder forward produces finite outputs in CPU mode); covers (t, r) sampling, central-difference target math, scheduler numerics, embedder bit-identity in additive mode, on-policy rollout gradient masking, t_list_override validation, checkpoint key remap.
  • Forward + training-step + sampling parity on Wan2.1-T2V-1.3B — see table above; script at scripts/verify_anyflow_fastvideo_parity.py.
  • 14B inference end-to-end at NFE=4 and NFE=50 — see videos above; script at scripts/demo_anyflow_14b.py.
  • /test distillation — GPU smoke on Buildkite (pretrain + on-policy, 2 iters each via fastvideo/tests/training/distill/test_anyflow_smoke.py).

Compatibility

The default r_embedder=False arch config keeps every existing Wan-based method byte-identical to main — no delta_embedder allocated, no extra forward computation, no state_dict additions. Existing CI (DMD2, Self-Forcing, KD, DFSFT) should be unaffected.

Out of scope

  • The FAR causal variant from the paper. The bidirectional Wan-T2V backbone supports the full any-step recipe; the FAR variant can be added in a follow-up PR if there's interest.
  • LoRA-only training mode. AnyFlow's reference repo supports LoRA adapters on the student / real_score / discriminator. Wiring LoRA into FastVideo's role models cleanly is a focused follow-up rather than something to wedge into the core algorithm port.

Related PRs

Enderfga added 14 commits May 19, 2026 14:20
…tity)

Adds four fields to WanVideoArchConfig for AnyFlow dual-timestep conditioning:
- r_embedder: bool — enables the delta_embedder allocation
- r_embedder_fusion: 'additive' (default) or 'gated'
- r_embedder_gate_value: float — gate g in gated fusion
- r_embedder_deltatime_type: 'r' (default) or 't-r'

Also adds two regex entries to param_names_mapping so HF AnyFlow checkpoints
load delta_embedder weights into the FastVideo-internal mlp.fc_in/fc_out layout.
Both regexes are no-ops on plain Wan checkpoints.

The defaults preserve bit-identity with every existing Wan-based method
(DMD2, Self-Forcing, KD, DFSFT) — no delta_embedder is allocated and no
extra computation runs on the embedder forward path.
When r_embedder=True, allocates a delta_embedder (deep-copied from
time_embedder for matching initialization, per the AnyFlow reference's
setup_flowmap_model() pattern) and registers a non-persistent gate buffer.

Forward now accepts an optional r_timestep:
- additive (default): temb_t + g * delta_emb
- gated: (1 - g) * temb_t + g * delta_emb

Both branches are gated by r_embedder=True AND r_timestep is not None,
so existing call sites that don't pass r_timestep stay byte-identical
(verified by test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy).

The gate is a non-persistent buffer — it's a hyperparameter, not a
learned weight, and stays out of state_dict so checkpoints remain
portable across gate values.

deltatime_type controls whether the delta_embedder consumes r directly
(default, matching AnyFlow's deltatime_type='r') or (t - r).
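
The two fusion modes and the `deltatime_type` switch can be illustrated on plain lists. This is a sketch under stated assumptions: `fuse_timestep_embeddings` and `delta_input` are hypothetical helpers, the embeddings are toy vectors, and passing `delta_emb=None` stands in for the case where `r_embedder` is disabled or `r_timestep` is omitted.

```python
from typing import List, Optional


def delta_input(t: float, r: float, deltatime_type: str = "r") -> float:
    """Value fed to the delta embedder: r itself, or the interval t - r."""
    if deltatime_type == "r":
        return r
    if deltatime_type == "t-r":
        return t - r
    raise ValueError(f"unknown deltatime_type: {deltatime_type!r}")


def fuse_timestep_embeddings(temb_t: List[float],
                             delta_emb: Optional[List[float]],
                             gate: float,
                             mode: str = "additive") -> List[float]:
    """Combine the t embedding with the delta embedding."""
    if delta_emb is None:
        # Legacy single-timestep path: the t embedding passes through untouched.
        return list(temb_t)
    if mode == "additive":
        return [a + gate * d for a, d in zip(temb_t, delta_emb)]
    if mode == "gated":
        return [(1.0 - gate) * a + gate * d for a, d in zip(temb_t, delta_emb)]
    raise ValueError(f"unknown fusion mode: {mode!r}")


temb, delta = [1.0, 2.0], [4.0, 8.0]
additive = fuse_timestep_embeddings(temb, delta, gate=0.25, mode="additive")
gated = fuse_timestep_embeddings(temb, delta, gate=0.25, mode="gated")
legacy = fuse_timestep_embeddings(temb, None, gate=0.25)
```
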
The forward signature now accepts r_timestep as an explicit kwarg (not
silently swallowed by **kwargs), gets flattened in parallel with timestep
when the input is 2-D (Wan 2.2 ti2v case), and is forwarded to the
condition_embedder.

The constructor now propagates the four arch_config flags
(r_embedder/r_embedder_fusion/r_embedder_gate_value/r_embedder_deltatime_type)
into the WanTimeTextImageEmbedding so a YAML can opt into AnyFlow's
dual-timestep conditioning through training.dit_config overrides.

Defaults preserve byte-identity with the legacy single-timestep path —
when r_timestep is omitted the delta_embedder branch is skipped entirely
(test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy).
Standalone any-step Euler scheduler implementing AnyFlow's flow-map step
formula x_r = x_t - ((t - r) / num_train_timesteps) * u(x_t, t, r). Also
provides the AnyFlow training-time helpers apply_shift (flow-matching
shift transform), get_train_weight (per-timestep loss weight with
beta08 = t * sqrt(1-t) renormalized to num_train_timesteps total mass),
and add_noise (linear flow-matching interpolation).

Subclasses BaseScheduler so it slots into the existing FastVideo
scheduler discovery surface (timesteps/order/num_train_timesteps
attributes plus set_shift/set_timesteps/scale_model_input). Has no
diffusers ConfigMixin/SchedulerMixin dependency.

set_timesteps accepts custom_timesteps so configs can pin AnyFlow's
hand-tuned schedules (e.g. [999, 937, 833, 624, 0] for the 4-step
Wan2.1 setting from the paper).
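
A minimal sketch of the three training-time helpers, written on normalized time t in [0, 1]. The beta08 weight and the linear interpolation follow the description above; the shift formula shift·t / (1 + (shift − 1)·t) is the common flow-matching shift and is an assumption here, as are the function signatures.

```python
import math
from typing import List, Sequence


def apply_shift(t: float, shift: float) -> float:
    # Assumed form of the flow-matching shift transform.
    return shift * t / (1.0 + (shift - 1.0) * t)


def add_noise(x0: Sequence[float], eps: Sequence[float], t: float) -> List[float]:
    """Linear flow-matching interpolation: x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def beta08_weights(ts: Sequence[float], num_train_timesteps: int) -> List[float]:
    """Raw weight t * sqrt(1 - t), rescaled so the weights sum to num_train_timesteps."""
    raw = [t * math.sqrt(1.0 - t) for t in ts]
    total = sum(raw)
    return [w * num_train_timesteps / total for w in raw]


shifted_mid = apply_shift(0.5, shift=5.0)            # pushes t toward the noisy end
noisy = add_noise([0.0, 0.0], [1.0, 1.0], t=0.25)
grid = [(i + 1) / 10 for i in range(10)]             # toy 10-point timestep grid
weights = beta08_weights(grid, num_train_timesteps=10)
```
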
Adds the single-student AnyFlow flow-map pretrain method as a
TrainingMethod subclass. The __init__ parses and validates all method
config knobs (diffusion_ratio, consistency_ratio, epsilon, weight_type,
fuse_guidance_scale, shift), builds a FlowMapEulerDiscreteScheduler, and
wires a single optimizer+scheduler over the student.

The (t, r) sampling helper _sample_pair_timesteps lives at module scope
so it can be unit-tested without instantiating the full method. It
matches the AnyFlow paper formulation:

  - two uniform draws u1, u2; t = max, r = min
  - first diffusion_ratio * B: r := t (plain flow matching)
  - next consistency_ratio * B: r := 0 (consistency to clean data)
  - remainder: free reconstruction range

single_train_step raises NotImplementedError for now — filled in by the
next commit which adds the central-difference target and the loss
assembly (Task 7).
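
The sampling rule listed above can be sketched as follows. `sample_pair_timesteps` here is an illustrative stand-in for the module-level helper, not its actual code: it works on Python floats rather than tensors, and truncating the partition sizes with `int()` is an assumption.

```python
import random
from typing import List, Optional, Tuple


def sample_pair_timesteps(batch_size: int,
                          diffusion_ratio: float = 0.5,
                          consistency_ratio: float = 0.25,
                          rng: Optional[random.Random] = None
                          ) -> List[Tuple[float, float]]:
    """Draw (t, r) pairs in [0, 1] with t >= r, split into three regimes."""
    rng = rng or random.Random()
    n_diffusion = int(diffusion_ratio * batch_size)      # assumed rounding
    n_consistency = int(consistency_ratio * batch_size)  # assumed rounding
    pairs = []
    for i in range(batch_size):
        u1, u2 = rng.random(), rng.random()
        t, r = max(u1, u2), min(u1, u2)
        if i < n_diffusion:
            r = t          # plain flow matching: zero-length interval
        elif i < n_diffusion + n_consistency:
            r = 0.0        # consistency: jump all the way to clean data
        # remaining samples keep the free (t, r) pair
        pairs.append((t, r))
    return pairs


pairs = sample_pair_timesteps(8, rng=random.Random(0))
```
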
…n_step

Implements AnyFlow's central-difference loss path end-to-end:

1. WanModel gains predict_velocity_with_r(noisy, t, r, batch, ...) — mirrors
   predict_noise's forward_context + autocast plumbing but injects
   r_timestep into the transformer kwargs.

2. anyflow_pretrain._central_difference_dF_dt — symmetric finite difference
   over (t ± delta) with the sample also moved along the flow trajectory by
   v_pred * (delta / num_train_timesteps), mirroring AnyFlow's reference
   trainer_wan_anyflow_pretrain.py::compute_central_difference. Wrapped in
   torch.no_grad so the two extra forwards stay out of the backward graph.

3. AnyFlowPretrainMethod.single_train_step:
   - sample (t, r) ∈ [0, 1] via _sample_pair_timesteps
   - apply scheduler shift + scale to absolute units
   - rebuild noisy with the flow-map scheduler's add_noise
   - run conditional + (optional) unconditional student forwards
   - apply guidance distillation when fuse_guidance_scale != 1
   - compute target = (eps - x0) - (t - r) * dF/dt
   - per-sample MSE * per-timestep weight
   - stop-grad scale balance so non-diffusion losses match the diffusion
     branch's magnitude
   - emit metrics for diffusion/consistency fractions + scale weight mean

4. AnyFlowPretrainMethod.backward overrides TrainingMethod.backward to
   route the loss through self.student.backward inside the correct
   forward_context, matching DMD2Method's pattern.
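
The central-difference target in steps 2 and 3 can be checked on a toy model. The sketch below uses normalized time instead of absolute timestep units, a hand-written `stub_student` in place of the network, and hypothetical helper names; it is meant to show the arithmetic, not the PR's implementation.

```python
from typing import Callable, List, Sequence

Student = Callable[[List[float], float, float], List[float]]


def central_difference_dF_dt(u: Student, x_t: Sequence[float], v: Sequence[float],
                             t: float, r: float, delta: float) -> List[float]:
    """Symmetric finite difference of u along the trajectory dx/dt = v."""
    x_plus = [x + vi * delta for x, vi in zip(x_t, v)]
    x_minus = [x - vi * delta for x, vi in zip(x_t, v)]
    f_plus = u(x_plus, t + delta, r)
    f_minus = u(x_minus, t - delta, r)
    return [(p - m) / (2.0 * delta) for p, m in zip(f_plus, f_minus)]


def flow_map_target(x0: Sequence[float], eps: Sequence[float],
                    t: float, r: float, dF_dt: Sequence[float]) -> List[float]:
    """target = (eps - x0) - (t - r) * dF/dt."""
    return [(e - a) - (t - r) * d for a, e, d in zip(x0, eps, dF_dt)]


def stub_student(x: List[float], t: float, r: float) -> List[float]:
    # Linear toy model, so the total derivative along the path is 2 * v + 3.
    return [2.0 * xi + 3.0 * t for xi in x]


x0, eps, t, r = [0.0], [1.0], 0.8, 0.3
x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
v = [e - a for a, e in zip(x0, eps)]
dF_dt = central_difference_dF_dt(stub_student, x_t, v, t, r, delta=0.005)
target = flow_map_target(x0, eps, t, r, dF_dt)
```

When r equals t the second term vanishes and the target reduces to the plain flow-matching velocity eps − x0, which is why the diffusion share of the batch behaves like ordinary flow matching.
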
Subclasses DMD2Method and overrides _student_rollout to run
student_sample_steps Euler-flow steps from pure noise. One step is
gradient-enabled, with the chosen index broadcast from rank 0 in
distributed runs so every worker agrees (matching AnyFlow's
WanAnyFlowPipeline.training_rollout).

Method config knobs:
- student_sample_steps (default 4)
- use_mean_velocity (default True; r = t_next vs r = t at each step)
- t_list_override (optional pinned schedule, e.g. AnyFlow paper's
  [999, 937, 833, 624, 0] for 4-step Wan2.1)
- dmd_score_r_value (default 0.0; r used by the inherited DMD loss)
- real_score_guidance_scale (inherited from DMD2; default 1.0)

The inherited _dmd_loss / _critic_flow_matching_loss machinery is reused
verbatim — they call predict_x0 / predict_noise on the student, which
go through the single-timestep WanModel forward path (since DMD2 scores
against r=0 implicitly via the inherited code; if r=t is desired this
can be wired in a follow-up by overriding _dmd_loss to call
predict_velocity_with_r).
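
To make the rollout structure concrete, here is a torch-free sketch that only plans the steps: which (t, r) pair conditions each student call and which single step keeps gradients. `validate_t_list` and `rollout_plan` are hypothetical helpers, and the validation rules (length, strictly decreasing, ending at 0) are assumptions about what `t_list_override` validation checks. In a distributed run the random index would be drawn on rank 0 and broadcast; a seeded local draw stands in for that here.

```python
import random
from typing import List, Sequence, Tuple


def validate_t_list(t_list: Sequence[int], student_sample_steps: int) -> None:
    # Assumed rules; the PR's actual validation is not shown in this thread.
    if len(t_list) != student_sample_steps + 1:
        raise ValueError("t_list needs student_sample_steps + 1 entries")
    if any(a <= b for a, b in zip(t_list, t_list[1:])):
        raise ValueError("t_list must be strictly decreasing")
    if t_list[-1] != 0:
        raise ValueError("t_list must end at 0")


def rollout_plan(t_list: Sequence[int], use_mean_velocity: bool,
                 grad_step: int) -> List[Tuple[int, int, bool]]:
    """Return (t, r, grad_enabled) for every Euler-flow step of the rollout."""
    plan = []
    for i, (t, t_next) in enumerate(zip(t_list[:-1], t_list[1:])):
        r = t_next if use_mean_velocity else t  # second timestep fed to the student
        plan.append((t, r, i == grad_step))
    return plan


t_list = [999, 937, 833, 624, 0]
validate_t_list(t_list, student_sample_steps=4)
grad_step = random.Random(0).randrange(len(t_list) - 1)  # stand-in for rank-0 draw
plan = rollout_plan(t_list, use_mean_velocity=True, grad_step=grad_step)
fixed_plan = rollout_plan(t_list, use_mean_velocity=True, grad_step=2)
```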

CPU-only tests via object.__new__ bypassing the full __init__ wire-up;
GPU integration coverage lives in the smoke test (Task 11).
Two reference configs slot into examples/train/configs/distribution_matching/wan/:

- anyflow_pretrain_t2v.yaml: single-student pretrain stage. Enables
  r_embedder under pipeline.dit_config, sets diffusion/consistency
  ratios, epsilon=5, weight_type=beta08, fuse_guidance_scale=3.0, lr 5e-5,
  6k steps, Wan2.1 T2V 1.3B init.

- anyflow_onpolicy_t2v.yaml: full DMD2 trio (student + Wan2.1 14B teacher
  + Wan2.1 1.3B critic). Pins the 4-step rollout schedule
  [999, 937, 833, 624, 0] from the AnyFlow paper, use_mean_velocity=true,
  generator_update_interval=5, lr 2e-6, 4k steps. The student loads from
  a path placeholder which can be either the local pretrain output or
  the published nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers checkpoint —
  the latter case relies on the param_names_mapping regex in
  WanVideoArchConfig to rename delta_embedder weights.
Mirrors test_distill_dmd.py's subprocess-torchrun pattern but points at
the new YAML entrypoint (fastvideo.train.entrypoint.train). Auto-skipped
when fewer than 2 CUDA devices are visible — the new TrainingMethod
framework's distributed barriers don't tolerate single-GPU bring-up.

On Buildkite the test fires under /test distillation; on GMI we can
run it directly with NUM_GPUS=2 NUM_NODES=1 pytest -v -k anyflow_smoke.
Single-page summary of AnyFlow's two-stage recipe:
- Algorithm intuition (u_θ(x_t, t, r) → any-step Euler)
- Stage 1 pretrain: central-difference target + (t, r) sampling
- Stage 2 on-policy DMD: multi-step rollout with grad-step broadcast
- Launch commands for both YAMLs
- Note on loading nvidia/AnyFlow-Wan2.1-T2V-* checkpoints (handled by
  param_names_mapping, no separate adapter)
- Note on fuse_guidance_scale parameterization

Registered under mkdocs nav so it surfaces in the built doc tree.
…r CPU runs

FastVideo's ReplicatedLinear allocates weights via torch.empty and
relies on a downstream load_weights pass to populate them. CPU unit
tests bypass that pass, so weights start as NaN/Inf and the embedder
forward produces NaN everywhere. Add _init_uninitialized_weights that
Xavier-init's every >=2D param and zeros the rest before .eval().

Verified on GMI (gpu-h200-06 with 8x H200): all 43 tests pass.
Six-stage end-to-end verification, run via single-rank torchrun-less
srun on a single H200:

(1) Build FastVideo WanTransformer3DModel with r_embedder=True,
    r_embedder_fusion=gated, gate=0.25.
(2) Load nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers safetensors and
    translate keys via WanVideoArchConfig.param_names_mapping
    (0 missing / 0 unexpected — the delta_embedder regex is sufficient).
(3) Build AnyFlow's reference loader (FAR_Wan_Transformer3DModel).
(4) Forward parity on identical inputs — bf16 noise.
(5) 4-step Euler-flow sampling smoke via FlowMapEulerDiscreteScheduler.
(6) Training-step central-difference loss comparison (inline replica
    of AnyFlow's train_bidirection).

Measured on Wan2.1-T2V-1.3B + nvidia/AnyFlow checkpoint:
  forward rel mean diff : 2.55%
  forward max abs diff  : 7.81e-2
  training loss diff    : 1.33% (AnyFlow 0.381619 vs FastVideo 0.386694)

Both within bf16 kernel noise. Compare to the FastGen port at
NVlabs/FastGen#25 which reported 2.8% forward + 4.07% training-loss
on the same checkpoint — FastVideo's tighter result is consistent
with FastVideo's attention/normalization implementation having slightly
lower kernel noise on H200 than FastGen's.
Single-rank demo script that loads
nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers into FastVideo's
WanTransformer3DModel via param_names_mapping, then samples 81 frames
at 480x832 with the new FlowMapEulerDiscreteScheduler at both NFE=4
(~30s) and NFE=50 (~5min) and decodes each via the Wan VAE (tiling
enabled). Uses guidance_scale=1.0 since the on-policy distilled
checkpoint has fuse_guidance_scale=3.0 baked into the weights, matching
the AnyFlow paper's official demo.py default.

Memory tactics for single H200 (141 GB HBM):
- Encode prompts with UMT5 first, free the text encoder.
- Build 14B transformer (~28 GB bf16), load AnyFlow shards.
- Sample at both NFEs, free transformer, then VAE decode with tiling.

Peak GPU usage ~57 GB on a single H200.

@github-actions github-actions Bot left a comment


Welcome to FastVideo! Thanks for your first pull request.

How our CI works:

PRs run a three-tier CI system:

  1. Pre-commit — formatting (yapf), linting (ruff), type checking (mypy). Runs immediately on every PR.
  2. Fastcheck — core GPU tests (encoders, VAEs, transformers, kernels, unit tests). Runs automatically via Buildkite on relevant file changes (~10-15 min).
  3. Full Suite — integration tests, training pipelines, SSIM regression. Runs only when a reviewer adds the ready label.

Before your PR is reviewed:

  • pre-commit run --all-files passes locally
  • You've added or updated tests for your changes
  • The PR description explains what and why

If pre-commit fails, a bot comment will explain how to fix it. Fastcheck and Full Suite results appear in the Checks section below.


@mergify mergify Bot added type: feat New feature or capability scope: training Training pipeline, methods, configs scope: infra CI, tests, Docker, build scope: docs Documentation scope: model Model architecture (DiTs, encoders, VAEs) labels May 19, 2026
@mergify
Contributor

mergify Bot commented May 19, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 PR merge requirements

Waiting for

  • check-success~=pre-commit
This rule is failing.
  • check-success~=pre-commit
  • #approved-reviews-by>=1
  • check-success=fastcheck-passed
  • check-success=full-suite-passed
  • title~=(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|chore|kernel|new.?model|skill|skills|infra)\]
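
The title rule is an ordinary regular expression, so it can be checked locally before pushing. A small sketch, assuming Mergify's `~=` operator behaves like a regex search and that Python's `re` dialect is close enough for this pattern; `title_ok` is a hypothetical helper.

```python
import re

# Pattern copied from the title rule above.
TITLE_RULE = (r"(?i)^\[(feat|feature|bugfix|fix|refactor|perf|ci|doc|docs|misc|"
              r"chore|kernel|new.?model|skill|skills|infra)\]")


def title_ok(title: str) -> bool:
    """True when the PR title starts with an accepted [prefix]."""
    return re.search(TITLE_RULE, title) is not None


this_pr = title_ok("[feat] Add AnyFlow any-step video distillation (pretrain + on-policy)")
no_prefix = title_ok("Add AnyFlow any-step video distillation")
```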

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements the AnyFlow any-step video distillation framework, adding dual-timestep conditioning to the Wan model, a new FlowMapEulerDiscreteScheduler, and training methods for pre-training and on-policy distillation. The PR includes comprehensive tests and documentation. A review comment correctly identified and provided a fix for a typo in the arXiv link for the AnyFlow paper.

@@ -0,0 +1,116 @@
# 🌊 AnyFlow Any-Step Video Distillation

**AnyFlow** ([paper](https://arxiv.org/abs/2605.13724), [project page](https://nvlabs.github.io/AnyFlow/), [official code](https://github.com/NVlabs/AnyFlow), [model weights](https://huggingface.co/collections/nvidia/anyflow)) is an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32} without retraining, and quality scales **monotonically** with steps — unlike consistency-based distillation, which often degrades as NFE grows.
Contributor


medium

The link to the AnyFlow paper appears to have a typo. The year-month part 2605 is likely incorrect for a 2024 paper. The correct link should probably point to arxiv.org/abs/2405.13724.

Suggested change
- **AnyFlow** ([paper](https://arxiv.org/abs/2605.13724), [project page](https://nvlabs.github.io/AnyFlow/), [official code](https://github.com/NVlabs/AnyFlow), [model weights](https://huggingface.co/collections/nvidia/anyflow)) is an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32} without retraining, and quality scales **monotonically** with steps — unlike consistency-based distillation, which often degrades as NFE grows.
+ **AnyFlow** ([paper](https://arxiv.org/abs/2405.13724), [project page](https://nvlabs.github.io/AnyFlow/), [official code](https://github.com/NVlabs/AnyFlow), [model weights](https://huggingface.co/collections/nvidia/anyflow)) is an any-step video diffusion framework built on flow maps. A single distilled checkpoint can be evaluated at NFE ∈ {1, 2, 4, 8, 16, 32} without retraining, and quality scales **monotonically** with steps — unlike consistency-based distillation, which often degrades as NFE grows.

Author


The arxiv ID is correct as written. AnyFlow was posted to arxiv this month (2026-05), so the prefix is 2605, not 2405. https://arxiv.org/abs/2605.13724 resolves to the right paper; https://arxiv.org/abs/2405.13724 is a different (2024) submission that this suggestion has been pattern-matched to.

@SolitaryThinker
Collaborator

Hi @Enderfga — this is a code review from one of @SolitaryThinker's AI reviewer agents (Gob). I run these to help triage PRs but @SolitaryThinker hasn't personally verified every finding. If anything below doesn't match what you know about the code, please ping @SolitaryThinker — they'll take a closer look.

TL;DR

The PR is in unusually good shape for its size (2,932 LoC). Bit-identity defaults are properly guarded and explicitly tested, numerical parity vs the upstream NVlabs/AnyFlow reference is asserted with real thresholds (2.55% / 1.33% rel-diff on bf16), distributed gradient-step broadcast is correct, scheduler conforms to BaseScheduler ABC, all 14 commits are clean of AI co-author trailers. Three minor S3 polish items only.

Verdict: approve-with-followup

  • S0 (blockers): 0
  • S1 (must-fix): 0
  • S2 (should-fix): 0
  • S3 (discussion): 3 — surfaced below since all findings are S3 and none are blocking; see review.md for full detail.

What I checked

Default-path bit-identity (the central claim). The r_embedder=False default is properly guarded everywhere:

  • WanTimeTextImageEmbedding.__init__ sets self.delta_embedder = None as a plain Python attribute (not a registered submodule) — no state-dict entry.
  • The _r_embedder_gate buffer is only registered inside the if self._r_embedder_enabled branch (and registered with persistent=False, so AnyFlow checkpoints stay portable across different gate hyperparameters).
  • The new r_timestep is not None and r_timestep.dim() == 2 flatten branch in WanTransformer3DModel.forward is a no-op when r_timestep=None.
  • The embedder's delta branch is gated on _r_embedder_enabled AND r_timestep is not None.

The four explicit tests pin this down:

  • test_wan_arch_defaults_preserve_bit_identity
  • test_embedder_default_path_no_delta_module
  • test_embedder_default_path_is_bit_identical_to_legacy
  • test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy

Existing Wan-based pipelines (DMD2, Self-Forcing, KD, DFSFT, base T2V/I2V) stay byte-equal to main. ✓

Numerical parity (scripts/verify_anyflow_fastvideo_parity.py). Real assertions, not shape-only:

  • Forward rel-mean diff < 10% (bf16) → measured 2.55%.
  • Training-step loss rel-diff < 20% → measured 1.33%.
  • 4-step Euler-flow sampling produces finite latents with std=0.799.
  • Strict-load on nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers: missing/unexpected = 0/0.

The sister FastGen port (NVlabs/FastGen#25) reporting 2.8% / 4.07% on the same checkpoint is a nice external sanity check.

Distributed correctness (gradient-step broadcast). AnyFlowMethod._broadcast_grad_step_index:

  • Rank 0 generates the random index; non-zero ranks allocate uninitialized memory of matching shape/dtype/device.
  • dist.broadcast(src=0) is called on all ranks before the .item() read.
  • Single-rank path falls through to local rand generation safely.

All ranks converge on the same grad_step before the rollout loop begins. ✓

Scheduler conformance. FlowMapEulerDiscreteScheduler correctly implements every abstract method of BaseScheduler (set_shift, set_timesteps, scale_model_input) plus the AnyFlow-specific step / add_noise / apply_shift / get_train_weight. The choice to skip diffusers' SchedulerMixin + ConfigMixin (unlike FlowUniPCMultistepScheduler) is documented in the module docstring and reasonable since this scheduler is only consumed by the AnyFlow training methods, not by a diffusers pipeline.

AGENTS.md compliance. New method goes under fastvideo/train/methods/distribution_matching/ (modern stack, not legacy fastvideo/training/). AnyFlowMethod subclasses DMD2Method (composition, not fork). Teacher/critic injected through role_models dict. ✓

Commit hygiene. Scanned all 14 commits — no AI co-author trailers, no "Generated with Claude Code" lines, no Claude / Opus / Sonnet / Anthropic / OpenAI / GPT / Codex strings. All subjects use the [prefix]: convention and are <72 chars. ✓

Test coverage. ~40 CPU unit tests across test_anyflow_pretrain.py (604 LoC) + test_anyflow_onpolicy.py (231 LoC) genuinely exercise the math (_StubStudent analytical central-difference check, scheduler formula assertions, sampling distribution partition counts). test_anyflow_smoke.py (130 LoC, GPU-gated) runs the torchrun entrypoint for 2 iters on both YAMLs.


Three small polish items (all S3 — non-blocking)

1. arXiv link typo (2605 → 2405)

docs/distillation/anyflow.md:3 and the PR body link to arxiv.org/abs/2605.13724 — should be 2405.13724. @gemini-code-assist already left an inline suggestion you can accept with one click. Worth updating the PR body too so the two strings don't drift.

2. Hardcoded user paths in helper scripts

scripts/verify_anyflow_fastvideo_parity.py and scripts/demo_anyflow_14b.py hardcode /home/guian/projects/anyflow/anyflow-{1.3b,ref,14b} and /home/guian/projects/anyflow/demo_videos. The srun --jobid=304 … docstring also references author-specific cluster state.

These are one-off reproducibility scripts, not a CLI surface, so this isn't a blocker. But a reviewer or downstream user running them will silently hit FileNotFoundError. Minimal fix: lift the four Path(...) constants to os.environ.get("ANYFLOW_LOCAL", "..."), or add a tiny scripts/README_anyflow.md documenting the expected layout. Either is fine.
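
One possible shape for the environment-variable approach, as a sketch only. The variable names match the ones used later in this thread (ANYFLOW_LOCAL, ANYFLOW_REF, ANYFLOW_DEMO_OUT); the default paths and the helpers `resolve_path` and `require_dir` are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_path(env_var: str, default: str,
                 environ: Optional[Mapping[str, str]] = None) -> Path:
    """Read a path from the environment, falling back to a neutral default."""
    environ = os.environ if environ is None else environ
    return Path(environ.get(env_var, default)).expanduser()


def require_dir(path: Path, env_var: str) -> Path:
    """Fail early with a hint instead of a bare FileNotFoundError later on."""
    if not path.is_dir():
        raise FileNotFoundError(
            f"{path} does not exist; set {env_var} to the right directory")
    return path


# Hypothetical defaults, relative to the repository root.
ANYFLOW_LOCAL = resolve_path("ANYFLOW_LOCAL", "./checkpoints/anyflow")
ANYFLOW_REF = resolve_path("ANYFLOW_REF", "./third_party/anyflow")
ANYFLOW_DEMO_OUT = resolve_path("ANYFLOW_DEMO_OUT", "./outputs/anyflow_demo")
```

Calling `require_dir(ANYFLOW_LOCAL, "ANYFLOW_LOCAL")` at the top of a script turns a missing checkpoint directory into an actionable message.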

3. Silent no-op when r_embedder=True but caller forgets r_timestep (future-proofing)

The docstring of WanModel.predict_velocity_with_r correctly calls out the silent fallback: "otherwise r_timestep is silently ignored by the embedder and the forward reduces to the single-timestep path." The same applies inside WanTimeTextImageEmbedding.forward — if _r_embedder_enabled=True but r_timestep=None, the delta branch is skipped without warning.

This is intentional (the whole bit-identity-preservation mechanism depends on it) and tested (test_embedder_enabled_without_r_timestep_is_bit_identical_to_legacy). However, if a future code-path uses predict_noise (rather than predict_velocity_with_r) on an AnyFlow-trained model, it will silently produce single-timestep predictions — fine for sanity but easy to misuse.

Optional followup (not blocking this PR): emit a one-shot logger.warning_once inside WanModel.predict_noise when self.transformer.condition_embedder._r_embedder_enabled is True, so future callers know they're using the wrong entrypoint for an AnyFlow checkpoint.
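
A warn-once guard of this kind could look roughly like the sketch below. It uses only the standard `logging` module rather than FastVideo's logger, and `warn_once` and `check_entrypoint` are hypothetical names; the real follow-up may be wired differently.

```python
import logging

logger = logging.getLogger("anyflow_sketch")
_already_warned = set()


def warn_once(message: str) -> bool:
    """Log a warning the first time a message is seen; return whether it was emitted."""
    if message in _already_warned:
        return False
    _already_warned.add(message)
    logger.warning(message)
    return True


def check_entrypoint(r_embedder_enabled: bool) -> bool:
    """Guard for a single-timestep entrypoint called on a dual-timestep model."""
    if not r_embedder_enabled:
        return False  # legacy models: nothing to report, nothing changes
    return warn_once(
        "r_embedder is enabled but no r_timestep is passed on this path; "
        "the prediction falls back to single-timestep conditioning.")


emitted = [check_entrypoint(True) for _ in range(3)]
legacy_emitted = check_entrypoint(False)
```

Because the guard only logs and never alters the forward computation, it should leave the bit-identity guarantees untouched, though that would still need to be confirmed against the existing tests.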


Suggested merge path

Accept gemini's inline arXiv-typo suggestion → wait for /test distillation Buildkite lane to go green → squash merge. The other two items can land as a followup PR or be dropped at your discretion.

The explicit bit-identity tests + the real numerical-parity gate script (with assertion thresholds, not just shape checks) made this PR substantially easier to verify than typical model-additions PRs of this size.


Review from @SolitaryThinker's agent Gob (an AI reviewer). Ping @SolitaryThinker if any finding is off, contradicts your intent, or applies to a stale rebase. Full review (including verification log) is archived locally and available on request.

Collaborator

@SolitaryThinker SolitaryThinker left a comment


Thank you for the contribution!

@SolitaryThinker
Collaborator

/merge

@github-actions github-actions Bot added the ready PR is ready to merge label May 22, 2026
@mergify
Contributor

mergify Bot commented May 22, 2026

Pre-commit checks failed

Hi @Enderfga, the pre-commit checks have failed. To fix them locally:

# Install pre-commit if you haven't already
uv pip install pre-commit
pre-commit install

# Run all checks and auto-fix what's possible
pre-commit run --all-files

Common fixes:

  • yapf: yapf -i <file> (formatting)
  • ruff: ruff check --fix <file> (linting)
  • codespell: codespell --write-changes <file> (spelling)

After fixing, commit and push the changes. The checks will re-run automatically.

For future commits, pre-commit will run automatically on changed files before each commit.

- scripts/verify_anyflow_fastvideo_parity.py, scripts/demo_anyflow_14b.py:
  read ANYFLOW_LOCAL / ANYFLOW_REF / ANYFLOW_DEMO_OUT from the
  environment so the scripts are not tied to a single workstation
  layout. Drop the bespoke `srun` invocation example from the docstring
  in favour of a plain `python scripts/...` snippet.
- fastvideo/train/methods/distribution_matching/anyflow_pretrain.py:
  collapse the diff_mean if/else into a ternary so ruff SIM108 is
  happy.
@Enderfga
Author

@SolitaryThinker Quick pass on Gob's three S3 items in ef69c6f:

S3.1 (arxiv link). Keeping arxiv.org/abs/2605.13724 as-is — that is the correct ID. AnyFlow was posted to arxiv this month (2026-05), so the prefix is 2605, not 2405. The gemini suggestion pattern-matched to a different paper from 2024 with the same numeric suffix; full context in the inline thread reply.

S3.2 (hardcoded paths in scripts/). Lifted the four Path(...) constants in scripts/verify_anyflow_fastvideo_parity.py and scripts/demo_anyflow_14b.py onto env vars (ANYFLOW_LOCAL, ANYFLOW_REF, ANYFLOW_DEMO_OUT) with neutral defaults, and replaced the workstation-specific srun invocation example in each docstring with a plain PYTHONPATH=$PWD python scripts/... snippet so the repo no longer ships any contributor-specific layout.

S3.3 (silent no-op when r_embedder=True and r_timestep=None). Agreed this is worth a logger.warning_once in WanModel.predict_noise for future-proofing. Tracking as a followup PR rather than landing it here, since touching predict_noise requires a fresh pass over the bit-identity tests to keep the guards meaningful — happy to open it right after this merges.

Also folded a ruff SIM108 fix in anyflow_pretrain.py (the only finding in this PR's own code from the latest pre-commit run; the rest of the failures are a repo-wide --all-files --hook-stage manual sweep reformatting upstream files).

@Enderfga
Author

@SolitaryThinker From my side this is ready to go — the path cleanup and the ruff SIM108 fix landed in ef69c6f, and the full Buildkite matrix is green on that commit (fastcheck, full-suite, SSIM, LoRA, all train-framework lanes).

The only thing blocking the mergify gate is the pre-commit GitHub Action, which is sitting in action_required status on the new commit: first-time-contributor workflows need a maintainer to click Approve and run in the Actions tab again after each push. If you can re-trigger it, this PR is good to merge from my end, even if the check still trips on the repo-wide --all-files --hook-stage manual sweep over upstream files, the same way it did on the previous green merge round.

Thanks for the patience on the back-and-forth.

@mergify
Contributor

mergify Bot commented May 22, 2026

Pre-commit checks failed

Hi @Enderfga, the pre-commit checks have failed. To fix them locally:

# Install pre-commit if you haven't already
uv pip install pre-commit
pre-commit install

# Run all checks and auto-fix what's possible
pre-commit run --all-files

Common fixes:

  • yapf: yapf -i <file> (formatting)
  • ruff: ruff check --fix <file> (linting)
  • codespell: codespell --write-changes <file> (spelling)

After fixing, commit and push the changes. The checks will re-run automatically.

For future commits, pre-commit will run automatically on changed files before each commit.


Labels

ready PR is ready to merge scope: docs Documentation scope: infra CI, tests, Docker, build scope: model Model architecture (DiTs, encoders, VAEs) scope: training Training pipeline, methods, configs type: feat New feature or capability
