Question: Why does test_client_AR.py send 4 frames per chunk instead of 8? #46

@l1jiahao

Context

The training script (scripts/train/droid_training_full_finetune_wan22.sh) explicitly sets:

num_frame_per_block=2
num_action_per_block=24

According to the VAE encoding logic in wan_flow_matching_action_tf.py, the recommended number of input frames for subsequent chunks is 4 × num_frame_per_block. With num_frame_per_block=2, this should be 8 frames.

However, the official test client and server code both use 4 frames per chunk.

The Discrepancy

| Component | Frames per chunk | Implied num_frame_per_block |
| --- | --- | --- |
| Training script (droid_training_full_finetune_wan22.sh:117) | n/a | 2 (explicit override) |
| Base YAML config (wan_flow_matching_action_tf.yaml:14) | n/a | 1 (default) |
| test_client_AR.py:52 | 4 (4 offsets) | 1 |
| socket_test_optimized_AR.py:55 | 4 | 1 |
| eval_utils/serve_dreamzero_wan22.py:73 | 4 | 1 (comment: "matches 5B num_frame_per_block") |

The official serving/testing code appears to be written against the YAML default (num_frame_per_block=1), not the actual training configuration (num_frame_per_block=2).

What happens when 4 frames are sent with num_frame_per_block=2

In the VAE encoding path (wan_flow_matching_action_tf.py:1108-1122), a 4-frame input triggers the repeat branch:

Input: T=4

Condition check:
  (T-1)//4 = 0 ≠ 2  → skip
  T//4     = 1 ≠ 2  → enters repeat branch

repeat_factor = num_frame_per_block // (T//4) = 2 // 1 = 2

Step 1: repeat_interleave(repeats=2, dim=2)
  [f0, f1, f2, f3] → [f0, f0, f1, f1, f2, f2, f3, f3]  (8 frames)

Step 2: prepend first frame
  [f0, f0, f0, f1, f1, f2, f2, f3, f3]  (9 frames = 4×2+1) ✓

This works, but every input frame is duplicated to reach the required length, so the VAE latents carry redundant information.

What happens when 8 frames are sent (recommended)

Input: T=8

Condition check:
  (T-1)//4 = 1 ≠ 2  → skip
  T//4     = 2 == 2  → enters prepend-only branch ✓

Step 1: prepend first frame
  [f0, f0, f1, f2, f3, f4, f5, f6, f7]  (9 frames = 4×2+1) ✓

All 8 frames carry unique information. The prepended f0 duplicate ends up in latent 0, which is discarded anyway, so no information is lost.
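To make the two branches easier to compare, here is a minimal sketch of the padding logic as I understand it from the walkthroughs above. It works on plain Python lists of frame labels rather than tensors. The function name `pad_frames_for_vae` is hypothetical and this is not the code from wan_flow_matching_action_tf.py. In particular, what happens when the first condition matches is my assumption (input returned unchanged), since neither example reaches that branch.

```python
from typing import List, Sequence


def pad_frames_for_vae(frames: Sequence[str], num_frame_per_block: int) -> List[str]:
    """Hypothetical re-creation of the padding branches described in this issue."""
    t = len(frames)
    if t < 4:
        # T // 4 would be 0 and the repeat factor undefined.
        raise ValueError("need at least 4 frames")

    if (t - 1) // 4 == num_frame_per_block:
        # Assumption: input is already long enough, so nothing is added.
        return list(frames)

    work = list(frames)
    if t // 4 != num_frame_per_block:
        # Repeat branch: duplicate every frame in place,
        # like repeat_interleave along the time axis.
        repeat_factor = num_frame_per_block // (t // 4)
        work = [f for f in work for _ in range(repeat_factor)]

    # Both remaining branches prepend a copy of the first frame.
    return [work[0], *work]


four = pad_frames_for_vae(["f0", "f1", "f2", "f3"], num_frame_per_block=2)
eight = pad_frames_for_vae([f"f{i}" for i in range(8)], num_frame_per_block=2)
print(len(four), four)
print(len(eight), eight)
```

Both calls return 9 frames (4×2+1), but only the 8-frame input keeps every frame after the prepended f0 distinct.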

VAE latent comparison

4 frames (with repeat):

VAE input: [f0, f0, f0, f1, f1, f2, f2, f3, f3]
  latent 0: f0              → discarded (no loss)
  latent 1: f0, f0, f1, f1  → f0, f1 are duplicated
  latent 2: f2, f2, f3, f3  → f2, f3 are duplicated

8 frames (prepend-only):

VAE input: [f0, f0, f1, f2, f3, f4, f5, f6, f7]
  latent 0: f0              → discarded (no loss, was duplicate)
  latent 1: f0, f1, f2, f3  → 4 unique frames
  latent 2: f4, f5, f6, f7  → 4 unique frames
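The comparison above can be checked with a small sketch that splits a 9-frame VAE input into latent groups and counts the distinct frames in each. It assumes the grouping used in this issue: the first frame gets its own latent and every following 4 frames share one. `group_into_latents` is a hypothetical helper for illustration, not a function from the repository.

```python
from typing import List, Sequence


def group_into_latents(vae_input: Sequence[str]) -> List[List[str]]:
    """Split 4k+1 frames into 1 + k groups: [first frame], then chunks of 4."""
    if (len(vae_input) - 1) % 4 != 0:
        raise ValueError("expected 4k+1 frames")
    groups = [list(vae_input[:1])]
    for start in range(1, len(vae_input), 4):
        groups.append(list(vae_input[start:start + 4]))
    return groups


# VAE inputs copied from the two cases in this issue.
repeat_input = ["f0", "f0", "f0", "f1", "f1", "f2", "f2", "f3", "f3"]
prepend_input = ["f0", "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"]

repeat_unique = [len(set(g)) for g in group_into_latents(repeat_input)]
prepend_unique = [len(set(g)) for g in group_into_latents(prepend_input)]

print(repeat_unique)   # [1, 2, 2]
print(prepend_unique)  # [1, 4, 4]
```

Under that assumption, each kept latent sees 2 distinct frames in the 4-frame case and 4 distinct frames in the 8-frame case.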

Questions

  1. Is num_frame_per_block=2 the intended production configuration for the 5B model? If so, should the serving code (socket_test_optimized_AR.py, serve_dreamzero_wan22.py) be updated to use 8 frames per chunk?

  2. Is the frame duplication via repeat_interleave an intentional fallback for clients that cannot provide enough frames, or is it a sign that the client should be sending more frames?

  3. Does the frame duplication in the repeat branch noticeably degrade action prediction quality compared to sending 8 unique frames?
