Context
The training script (scripts/train/droid_training_full_finetune_wan22.sh) explicitly sets:
num_frame_per_block=2
num_action_per_block=24
According to the VAE encoding logic in wan_flow_matching_action_tf.py, the recommended number of input frames for subsequent chunks is 4 × num_frame_per_block. With num_frame_per_block=2, this should be 8 frames.
However, the official test client and server code both use 4 frames per chunk.
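The arithmetic behind this mismatch can be sketched in a few lines. The helper names and the `TEMPORAL_STRIDE` constant below are illustrative assumptions, not identifiers from the repository. The stride of 4 is taken from the 4 × num_frame_per_block rule quoted above.

```python
# Assumption: 4 input frames map to one latent, per the 4 x num_frame_per_block rule.
TEMPORAL_STRIDE = 4


def recommended_frames(num_frame_per_block):
    """Frames a client would send per subsequent chunk for a given block size."""
    return TEMPORAL_STRIDE * num_frame_per_block


def implied_num_frame_per_block(frames_per_chunk):
    """Block size that a given per-chunk frame count corresponds to."""
    return frames_per_chunk // TEMPORAL_STRIDE


print(recommended_frames(2))           # training script setting -> 8
print(implied_num_frame_per_block(4))  # what the clients send -> 1
```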
The Discrepancy
| Component | Frames per chunk | Implied num_frame_per_block |
| --- | --- | --- |
| Training script (droid_training_full_finetune_wan22.sh:117) | n/a | 2 (explicit override) |
| Base YAML config (wan_flow_matching_action_tf.yaml:14) | n/a | 1 (default) |
| test_client_AR.py:52 | 4 (4 offsets) | 1 |
| socket_test_optimized_AR.py:55 | 4 | 1 |
| eval_utils/serve_dreamzero_wan22.py:73 | 4 | 1 (comment: "matches 5B num_frame_per_block") |
The official serving/testing code appears to be written against the YAML default (num_frame_per_block=1), not the actual training configuration (num_frame_per_block=2).
What happens when 4 frames are sent with num_frame_per_block=2
In the VAE encoding path (wan_flow_matching_action_tf.py:1108-1122), sending 4 frames triggers the repeat branch:
Input: T=4
Condition check:
(T-1)//4 = 0 ≠ 2 → skip
T//4 = 1 ≠ 2 → enters repeat branch
repeat_factor = num_frame_per_block // (T//4) = 2 // 1 = 2
Step 1: repeat_interleave(repeats=2, dim=2)
[f0, f1, f2, f3] → [f0, f0, f1, f1, f2, f2, f3, f3] (8 frames)
Step 2: prepend first frame
[f0, f0, f0, f1, f1, f2, f2, f3, f3] (9 frames = 4×2+1) ✓
This works, but every frame is duplicated to fill the gap, so the 9-frame VAE input carries only 4 unique frames and the resulting latents hold redundant information.
What happens when 8 frames are sent (recommended)
Input: T=8
Condition check:
(T-1)//4 = 1 ≠ 2 → skip
T//4 = 2 == 2 → enters prepend-only branch ✓
Step 1: prepend first frame
[f0, f0, f1, f2, f3, f4, f5, f6, f7] (9 frames = 4×2+1) ✓
All 8 frames carry unique information. The prepended f0 duplicate ends up in latent 0, which is discarded anyway, so there is no information loss.
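The two traces above can be reproduced on plain lists of frame labels. This is a minimal sketch reconstructed from the condition checks described in this issue, not a copy of the code in wan_flow_matching_action_tf.py. The function name `pad_chunk`, the pass-through behaviour when `(T-1)//4` already matches, and the guard for fewer than 4 frames are all assumptions made for illustration. Only the two cases traced above (T=4 and T=8 with num_frame_per_block=2) are meant to be covered.

```python
def pad_chunk(frames, num_frame_per_block):
    """Mimic the traced branch structure on a list of frame labels."""
    t = len(frames)
    if t < 4:
        # Guard added for this sketch only; T//4 would be 0 below.
        raise ValueError("sketch assumes at least 4 frames")
    if (t - 1) // 4 == num_frame_per_block:
        # Assumed: input already has 4n + 1 frames and passes through.
        return list(frames)
    if t // 4 != num_frame_per_block:
        # Repeat branch: duplicate each frame in place, which is what
        # repeat_interleave does along the time axis.
        repeat_factor = num_frame_per_block // (t // 4)
        frames = [f for f in frames for _ in range(repeat_factor)]
    # Both remaining branches prepend a copy of the first frame.
    return [frames[0]] + list(frames)


four = pad_chunk(["f0", "f1", "f2", "f3"], num_frame_per_block=2)
eight = pad_chunk([f"f{i}" for i in range(8)], num_frame_per_block=2)
print(four)   # ['f0', 'f0', 'f0', 'f1', 'f1', 'f2', 'f2', 'f3', 'f3']
print(eight)  # ['f0', 'f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7']
```

Both inputs end up with 9 frames (4×2+1), but the 4-frame input reaches that length only through duplication.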
VAE latent comparison
4 frames (with repeat):
VAE input: [f0, f0, f0, f1, f1, f2, f2, f3, f3]
latent 0: f0 → discarded (no loss)
latent 1: f0, f0, f1, f1 → f0 is redundant, f1 is duplicated
latent 2: f2, f2, f3, f3 → f2, f3 are duplicated
8 frames (prepend-only):
VAE input: [f0, f0, f1, f2, f3, f4, f5, f6, f7]
latent 0: f0 → discarded (no loss, was duplicate)
latent 1: f0, f1, f2, f3 → 4 unique frames
latent 2: f4, f5, f6, f7 → 4 unique frames
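To make the comparison concrete, the grouping can be checked mechanically. The sketch below assumes the 1 + 4n layout implied by the "9 frames = 4×2+1" notes: the first frame forms latent 0 and every following group of 4 frames forms one latent. The helper names are hypothetical, and the real VAE mixes frames through convolutions rather than slicing them, so this only counts which source frames fall into each latent's window.

```python
def group_into_latents(padded, stride=4):
    """Split 4n + 1 frames into latent 0 (first frame) plus groups of `stride`."""
    if (len(padded) - 1) % stride != 0:
        raise ValueError("expected 4n + 1 frames")
    rest = padded[1:]
    return [padded[:1]] + [rest[i:i + stride] for i in range(0, len(rest), stride)]


def unique_per_latent(padded, stride=4):
    """Count distinct source frames inside each latent's window."""
    return [len(set(group)) for group in group_into_latents(padded, stride)]


repeat_input = ["f0", "f0", "f0", "f1", "f1", "f2", "f2", "f3", "f3"]
prepend_input = ["f0", "f0", "f1", "f2", "f3", "f4", "f5", "f6", "f7"]

print(unique_per_latent(repeat_input))   # [1, 2, 2]
print(unique_per_latent(prepend_input))  # [1, 4, 4]
```

Under this assumption, each retained latent sees 2 distinct frames with 4-frame input and 4 distinct frames with 8-frame input.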
Questions
- Is num_frame_per_block=2 the intended production configuration for the 5B model? If so, should the serving code (socket_test_optimized_AR.py, serve_dreamzero_wan22.py) be updated to use 8 frames per chunk?
- Is the frame duplication via repeat_interleave an intentional fallback for clients that cannot provide enough frames, or is it a sign that the client should be sending more frames?
- Does the frame duplication in the repeat branch noticeably degrade action prediction quality compared to sending 8 unique frames?