Hi authors,
Thank you for sharing this great work. I am currently working on reproducing the data selection pipeline from "Data Selection for Fine-tuning Vision Language Models via Cross Modal Alignment Trajectories (XMAS)" to include it as a baseline comparison in my ongoing research.
While inspecting the proxy training script (XMAS/scripts/train/finetune.sh), I noticed the step-based checkpointing (--save_steps 200). In Section 5 (Training details) of the paper, it is mentioned that training the proxy model for one epoch results in a total of $T = 7$ checkpoints.
However, since LLaVA 665k (665k instances) and Vision FLAN (186k instances) differ significantly in size, the number of training steps in one epoch differs greatly between them. As a result, the exact checkpoint steps used to form the alignment trajectories, especially for Vision FLAN, seem to be unspecified in the text.
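To illustrate the ambiguity, here is a minimal sketch of how the saved checkpoint steps depend on dataset size. It assumes that one epoch takes `ceil(num_examples / effective_batch_size)` optimizer steps and that a checkpoint is written at every multiple of `--save_steps`. The function name `checkpoint_steps` and the effective batch size of 128 are my own placeholders, not values taken from the paper or the script, and the dataset sizes are the rounded figures quoted above.

```python
import math


def checkpoint_steps(num_examples, effective_batch_size, save_steps):
    """Return the optimizer steps at which a checkpoint would be saved in one epoch.

    Assumption: steps per epoch = ceil(num_examples / effective_batch_size),
    and a checkpoint is saved at every multiple of save_steps.
    """
    total_steps = math.ceil(num_examples / effective_batch_size)
    return list(range(save_steps, total_steps + 1, save_steps))


if __name__ == "__main__":
    # Rounded dataset sizes from the text; the batch size is a hypothetical value.
    datasets = {"LLaVA 665k": 665_000, "Vision FLAN": 186_000}
    for name, num_examples in datasets.items():
        steps = checkpoint_steps(num_examples, effective_batch_size=128, save_steps=200)
        print(f"{name}: {len(steps)} checkpoints, last at step {steps[-1]}")
```

Under any fixed batch size and save interval, the two datasets yield very different numbers of checkpoints, which is why I would like to confirm the exact sequences used.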
To ensure a faithful reproduction, could you clarify the following details?
- Exact Checkpoint Sequence and Intervals:
Could you specify the exact checkpoint steps (intervals) used to construct the trajectory for each dataset?
  - For LLaVA 665k: Which exact steps form the $T=7$ sequence (e.g., [step-200, step-400, ..., step-1400])?
  - For Vision FLAN: How many checkpoints ($T$) were used in total, and what was the exact sequence of step intervals used for trajectory extraction?
- Request for Selected Data (JSON files):
If possible, could you release the final selected data files (e.g., the JSON files containing the chosen subsets for Vision FLAN and LLaVA 665k)? Having access to the exact data subsets used in your experiments would serve as a crucial ground-truth reference to verify that our reproduced pipeline is functioning exactly as intended.
These details would greatly help in properly adapting XMAS. Thank you very much for your time and guidance!
Best regards,
Hyungwook