Paper | Project Page | Model | Datasets | Interactive Demo | Citation
FoundationMotion offers a scalable way to curate detailed motion datasets, enabling effective fine-tuning of diverse models (VLM / VLA / World Models) to improve motion and spatial reasoning.
If you want to construct datasets using our dataset curation pipeline, see the installation instructions in data_pipeline/README.md.
If you want to use our fine-tuned model, install the dependencies:
pip install fire tqdm huggingface-hub
pip install -U git+https://github.com/NVlabs/VILA
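Optionally, you can pre-download the fine-tuned checkpoint before running anything. This is a minimal sketch, not part of the official setup: it assumes the checkpoint is hosted on the Hugging Face Hub under the repo id used as --model_path later in this README, and it uses the standard huggingface-cli that ships with huggingface-hub.

```bash
# Optional sketch: cache the fine-tuned checkpoint locally.
# Repo id taken from the --model_path used in the evaluation command below.
huggingface-cli download WoWolf/nvila_15b_video-fm-tuned
```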
Follow the instructions in data_pipeline/README.md to set up the video files you want to process. Customize your paths and settings in data_pipeline/scripts/config.sh.
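The exact variables live in data_pipeline/scripts/config.sh itself; the snippet below is only a hypothetical illustration of the kind of path settings you would adapt, not the real contents of that file.

```bash
# Hypothetical example only -- check data_pipeline/scripts/config.sh for the
# actual variable names. You typically point the pipeline at your input videos
# and choose an output location.
VIDEO_DIR=/path/to/your/videos        # input videos to process
OUTPUT_DIR=/path/to/pipeline/output   # trajectories, captions, and QA pairs land here
```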
Run:

bash data_pipeline/scripts/submit_ranges.sh

This starts processing the video data. Edit `submit_range 0 60` to set the range of videos to process: 0 is the starting index and 60 is the ending index. You can submit multiple jobs with different or even overlapping ranges; we handle the rest for you. Just submit your jobs and adjust the start/end values as needed.
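As an illustration, assuming `submit_range` takes a start index and an end index (as the default `submit_range 0 60` call suggests), splitting a larger collection across jobs inside submit_ranges.sh could look like this sketch:

```bash
# Hedged sketch of range calls inside data_pipeline/scripts/submit_ranges.sh.
# Assumes submit_range <start> <end>; overlapping ranges are allowed, as noted above.
submit_range 0 60      # videos 0 through 60
submit_range 50 120    # overlaps the previous range; that's fine
submit_range 120 200
```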
To evaluate the fine-tuned model, run:

python eval/vila_motionbench.py \
--task="robotics_hand_eval" \
--base_dir="~/workspace/v2-dev" \
--model_path="WoWolf/nvila_15b_video-fm-tuned"
Full Huggingface Demo (this demo is also hosted on Huggingface Spaces)

Run the demo:
python app.py

Drag a video, ask a question, and get an answer.
- data_pipeline/process_single_video.py - script to process a single video to get trajectories, captions, and question–answer pairs.
python process_single_video.py --video_path /path/to/video.mp4 --base_output_dir /path/to/output
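To label more than one clip, a plain shell loop over the single-video script is enough; this is just a sketch and assumes the two documented flags above are all that is needed.

```bash
# Sketch: run the single-video pipeline over every .mp4 in a folder,
# using only the documented --video_path and --base_output_dir flags.
for v in /path/to/videos/*.mp4; do
  python process_single_video.py --video_path "$v" --base_output_dir /path/to/output
done
```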
- examples/demo_nvila.py - script to process a single video using our model.

python demo_nvila.py --video_path /path/to/video.mp4 --prompt "Your question here"

If you use our work or our implementation in this repo, or find them helpful, please consider citing us in the following format.
@misc{gan2025foundationmotion,
title={FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos},
author={Yulu Gan and Ligeng Zhu and Dandan Shan and Baifeng Shi and Hongxu Yin and Boris Ivanovic and Song Han and Trevor Darrell and Jitendra Malik and Marco Pavone and Boyi Li},
year={2025},
eprint={2512.10927},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.10927},
}
