This repository is divided into two main sections:
Our Survey Paper Collection - This section presents our survey, "Simulating the Real World: A Unified Survey of Multimodal Generative Models", which systematically unifies the study of 2D, video, 3D and 4D generation within a single framework.
Text2X Resources – This section continues the original Awesome-Text2X-Resources, an open collection of state-of-the-art (SOTA) and novel Text-to-X (X can be everything) methods, including papers, codes, and datasets. The goal is to track the rapid progress in this field and provide researchers with up-to-date references.
⭐ If you find this repository useful for your research or work, a star is highly appreciated!
💗 This repository is continuously updated. If you find relevant papers, blog posts, videos, or other resources that should be included, feel free to submit a pull request (PR) or open an issue. Community contributions are always welcome!
- ✨ [13 Aug 2025] Updated our survey (Version 2, 25 pages) on arXiv.
- ✨ [6 Mar 2025] Updated our survey (Version 1) on arXiv.
𝐒𝐢𝐦𝐮𝐥𝐚𝐭𝐢𝐧𝐠 𝐭𝐡𝐞 𝐑𝐞𝐚𝐥 𝐖𝐨𝐫𝐥𝐝: 𝐀 𝐔𝐧𝐢𝐟𝐢𝐞𝐝 𝐒𝐮𝐫𝐯𝐞𝐲 𝐨𝐟 𝐌𝐮𝐥𝐭𝐢𝐦𝐨𝐝𝐚𝐥 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐌𝐨𝐝𝐞𝐥𝐬
Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance + dynamics) and 3D generation (appearance + geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.
If you find this paper and repo helpful for your research, please cite it below:
@article{hu2025simulating,
title={Simulating the Real World: A Unified Survey of Multimodal Generative Models},
author={Hu, Yuqi and Wang, Longguang and Liu, Xian and Chen, Ling-Hao and Guo, Yuwei and Shi, Yukai and Liu, Ce and Rao, Anyi and Wang, Zeyu and Xiong, Hui},
journal={arXiv preprint arXiv:2503.04641},
year={2025}
}
Note
If you are new to this field, you can find clear and concise definitions of essential technical terms and concepts, such as NeRF, 3DGS, SDS, and Diffusion Models, in our Glossary.
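For instance, Score Distillation Sampling (SDS), which many of the text-to-3D works below rely on, optimizes a 3D representation by pushing its renderings toward the distribution learned by a pretrained 2D diffusion model. In the notation commonly used since DreamFusion, its gradient is

$$
\nabla_\theta \mathcal{L}_\mathrm{SDS} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t;\, y,\, t) - \epsilon\big)\,\frac{\partial \mathbf{x}}{\partial \theta} \right], \qquad \mathbf{x} = g(\theta),
$$

where $g(\theta)$ renders an image from the 3D parameters $\theta$, $\mathbf{x}_t$ is that image after noising to timestep $t$, $\hat{\epsilon}_\phi$ is the frozen diffusion model's noise prediction conditioned on the text prompt $y$, and $w(t)$ is a timestep weighting. See the Glossary for the full definitions, as well as for NeRF and 3DGS.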
Tip
Feel free to submit a pull request or contact us if you find any related papers that are not included here. The process to submit a pull request is as follows:
- a. Fork the project into your own repository.
- b. Add the Title, Paper link, Conference, Project/GitHub link in `README.md` using the following format:
  [Origin] **Paper Title** [[Paper](Paper Link)] [[GitHub](GitHub Link)] [[Project Page](Project Page Link)]
- c. Submit the pull request to this branch.
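For example, a completed entry following this format would look like the line below (the title and links are placeholders for illustration only, not a real paper):

- [arXiv 2025] **A Hypothetical Text-to-X Paper** [[Paper](https://arxiv.org/abs/0000.00000)] [[GitHub](https://github.com/username/repo)] [[Project Page](https://username.github.io/project)]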
Here are some seminal papers and models.
- Imagen: [NeurIPS 2022] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Paper] [Project Page]
- DALL-E: [ICML 2021] Zero-shot text-to-image generation [Paper] [GitHub]
- DALL-E 2: [arXiv 2022] Hierarchical Text-Conditional Image Generation with CLIP Latents [Paper]
- DALL-E 3: [Platform Link]
- DeepFloyd IF: [GitHub]
- Stable Diffusion: [CVPR 2022] High-Resolution Image Synthesis with Latent Diffusion Models [Paper] [GitHub]
- SDXL: [ICLR 2024 spotlight] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [Paper] [GitHub]
- FLUX.1: [Platform Link]
Text-to-video generation models adapt text-to-image frameworks to handle the additional dimension of dynamics in the real world. We classify these models into three categories based on different generative machine learning architectures.
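As a rough illustration of how a text-to-image backbone is commonly extended to video in the diffusion-based approaches listed below, the following PyTorch sketch shows a temporal self-attention block that mixes information only across frames, in the spirit of the temporal-inflation designs used by works such as AnimateDiff. The module name, shapes, and hyperparameters are illustrative assumptions, not code from any specific paper.

```python
import torch
import torch.nn as nn


class TemporalSelfAttention(nn.Module):
    """Self-attention over the frame axis, applied independently at each spatial location."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width) -- a video feature map.
        b, f, c, h, w = x.shape
        # Fold the spatial grid into the batch so attention only mixes frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + out  # residual connection keeps the block near-identity at init
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    block = TemporalSelfAttention(channels=64)
    video_features = torch.randn(2, 8, 64, 16, 16)  # 2 clips, 8 frames each
    print(block(video_features).shape)  # torch.Size([2, 8, 64, 16, 16])
```

In practice, such temporal blocks are interleaved with the (frozen or fine-tuned) spatial layers of a 2D U-Net or DiT, so the video model inherits image priors while learning dynamics.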
- [AIRC 2023] A Survey of AI Text-to-Image and AI Text-to-Video Generators [Paper]
- [arXiv 2024] Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [Paper]
VAE-based Approaches.
- SV2P: [ICLR 2018 Poster] Stochastic Variational Video Prediction [Paper] [Project Page]
- [arXiv 2021] FitVid: Overfitting in Pixel-Level Video Prediction [Paper] [GitHub] [Project Page]
GAN-based Approaches.
- [CVPR 2018] MoCoGAN: Decomposing Motion and Content for Video Generation [Paper] [GitHub]
- [CVPR 2022] StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 [Paper] [GitHub] [Project Page]
- DIGAN: [ICLR 2022] Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks [Paper] [GitHub] [Project Page]
- [ICCV 2023] StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation [Paper] [GitHub] [Project Page]
U-Net-based Architectures.
- [NeurIPS 2022] Video Diffusion Models [Paper] [Project Page]
- [arXiv 2022] Imagen Video: High Definition Video Generation with Diffusion Models [Paper] [Project Page]
- [arXiv 2022] MagicVideo: Efficient Video Generation With Latent Diffusion Models [Paper] [Project Page]
- [ICLR 2023 Poster] Make-A-Video: Text-to-Video Generation without Text-Video Data [Paper] [Project Page]
- GEN-1: [ICCV 2023] Structure and Content-Guided Video Synthesis with Diffusion Models [Paper] [Project Page]
- PYoCo: [ICCV 2023] Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [Paper] [Project Page]
- [CVPR 2023] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [Paper] [Project Page]
- [IJCV 2024] Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] VideoComposer: Compositional Video Synthesis with Motion Controllability [Paper] [GitHub] [Project Page]
- [ICLR 2024 Spotlight] AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [Paper] [GitHub] [Project Page]
- [CVPR 2024] Make Pixels Dance: High-Dynamic Video Generation [Paper] [Project Page]
- [ECCV 2024] Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [Paper] [Project Page]
- [SIGGRAPH Asia 2024] Lumiere: A Space-Time Diffusion Model for Video Generation [Paper] [Project Page]
Transformer-based Architectures.
- [ICLR 2024 Poster] VDT: General-purpose Video Diffusion Transformers via Mask Modeling [Paper] [GitHub] [Project Page]
- W.A.L.T: [ECCV 2024] Photorealistic Video Generation with Diffusion Models [Paper] [Project Page]
- [CVPR 2024] Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [Paper] [Project Page]
- [CVPR 2024] GenTron: Diffusion Transformers for Image and Video Generation [Paper] [Project Page]
- [ICLR 2025 Poster] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [Paper] [GitHub]
- [ICLR 2025 Spotlight] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers [Paper] [GitHub]
- VQ-GAN: [CVPR 2021 Oral] Taming Transformers for High-Resolution Image Synthesis [Paper] [GitHub]
- [CVPR 2023 Highlight] MAGVIT: Masked Generative Video Transformer [Paper] [GitHub] [Project Page]
- [ICLR 2023 Poster] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers [Paper] [GitHub]
- [ICML 2024] VideoPoet: A Large Language Model for Zero-Shot Video Generation [Paper] [Project Page]
- [ICLR 2024 Poster] Language Model Beats Diffusion - Tokenizer is key to visual generation [Paper]
- [arXiv 2024] Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [Paper] [GitHub]
- [arXiv 2024] Emu3: Next-Token Prediction is All You Need [Paper] [GitHub] [Project Page]
- [ICLR 2025 Poster] Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [Paper] [GitHub]
- [ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [Paper] [GitHub] [Project Page]
- [ICCV 2023] Pix2Video: Video Editing using Image Diffusion [Paper] [GitHub] [Project Page]
- [CVPR 2024] VidToMe: Video Token Merging for Zero-Shot Video Editing [Paper] [GitHub] [Project Page]
- [CVPR 2024] Video-P2P: Video Editing with Cross-attention Control [Paper] [GitHub] [Project Page]
- [CVPR 2024 Highlight] CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] Towards Consistent Video Editing with Text-to-Image Diffusion Models [Paper]
- [ICLR 2024 Poster] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [Paper] [GitHub] [Project Page]
- [arXiv 2024] UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [Paper] [GitHub] [Project Page]
- [TMLR 2024] AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks [Paper] [GitHub] [Project Page]
- [TPAMI 2025] ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis [Paper] [GitHub] [Project Page]
- [CVPR 2024 Highlight] ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models [Paper] [GitHub] [Project Page]
- [ICLR 2025 Poster] CameraCtrl: Enabling Camera Control for Video Diffusion Models [Paper] [GitHub] [Project Page]
- [ICLR 2025 Poster] NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer [Paper] [GitHub]
- [ICCV 2019] Everybody Dance Now [Paper] [GitHub] [Project Page]
- [ICCV 2019] Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis [Paper] [GitHub] [Project Page] [Dataset]
- [NeurIPS 2019] First Order Motion Model for Image Animation [Paper] [GitHub] [Project Page]
- [ICCV 2023] Adding Conditional Control to Text-to-Image Diffusion Models [Paper] [GitHub]
- [ICCV 2023] HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation [Paper] [GitHub] [Project Page]
- [CVPR 2023] Learning Locally Editable Virtual Humans [Paper] [GitHub] [Project Page] [Dataset]
- [CVPR 2024] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [Paper] [GitHub] [Project Page]
- [CVPRW 2024] LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [Paper] [GitHub] [Project Page]
- [IJCAI 2024] Zero-shot High-fidelity and Pose-controllable Character Animation [Paper]
- [SCIS-2025] UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [Paper] [GitHub] [Project Page]
- [CVPR 2025] MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling [Paper] [GitHub] [Project Page]
- [arXiv 2023] Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era [Paper]
- [arXiv 2024] Advances in 3D Generation: A Survey [Paper]
- [arXiv 2024] A Survey On Text-to-3D Contents Generation In The Wild [Paper]
- [arXiv 2022] 3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models [Paper] [GitHub]
- [arXiv 2022] Point-E: A System for Generating 3D Point Clouds from Complex Prompts [Paper] [GitHub]
- [arXiv 2023] Shap-E: Generating Conditional 3D Implicit Functions [Paper] [GitHub]
- [NeurIPS 2023] Michelangelo: Conditional 3D Shape Generation Based on Shape-Image-Text Aligned Latent Representation [Paper] [GitHub] [Project Page]
- [ICCV 2023] ATT3D: Amortized Text-to-3D Object Synthesis [Paper] [Project Page]
- [ICLR 2023 Spotlight] MeshDiffusion: Score-based Generative 3D Mesh Modeling [Paper] [GitHub] [Project Page]
- [CVPR 2023] Diffusion-SDF: Text-to-Shape via Voxelized Diffusion [Paper] [GitHub] [Project Page]
- [ICML 2024] HyperFields: Towards Zero-Shot Generation of NeRFs from Text [Paper] [GitHub] [Project Page]
- [ECCV 2024] LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis [Paper] [Project Page]
- [arXiv 2024] AToM: Amortized Text-to-Mesh using 2D Diffusion [Paper] [GitHub] [Project Page]
- [ICLR 2023 notable top 5%] DreamFusion: Text-to-3D using 2D Diffusion [Paper] [Project Page]
- [CVPR 2023 Highlight] Magic3D: High-Resolution Text-to-3D Content Creation [Paper] [Project Page]
- [CVPR 2023] Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models [Paper] [Project Page]
- [ICCV 2023] Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation [Paper] [GitHub] [Project Page]
- [NeurIPS 2023 Spotlight] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation [Paper] [GitHub] [Project Page]
- [ICLR 2024 Poster] MVDream: Multi-view Diffusion for 3D Generation [Paper] [GitHub] [Project Page]
- [ICLR 2024 Oral] DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation [Paper] [GitHub] [Project Page]
- [CVPR 2024] PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion [Paper]
- [CVPR 2024] VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation [Paper] [Project Page]
- [CVPR 2024] GSGEN: Text-to-3D using Gaussian Splatting [Paper] [GitHub] [Project Page]
- [CVPR 2024] GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models [Paper] [GitHub] [Project Page]
- [CVPR 2024] Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior [Paper] [GitHub] [Project Page]
- [ICLR 2024 Poster] Instant3D: Fast Text-to-3D with Sparse-view Generation and Large Reconstruction Model [Paper] [Project Page]
- [CVPR 2024] Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion [Paper] [GitHub] [Project Page]
- [CVPR 2024] Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [Paper] [GitHub] [Project Page]
- [arXiv 2023] 3DGen: Triplane Latent Diffusion for Textured Mesh Generation [Paper]
- [NeurIPS 2023] Michelangelo: Conditional 3D Shape Generation Based on Shape-Image-Text Aligned Latent Representation [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer [Paper] [GitHub] [Project Page]
- [SIGGRAPH 2024 Best Paper Honorable Mention] CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets [Paper] [GitHub] [Project Page]
- [arXiv 2024] CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner [Paper] [GitHub] [Project Page]
- [CVPR 2025] Structured 3D Latents for Scalable and Versatile 3D Generation [Paper] [GitHub] [Project Page]
- [arXiv 2023] Consistent123: Improve Consistency for One Image to 3D Object Synthesis [Paper] [Project Page]
- [arXiv 2023] ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation [Paper] [GitHub] [Project Page]
- [CVPR 2023] RealFusion: 360° Reconstruction of Any Object from a Single Image [Paper] [GitHub] [Project Page]
- [ICCV 2023] Zero-1-to-3: Zero-shot One Image to 3D Object [Paper] [GitHub] [Project Page]
- [ICLR 2024 Poster] Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors [Paper] [GitHub] [Project Page]
- [ICLR 2024 Poster] TOSS: High-quality Text-guided Novel View Synthesis from a Single Image [Paper] [GitHub] [Project Page]
- [ICLR 2024 Spotlight] SyncDreamer: Generating Multiview-consistent Images from a Single-view Image [Paper] [GitHub] [Project Page]
- [CVPR 2024] Wonder3D: Single Image to 3D using Cross-Domain Diffusion [Paper] [GitHub] [Project Page]
- [ICLR 2025] IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts [Paper] [GitHub]
- [NeurIPS 2023] One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization [Paper] [GitHub] [Project Page]
- [ECCV 2024] CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model [Paper] [GitHub] [Project Page]
- [arXiv 2024] InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models [Paper] [GitHub]
- [ICLR 2024 Oral] LRM: Large Reconstruction Model for Single Image to 3D [Paper] [Project Page]
- [NeurIPS 2024] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image [Paper] [GitHub] [Project Page]
- [CVPR 2024 Highlight] ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models [Paper] [GitHub] [Project Page]
- [ICML 2024] IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [Paper] [Project Page]
- [TPAMI 2025] V3D: Video Diffusion Models are Effective 3D Generators [Paper] [GitHub] [Project Page]
- [ECCV 2024 Oral] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion [Paper] [Project Page]
- [NeurIPS 2024 Oral] CAT3D: Create Anything in 3D with Multi-View Diffusion Models [Paper] [Project Page]
- [CVPR 2023] Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation [Paper]
- [SIGGRAPH 2023] DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance [Paper] [Project Page]
- [NeurIPS 2023] Headsculpt: Crafting 3d head avatars with text [Paper] [GitHub] [Project Page]
- [NeurIPS 2023] DreamWaltz: Make a Scene with Complex 3D Animatable Avatars [Paper] [GitHub] [Project Page]
- [NeurIPS 2023 Spotlight] DreamHuman: Animatable 3D Avatars from Text [Paper] [Project Page]
- [ACM MM 2023] RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture [Paper]
- [TVCG 2024] Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields [Paper] [GitHub] [Project Page]
- [ECCV 2024] DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling [Paper] [GitHub] [Project Page]
- [ECCV 2024] DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting [Paper] [GitHub] [Project Page]
- [arXiv 2024] Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior [Paper] [GitHub] [Project Page]
- [arXiv 2024] CityCraft: A Real Crafter for 3D City Generation [Paper] [GitHub]
- [ECCV 2022] Unified Implicit Neural Stylization [Paper] [GitHub] [Project Page]
- [ECCV 2022] ARF: Artistic Radiance Fields [Paper] [GitHub] [Project Page]
- [SIGGRAPH Asia 2022] FDNeRF: Few-shot Dynamic Neural Radiance Fields for Face Reconstruction and Expression Editing [Paper] [GitHub] [Project Page]
- [CVPR 2022] FENeRF: Face Editing in Neural Radiance Fields [Paper] [GitHub] [Project Page]
- [SIGGRAPH 2023] TextDeformer: Geometry Manipulation using Text Guidance [Paper] [GitHub] [Project Page]
- [ICCV 2023] ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces [Paper] [GitHub] [Project Page]
- [ICCV 2023 Oral] Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions [Paper] [GitHub] [Project Page]
- [CVPR 2024] Control4D: Efficient 4D Portrait Editing with Text [Paper] [Project Page]
- [NeurIPS 2024] Animate3D: Animating Any 3D Model with Multi-view Video Diffusion [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [Paper] [GitHub] [Project Page] [Dataset]
- [NeurIPS 2024] L4GM: Large 4D Gaussian Reconstruction Model [Paper] [GitHub] [Project Page]
- [ICML 2023] Text-To-4D Dynamic Scene Generation [Paper] [Project Page]
- [CVPR 2024] 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling [Paper] [GitHub] [Project Page]
- [CVPR 2024] A Unified Approach for Text- and Image-guided 4D Scene Generation [Paper] [GitHub] [Project Page]
- [CVPR 2024] Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models [Paper] [Project Page]
- [ECCV 2024] TC4D: Trajectory-Conditioned Text-to-4D Generation [Paper] [GitHub] [Project Page]
- [ECCV 2024] SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer [Paper] [GitHub] [Project Page]
- [ECCV 2024] STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [Paper] [Project Page]
- [NeurIPS 2024] Compositional 3D-aware Video Generation with LLM Director [Paper] [Project Page]
- [NeurIPS 2024] DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos [Paper] [GitHub] [Project Page]
- [NeurIPS 2024] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation [Paper] [GitHub] [Project Page]
- [arXiv 2024] Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis [Paper] [GitHub]
- [CVPR 2024] Control4D: Efficient 4D Portrait Editing with Text [Paper] [Project Page]
- [CVPR 2024] Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion [Paper] [GitHub] [Project Page]
- [SIGGRAPH 2020] Robust Motion In-betweening [Paper]
- [CVPR 2022] Generating Diverse and Natural 3D Human Motions from Text [Paper] [GitHub] [Project Page]
- [SCA 2023] Motion In-Betweening with Phase Manifolds [Paper] [GitHub]
- [CVPR 2023] T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations [Paper] [GitHub] [Project Page]
- [ICLR 2023 notable top 25%] Human Motion Diffusion Model [Paper] [GitHub] [Project Page]
- [NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language [Paper] [GitHub] [Project Page]
- [ICML 2024] HumanTOMATO: Text-aligned Whole-body Motion Generation [Paper] [GitHub] [Project Page]
- [CVPR 2024] MoMask: Generative Masked Modeling of 3D Human Motions [Paper] [GitHub] [Project Page]
- [CVPR 2024] Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives [Paper] [GitHub] [Project Page]
- [arXiv 2025] WorldModelBench: Judging Video Generation Models As World Models [Paper] [GitHub] [Project Page]
- NVIDIA Cosmos ([GitHub] [Paper]): NVIDIA Cosmos is a world foundation model platform for accelerating the development of physical AI systems.
  - Cosmos-Transfer1: a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environments.
  - Cosmos-Predict1: a collection of general-purpose world foundation models for Physical AI that can be fine-tuned into customized world models for downstream applications.
  - Cosmos-Reason1: a model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning.
- Genie 3, Google DeepMind, August 5th, 2025.
🎯Back to Top - Our Survey Paper Collection
An open collection of state-of-the-art (SOTA) and novel Text-to-X (X can be everything) methods (papers, codes and datasets), intended to keep pace with the anticipated surge of research.
- 2026.01.07 - update 2025 papers collection into docs.
2025 Update Logs:
- 2025.12.03 - update several papers accepted by NeurIPS 2025, congrats to all 🎉
- 2025.05.08 - update new layout.
- 2025.04.18 - update layout on section Related Resources.
- 2025.03.10 - CVPR 2025 Accepted Papers 🎉
- 2025.02.28 - update several papers status "CVPR 2025" to accepted papers, congrats to all 🎉
- 2025.01.23 - update several papers status "ICLR 2025" to accepted papers, congrats to all 🎉
- 2025.01.09 - update layout.
2024 Update Logs:
- 2024.12.21 - adjusted the layouts of several sections and Happy Winter Solstice ⚪🥣.
- 2024.09.26 - update several papers status "NeurIPS 2024" to accepted papers, congrats to all 🎉
- 2024.09.03 - add one new section 'text to model'.
- 2024.07.02 - update several papers status "ECCV 2024" to accepted papers, congrats to all 🎉
- 2024.06.30 - add one new section 'text to video'.
- 2024.06.21 - add one hot topic about AIGC 4D Generation to the section Survey and Awesome Repos.
- 2024.06.17 - an awesome repo for CVPR 2024 Link 👍🏻
- 2024.04.05 - adjusted the layout and added accepted lists and arXiv lists to each section.
- 2024.04.05 - an awesome repo for CVPR 2024 on 3DGS and NeRF Link 👍🏻
- 2024.03.25 - add one new survey paper of 3D GS into the section "Survey and Awesome Repos -- Topic 1: 3D Gaussian Splatting".
- 2024.03.12 - add a new section "Dynamic Gaussian Splatting", including Neural Deformable 3D Gaussians, 4D Gaussians, Dynamic 3D Gaussians.
- 2024.03.11 - CVPR 2024 Accepted Papers Link - update some papers accepted by CVPR 2024! Congratulations 🎉
Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians
Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson (University of Glasgow)
Abstract
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency.

Choreographing a World of Dynamic Objects
Yanzhe Lyu, Chen Geng, Karthik Dharmarajan, Yunzhi Zhang, Hadi Alzayer, Shangzhe Wu, Jiajun Wu (Stanford University, University of Cambridge, University of Maryland)
Abstract
Dynamic objects in our physical 4D (3D + time) world are constantly evolving, deforming, and interacting with other objects, leading to diverse 4D scene dynamics. In this paper, we present a universal generative pipeline, CHORD, for CHOReographing Dynamic objects and scenes and synthesizing this type of phenomena. Traditional rule-based graphics pipelines to create these dynamics are based on category-specific heuristics, yet are labor-intensive and not scalable. Recent learning-based methods typically demand large-scale datasets, which may not cover all object categories in interest. Our approach instead inherits the universality from the video generative models by proposing a distillation-based pipeline to extract the rich Lagrangian motion information hidden in the Eulerian representations of 2D videos. Our method is universal, versatile, and category-agnostic. We demonstrate its effectiveness by conducting experiments to generate a diverse range of multi-body 4D dynamics, show its advantage compared to existing methods, and demonstrate its applicability in generating robotics manipulation policies.

| Year | Title | ArXiv Time | Paper | Code | Project Page |
|---|---|---|---|---|---|
| 2026 | Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians | 2 Jan 2026 | Link | -- | Link |
| 2026 | Choreographing a World of Dynamic Objects | 7 Jan 2026 | Link | -- | Link |
ArXiv Papers References
% arXiv papers
@article{de2026pixel,
title={Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians},
author={de Almeida, Melonie and Ivanova, Daniela and Shi, Tong and Williamson, John H and Henderson, Paul},
journal={arXiv preprint arXiv:2601.00678},
year={2026}
}
@misc{lyu2026choreographingworlddynamicobjects,
title={Choreographing a World of Dynamic Objects},
author={Yanzhe Lyu and Chen Geng and Karthik Dharmarajan and Yunzhi Zhang and Hadi Alzayer and Shangzhe Wu and Jiajun Wu},
year={2026},
eprint={2601.04194},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.04194},
}
For more details, please check the 2025 4D Papers, including 23 accepted papers, 18 arXiv papers and 2 arXiv surveys.
For more details, please check the 2024 4D Papers, including 24 accepted papers and 10 arXiv papers.
In 2023, tasks classified as text/image-to-4D and video-to-4D generally involve producing four-dimensional data from text/image or video input. For more details, please check the 2023 4D Papers, including 6 accepted papers and 3 arXiv papers.
For more details, please check the 2025 T2V Papers, including 11 accepted papers and 16 arXiv papers.
For more details, please check the 2024 T2V Papers, including 21 accepted papers and 7 arXiv papers.
- OSS video generation models: Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence.
- Survey: The Dawn of Video Generation: Preliminary Explorations with SORA-like Models, arXiv, Project Page, GitHub Repo
For more details, please check the 2025 3D Scene Papers, including 12 accepted papers, 13 arXiv papers and 2 arXiv surveys.
For more details, please check the 2023-2024 3D Scene Papers, including 23 accepted papers and 8 arXiv papers.
Awesome Repos
- Resource1: WorldGen: Generate Any 3D Scene in Seconds
- Resource2: RTFM: A Real-Time Frame Model Blog Demo Try-on
For more details, please check the 2025 Human Motion Papers, including 14 accepted papers and 8 arXiv papers.
For more details, please check the 2023-2024 Text to Human Motion Papers, including 36 accepted papers and 6 arXiv papers.
| Motion | Info | URL | Others |
|---|---|---|---|
| AIST | AIST Dance Motion Dataset | Link | -- |
| AIST++ | AIST++ Dance Motion Dataset | Link | dance video database with SMPL annotations |
| AMASS | optical marker-based motion capture datasets | Link | -- |
AMASS
AMASS is a large database of human motion unifying different optical marker-based motion capture datasets by representing them within a common framework and parameterization. AMASS is readily useful for animation, visualization, and generating training data for deep learning.
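As a minimal sketch of how an AMASS sequence is typically consumed (assuming a single sequence file `sequence.npz` downloaded from the official site; the key names follow the commonly documented AMASS format, so verify them against your download):

```python
import numpy as np

# Hypothetical local file: one AMASS motion sequence in .npz format.
data = np.load("sequence.npz")

poses = data["poses"]                # (num_frames, 156) per-frame pose in axis-angle (SMPL-H layout)
betas = data["betas"]                # (16,) body shape coefficients of the captured subject
trans = data["trans"]                # (num_frames, 3) global root translation
fps = float(data["mocap_framerate"])

print(f"{poses.shape[0]} frames at {fps:.0f} fps, {betas.shape[0]} shape coefficients")
```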
Awesome Repos
For more details, please check the 2025 3D Human Papers, including 11 accepted papers and 2 arXiv papers.
For more details, please check the 2023-2024 3D Human Papers, including 19 accepted papers and 4 arXiv papers.
Survey and Awesome Repos
- Resource1: Awesome Digital Human
Pretrained Models
| Pretrained Models (human body) | Info | URL |
|---|---|---|
| SMPL | smpl model (smpl weights) | Link |
| SMPL-X | smplx model (smplx weights) | Link |
| human_body_prior | vposer model (smpl weights) | Link |
SMPL
SMPL is an easy-to-use, realistic model of the human body that is useful for animation and computer vision.
- version 1.0.0 for Python 2.7 (female/male, 10 shape PCs)
- version 1.1.0 for Python 2.7 (female/male/neutral, 300 shape PCs)
- UV map in OBJ format
SMPL-X
SMPL-X extends SMPL with fully articulated hands and facial expressions (55 joints, 10,475 vertices).
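A minimal sketch of instantiating these body models with the widely used `smplx` Python package is shown below; the local `models/` directory and the zero-initialized parameters are assumptions for illustration, and the official model files must be downloaded separately under their licenses.

```python
import torch
import smplx  # pip install smplx; expects downloaded model files, e.g. models/smplx/SMPLX_NEUTRAL.npz

model = smplx.create(
    model_path="models",   # assumed local folder holding the downloaded body-model weights
    model_type="smplx",
    gender="neutral",
    use_pca=False,         # full hand articulation instead of the PCA hand-pose space
)

betas = torch.zeros(1, 10)         # body shape coefficients (first 10 shape PCs)
expression = torch.zeros(1, 10)    # facial expression coefficients
output = model(betas=betas, expression=expression, return_verts=True)

print(output.vertices.shape)  # (1, 10475, 3): the 10,475 SMPL-X vertices
print(output.joints.shape)    # joint set covering the 55 body/hand/face joints plus extra landmarks
```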
🎯Back to Top - Text2X Resources
Here, other tasks refer to CAD, 3D modeling, music generation, and so on.
- [arXiv 7 Nov 2024] CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM [Paper] [GitHub] [Project Page]
- [NeurIPS 2024 Spotlight] Text2CAD: Generating Sequential CAD Designs from Beginner-to-Expert Level Text Prompts [Paper] [GitHub] [Project Page] [Dataset]
- [CVPR 2025] CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation [Paper]
- [arXiv 1 Sep 2024] FLUX that Plays Music [Paper] [GitHub]
- [ISMIR 2025] Video-Guided Text-to-Music Generation Using Public Domain Movie Collections [Paper] [Code] [Project Page]
- [ICLR Workshop on Neural Network Weights as a New Data Modality 2025] Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization [Paper]
🔥 Topic 1: 3D Gaussian Splatting
- [arXiv 6 May 2024] Gaussian Splatting: 3D Reconstruction and Novel View Synthesis, a Review [Paper]
- [arXiv 17 Mar 2024] Recent Advances in 3D Gaussian Splatting [Paper]
- [IEEE TVCG 2024] 3D Gaussian as a New Vision Era: A Survey [Paper]
- [arXiv 8 Jan 2024] A Survey on 3D Gaussian Splatting [Paper] [GitHub] [Benchmark]
- Resource1: Awesome 3D Gaussian Splatting Resources
- Resource2: 3D Gaussian Splatting Papers
- Resource3: 3DGS and Beyond Docs
🔥 Topic 2: AIGC 3D
- [arXiv 15 May 2024] A Survey On Text-to-3D Contents Generation In The Wild [Paper]
- [arXiv 2 Feb 2024] A Comprehensive Survey on 3D Content Generation [Paper] [GitHub]
- [arXiv 31 Jan 2024] Advances in 3D Generation: A Survey [Paper]
- Resource1: Awesome 3D AIGC Resources
- Resource2: Awesome-Text/Image-to-3D
- Resource3: Awesome Text-to-3D
- [CVPR 2024] GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation [Paper] [GitHub] [Project Page]
🔥 Topic 3: 3D Human & LLM 3D
- [arXiv 6 June 2024] A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation [Paper]
- [arXiv 5 Jan 2024] Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human [Paper]
- Resource1: Awesome LLM 3D
- Resource2: Awesome Digital Human
- Resource3: Awesome-Avatars
🔥 Topic 5: Physics-based AIGC
Dynamic Gaussian Splatting
- [CVPR 2024] Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction [Paper] [GitHub] [Project Page]
- [CVPR 2024] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering [Paper] [GitHub] [Project Page]
- [CVPR 2024] SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes [Paper] [GitHub] [Project Page]
- [CVPR 2024 Highlight] 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos [Paper] [GitHub] [Project Page]
- [SIGGRAPH 2024] 4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes [Paper]
- [ICLR 2024] Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting [Paper] [GitHub] [Project Page]
- [CVPR 2024 Highlight] Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle [Paper] [GitHub] [Project Page]
- [3DV 2024] Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis [Paper] [GitHub] [Project Page]
🎯Back to Top - Table of Contents
This repo is released under the MIT license.
✉️ Any additions or suggestions, feel free to contact us.

