LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction

LingBot-Map is a feed-forward 3D foundation model for streaming 3D reconstruction. It processes images or video frame-by-frame with a paged KV cache attention mechanism, producing camera poses, depth maps, and 3D point clouds in real time.

Architecture: Geometric Context Transformer (GCT) — unifies coordinate grounding, dense geometric cues, and long-range drift correction via anchor context, pose-reference window, and trajectory memory.
Performance: ~20 FPS at 518×378 resolution, stable over 10,000+ frames.
Paper: arXiv:2604.14141
License: Apache 2.0

Repository Structure

lingbot_map_cuda/
├── lingbot_map/           # Core library (pip-installable Python package)
│   ├── models/            # GCTStream, GCTBase, windowed variant
│   ├── aggregator/        # KV cache (FlashInfer/SDPA), feature aggregation
│   ├── layers/            # Attention, blocks, RoPE, patch embedding
│   ├── heads/             # Camera head (iterative refinement), DPT depth head
│   ├── utils/             # Pose encoding, geometry, image loading
│   ├── vis/               # GLB export, sky segmentation, Viser wrapper
│   └── inference.py       # Shared: load_model, postprocess, prepare_for_visualization
│
├── apps/                  # Entry points — each is a deployable unit
│   ├── cli/               # CLI demo (demo.py) + profiling (gct_profile.py)
│   ├── batch/             # Batch processing + offline rendering (batch_demo)
│   ├── rgbd_render/       # Point-cloud → video render pipeline
│   ├── viewer/            # Interactive WebSocket 3D viewer
│   └── cuda_ext/          # Compiled CUDA kernels (frustum cull, voxelize)
│
├── models/                # Binary model files
│   ├── lingbot-map-long.pt  (4.63 GB)
│   └── skyseg.onnx           (176 MB)
│
├── config/                # YAML presets for render pipeline
├── example/               # Example scenes (courthouse, university, loop, oxford)
├── assets/                # Static assets
├── docs/                  # Architecture notes, agent reference
│
├── demo.py                # Backward-compat shim → apps/cli/demo.py
├── gct_profile.py         # Backward-compat shim → apps/cli/gct_profile.py
├── demo_render/           # Backward-compat shims → apps/batch/
├── pyproject.toml         # pip install -e .
└── README.md

⚙️ Installation

1. Create conda environment

conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map

2. Install PyTorch (CUDA 12.8)

pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

3. Install lingbot-map

pip install -e .

4. Install FlashInfer (recommended)

pip install --index-url https://pypi.org/simple flashinfer-python

If FlashInfer is not installed, pass --use_sdpa to fall back to SDPA (PyTorch native attention).
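A launcher script can decide whether to add --use_sdpa by checking whether FlashInfer is importable. The sketch below is a hypothetical helper and not part of this repository; it assumes the flashinfer-python package installs a top-level module named `flashinfer`.

```python
import importlib.util


def attention_flags(flashinfer_available=None):
    """Return extra demo flags for the attention backend (hypothetical helper).

    Pass True/False to override detection, e.g. in tests.
    """
    if flashinfer_available is None:
        # Assumption: flashinfer-python is importable as "flashinfer".
        flashinfer_available = importlib.util.find_spec("flashinfer") is not None
    # With FlashInfer present no extra flag is needed; otherwise request SDPA.
    return [] if flashinfer_available else ["--use_sdpa"]


if __name__ == "__main__":
    print(attention_flags())
```

The returned list can be appended to any of the demo commands in this README.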

5. Visualization dependencies (optional)

pip install -e ".[vis]"

For sky masking support:

pip install onnxruntime        # CPU
# or
pip install onnxruntime-gpu    # GPU (faster for large image sets)

📦 Model Download

Model	Description
lingbot-map-long (`lingbot-map-long.pt`, 4.63 GB)	Best for long sequences / large-scale scenes (recommended)
lingbot-map	Balanced checkpoint for short + long sequences
lingbot-map-stage1	Stage-1 training; can be loaded into VGGT for bidirectional inference

Models are available on Hugging Face and ModelScope.

Place downloaded model files in models/.

🚀 Quick Start

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky

This launches an interactive Viser 3D viewer at http://localhost:8080. The viewer loops through all frames; press Ctrl+C to stop, or use the "Playing" checkbox to pause/resume.

The root demo.py shim also works for backward compatibility:

python demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky

Example Scenes

Four pre-packaged scenes in example/:

Scene	Frames	Type	Notes
`example/courthouse`	286	Outdoor building	Use `--mask_sky`
`example/university`	—	Outdoor campus	Use `--mask_sky`
`example/loop`	—	Loop closure	—
`example/oxford`	—	Outdoor, large scale	Use `--mask_sky`
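To run every example scene with the flags suggested in the table, the commands can be built programmatically. This is a minimal sketch: `EXAMPLE_SCENES` and `demo_command` are hypothetical names introduced here, and only flags already listed in this README are used.

```python
import shlex

# Scene name -> whether the table above suggests --mask_sky.
EXAMPLE_SCENES = {
    "courthouse": True,
    "university": True,
    "loop": False,
    "oxford": True,
}


def demo_command(scene, model_path="models/lingbot-map-long.pt", extra_flags=()):
    """Build the apps/cli/demo.py argument list for one example scene."""
    if scene not in EXAMPLE_SCENES:
        raise KeyError(f"unknown example scene: {scene}")
    cmd = [
        "python", "apps/cli/demo.py",
        "--model_path", model_path,
        "--image_folder", f"example/{scene}",
    ]
    if EXAMPLE_SCENES[scene]:
        cmd.append("--mask_sky")
    cmd.extend(extra_flags)
    return cmd


if __name__ == "__main__":
    # Print one shell-ready command per scene (10-frame SDPA smoke test).
    for name in EXAMPLE_SCENES:
        print(shlex.join(demo_command(name, extra_flags=["--use_sdpa", "--first_k", "10"])))
```

Each list can be passed to `subprocess.run` directly, or printed and pasted into a shell.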

Working Commands

Quick test (10 frames):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --use_sdpa --mask_sky --first_k 10

The model loads in ~10 s, inference takes ~3 s, and GPU memory peaks at ~4.6 GB.

Full run (286 frames, windowed mode):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse \
    --use_sdpa --mode windowed \
    --window_size 16 --overlap_size 4 --num_scale_frames 4 \
    --offload_to_cpu --mask_sky

This processes all 286 frames in ~1 minute. Peak GPU memory is ~5.6 GB.

Long sequences (>3000 frames):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --video_path video.mp4 --fps 10 \
    --mode windowed --window_size 128 --overlap_keyframes 16 --keyframe_interval 2

Headless with NPZ export:

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --use_sdpa \
    --headless --save_predictions outputs/
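The array names written by --save_predictions are not listed in this README, so it can help to inspect the export before writing downstream code. The sketch below relies only on the fact that an .npz file is a zip archive of .npy members; it assumes the export places one or more .npz files directly under the output directory, and both function names are hypothetical.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(npz_path):
    """Return the array names stored in an .npz archive without loading data."""
    with zipfile.ZipFile(npz_path) as archive:
        # Each array is stored as a "<name>.npy" member inside the zip.
        return sorted(
            name[:-4] for name in archive.namelist() if name.endswith(".npy")
        )


def summarize_outputs(output_dir):
    """Map each .npz file name in output_dir to the array names it contains."""
    return {
        path.name: list_npz_arrays(path)
        for path in sorted(Path(output_dir).glob("*.npz"))
    }


if __name__ == "__main__":
    for file_name, arrays in summarize_outputs("outputs").items():
        print(file_name, arrays)
```

Once the names are known, the arrays themselves can be read with `numpy.load`.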

Fast inference (reduced quality):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky --camera_num_iterations 1

FlashInfer (requires ≥12 GB GPU):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky

🎮 Demo Flags Reference

Inference Mode

Flag	Purpose
`--mode streaming`	Default. Frame-by-frame with KV cache. Good for ≤320 frames.
`--mode windowed`	Overlapping windows + KV cache reset per window. Required for long sequences (more than ~3000 frames).
`--window_size N`	Keyframes per window in windowed mode. Controls KV cache memory.
`--overlap_size N`	Overlap between windows in actual frames.
`--overlap_keyframes N`	Overlap in keyframes (recommended when `keyframe_interval > 1`).
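To get a feel for how --window_size and --overlap_size interact, the following sketch enumerates overlapping windows over a frame sequence. It is an illustration of the general technique, not the repository's scheduler: it assumes a keyframe interval of 1 (so keyframes and frames coincide) and that each window starts `window_size - overlap_size` frames after the previous one. `window_ranges` is a hypothetical name.

```python
def window_ranges(num_frames, window_size, overlap_size):
    """Return (start, end) index pairs, end exclusive, that cover num_frames."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must be in [0, window_size)")
    if num_frames <= 0:
        return []
    step = window_size - overlap_size  # new frames contributed by each window
    ranges = []
    start = 0
    while True:
        end = min(start + window_size, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break
        start += step
    return ranges


if __name__ == "__main__":
    # Settings from the 286-frame courthouse command in this README.
    windows = window_ranges(286, window_size=16, overlap_size=4)
    print(len(windows), windows[:2], windows[-1])
```

Under these assumptions, a larger overlap gives neighbouring windows more shared frames at the cost of more windows overall.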

Memory Management

Flag	Default	Purpose
`--use_sdpa`	off	Use PyTorch SDPA instead of FlashInfer. Required on ≤12 GB GPUs.
`--offload_to_cpu`	on	Offload per-frame predictions to CPU during inference.
`--no-offload_to_cpu`	—	Keep predictions on GPU (needs more VRAM).
`--num_scale_frames N`	8	Bidirectional scale frames. Reduce to 2–4 to cut memory.
`--kv_cache_sliding_window N`	64	Max frames in KV cache window. Reduce to 32 to cut memory.
`--keyframe_interval N`	auto	Only cache every Nth frame. Reduces KV cache growth.

Performance

Flag	Default	Purpose
`--camera_num_iterations N`	4	Camera-head refinement iterations. Setting N to 1 skips 3 refinement passes for faster inference at reduced quality.
`--compile`	off	Applies torch.compile to hot modules. About 5 FPS faster; adds a 30–60 s warmup.

Input

Flag	Purpose
`--image_folder PATH`	Folder of images (`.jpg`, `.png`, `.JPG`)
`--video_path PATH`	Video file (extracts frames at `--fps` rate)
`--first_k N`	Only load first N frames (quick tests)
`--stride N`	Take every Nth frame
`--fps N`	FPS when extracting from video (default 10)
`--rotate_clockwise_90`	Rotate images 90° CW before preprocessing
`--export_preprocessed PATH`	Export preprocessed images for inspection

Visualization

Flag	Default	Purpose
`--port`	8080	Viser viewer port
`--conf_threshold`	1.5	Min confidence for point display
`--downsample_factor`	10	Spatial downsampling of point cloud
`--point_size`	0.00001	Point size in viewer
`--mask_sky`	off	Filter sky points from point cloud (outdoor scenes)
`--sky_mask_dir PATH`	auto	Cache directory for sky masks
`--sky_mask_visualization_dir PATH`	—	Save side-by-side mask visualizations

🔑 Keyframe Interval

The KV cache stores at most 320 frames by default (the training RoPE range). When --keyframe_interval is not set:

num_frames ≤ 320 → every frame cached (interval = 1)
num_frames > 320 → auto-selected interval = ceil(num_frames / 320)

For sequences longer than ~3000 frames, switch to --mode windowed.
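The auto-selection rule above can be written out as a short function. This is a sketch of the stated formula rather than the repository's code; the function names are hypothetical, and `cached_frames` additionally assumes that caching starts at the first frame and takes every Nth frame after it.

```python
import math

MAX_CACHED_FRAMES = 320  # default KV cache capacity stated above


def auto_keyframe_interval(num_frames, max_cached=MAX_CACHED_FRAMES):
    """Interval chosen when --keyframe_interval is not set."""
    if num_frames <= max_cached:
        return 1  # every frame is cached
    return math.ceil(num_frames / max_cached)


def cached_frames(num_frames, interval):
    """Frames that end up in the cache when every interval-th frame is kept."""
    return math.ceil(num_frames / interval)


if __name__ == "__main__":
    for n in (286, 320, 321, 3000):
        k = auto_keyframe_interval(n)
        print(n, k, cached_frames(n, k))
```

For a 3000-frame sequence this gives an interval of 10 and 300 cached frames, which stays within the 320-frame limit.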

🎥 Offline Rendering

For sequences too long for the interactive viewer, use apps/batch/main.py to generate headless point-cloud flythrough MP4s. It supports multiple camera modes (follow, birdeye, static, pivot), configured via YAML presets in config/.

python apps/batch/main.py --model_path models/lingbot-map-long.pt \
    --input_folder example/ --output_folder outputs/ \
    --mode windowed --window_size 16 --overlap_size 4 \
    --config config/outdoor_large.yaml

This requires additional dependencies: open3d, kaolin, ffmpeg, and the CUDA extensions. See the source comments for setup instructions.

📖 Citation

@article{chen2026geometric,
  title={Geometric Context Transformer for Streaming 3D Reconstruction},
  author={Chen, Lin-Zhuo and Gao, Jian and Chen, Yihang and Cheng, Ka Leong and Sun, Yipengjing and Hu, Liangxiao and Xue, Nan and Zhu, Xing and Shen, Yujun and Yao, Yao and Xu, Yinghao},
  journal={arXiv preprint arXiv:2604.14141},
  year={2026}
}

✨ Acknowledgments

This work builds upon several excellent open-source projects:
