LoganOT11/lingbot_map_cuda

LingBot-Map: Geometric Context Transformer for Streaming 3D Reconstruction

LingBot-Map is a feed-forward 3D foundation model for streaming 3D reconstruction. It processes images or video frame-by-frame with a paged KV cache attention mechanism, producing camera poses, depth maps, and 3D point clouds in real time.

  • Architecture: Geometric Context Transformer (GCT) — unifies coordinate grounding, dense geometric cues, and long-range drift correction via anchor context, pose-reference window, and trajectory memory.
  • Performance: ~20 FPS at 518×378 resolution, stable over 10,000+ frames.
  • Paper: arXiv:2604.14141
  • License: Apache 2.0

Repository Structure

lingbot_map_cuda/
├── lingbot_map/           # Core library (pip-installable Python package)
│   ├── models/            # GCTStream, GCTBase, windowed variant
│   ├── aggregator/        # KV cache (FlashInfer/SDPA), feature aggregation
│   ├── layers/            # Attention, blocks, RoPE, patch embedding
│   ├── heads/             # Camera head (iterative refinement), DPT depth head
│   ├── utils/             # Pose encoding, geometry, image loading
│   ├── vis/               # GLB export, sky segmentation, Viser wrapper
│   └── inference.py       # Shared: load_model, postprocess, prepare_for_visualization
│
├── apps/                  # Entry points — each is a deployable unit
│   ├── cli/               # CLI demo (demo.py) + profiling (gct_profile.py)
│   ├── batch/             # Batch processing + offline rendering (batch_demo)
│   ├── rgbd_render/       # Point-cloud → video render pipeline
│   ├── viewer/            # Interactive WebSocket 3D viewer
│   └── cuda_ext/          # Compiled CUDA kernels (frustum cull, voxelize)
│
├── models/                # Binary model files
│   ├── lingbot-map-long.pt  (4.63 GB)
│   └── skyseg.onnx           (176 MB)
│
├── config/                # YAML presets for render pipeline
├── example/               # Example scenes (courthouse, university, loop, oxford)
├── assets/                # Static assets
├── docs/                  # Architecture notes, agent reference
│
├── demo.py                # Backward-compat shim → apps/cli/demo.py
├── gct_profile.py         # Backward-compat shim → apps/cli/gct_profile.py
├── demo_render/           # Backward-compat shims → apps/batch/
├── pyproject.toml         # pip install -e .
└── README.md

⚙️ Installation

1. Create conda environment

conda create -n lingbot-map python=3.10 -y
conda activate lingbot-map

2. Install PyTorch (CUDA 12.8)

pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128

3. Install lingbot-map

pip install -e .

4. Install FlashInfer (recommended)

pip install --index-url https://pypi.org/simple flashinfer-python

If FlashInfer is not installed, pass --use_sdpa to fall back to SDPA (PyTorch native attention).

5. Visualization dependencies (optional)

pip install -e ".[vis]"

For sky masking support:

pip install onnxruntime        # CPU
# or
pip install onnxruntime-gpu    # GPU (faster for large image sets)

📦 Model Download

| Model | Description |
|---|---|
| lingbot-map-long (lingbot-map-long.pt, 4.63 GB) | Best for long sequences / large-scale scenes (recommended) |
| lingbot-map | Balanced checkpoint for short + long sequences |
| lingbot-map-stage1 | Stage-1 training; can be loaded into VGGT for bidirectional inference |

Models are available on HuggingFace and ModelScope.

Place downloaded model files in models/.


🚀 Quick Start

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky

This launches an interactive Viser 3D viewer at http://localhost:8080. The viewer loops through all frames; press Ctrl+C to stop, or use the "Playing" checkbox to pause or resume.

The root demo.py shim also works for backward compatibility:

python demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky

Example Scenes

Four pre-packaged scenes in example/:

| Scene | Frames | Type | Notes |
|---|---|---|---|
| example/courthouse | 286 | Outdoor building | Use --mask_sky |
| example/university | | Outdoor campus | Use --mask_sky |
| example/loop | | Loop closure | |
| example/oxford | | Outdoor, large scale | Use --mask_sky |

Working Commands

Quick test (10 frames):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --use_sdpa --mask_sky --first_k 10

Model loads in ~10s, inference ~3s, GPU peaks at ~4.6 GB.

Full run (286 frames, windowed mode):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse \
    --use_sdpa --mode windowed \
    --window_size 16 --overlap_size 4 --num_scale_frames 4 \
    --offload_to_cpu --mask_sky

Processes all 286 frames in ~1 minute. Peak GPU ~5.6 GB.

Long sequences (>3000 frames):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --video_path video.mp4 --fps 10 \
    --mode windowed --window_size 128 --overlap_keyframes 16 --keyframe_interval 2

Headless with NPZ export:

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --use_sdpa \
    --headless --save_predictions outputs/

Fast inference (reduced quality):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky --camera_num_iterations 1

FlashInfer (requires ≥12 GB GPU):

python apps/cli/demo.py --model_path models/lingbot-map-long.pt \
    --image_folder example/courthouse --mask_sky
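
The commands above differ mainly in which flags they add for a given GPU and sequence length. As a rough sketch, a small helper could assemble the argument list from those two inputs. The function name `build_demo_command` and its parameters are hypothetical and not part of the repository; the flags and thresholds are the ones quoted in this README (FlashInfer needs ≥12 GB, windowed mode for more than ~3000 frames).

```python
from typing import List, Optional


def build_demo_command(
    image_folder: str,
    model_path: str = "models/lingbot-map-long.pt",
    gpu_mem_gb: Optional[float] = None,
    num_frames: Optional[int] = None,
    mask_sky: bool = False,
) -> List[str]:
    """Assemble an argv list for apps/cli/demo.py (hypothetical helper)."""
    cmd = [
        "python", "apps/cli/demo.py",
        "--model_path", model_path,
        "--image_folder", image_folder,
    ]
    # The README states FlashInfer requires >= 12 GB, so smaller GPUs use SDPA.
    if gpu_mem_gb is not None and gpu_mem_gb < 12:
        cmd.append("--use_sdpa")
    # The README recommends windowed mode beyond ~3000 frames; the window
    # settings below are copied from its long-sequence example.
    if num_frames is not None and num_frames > 3000:
        cmd += [
            "--mode", "windowed",
            "--window_size", "128",
            "--overlap_keyframes", "16",
            "--keyframe_interval", "2",
        ]
    if mask_sky:
        cmd.append("--mask_sky")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_demo_command("example/courthouse", gpu_mem_gb=8, mask_sky=True)))
```

The returned list can be passed to `subprocess.run` unchanged; the helper only builds the command and does not launch anything itself.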

🎮 Demo Flags Reference

Inference Mode

| Flag | Purpose |
|---|---|
| --mode streaming | Default. Frame-by-frame with KV cache. Good for ≤320 frames. |
| --mode windowed | Overlapping windows + KV cache reset per window. Required for long sequences. |
| --window_size N | Keyframes per window in windowed mode. Controls KV cache memory. |
| --overlap_size N | Overlap between windows in actual frames. |
| --overlap_keyframes N | Overlap in keyframes (recommended when keyframe_interval > 1). |
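
To make the windowing flags concrete, here is a minimal sketch of how overlapping windows could be laid out over a sequence. It assumes every frame is a keyframe and that each window advances by `window_size - overlap_size`; the function `window_ranges` is hypothetical, and the actual scheduling in lingbot_map may differ (for example, by counting in keyframes).

```python
from typing import List, Tuple


def window_ranges(num_frames: int, window_size: int, overlap_size: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges for overlapping windows.

    Assumption: consecutive windows share `overlap_size` frames, so the
    start index advances by window_size - overlap_size each step.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must be in [0, window_size)")
    if num_frames <= 0:
        return []

    step = window_size - overlap_size
    ranges = []
    start = 0
    while True:
        end = min(start + window_size, num_frames)
        ranges.append((start, end))
        if end >= num_frames:  # last window reached the end of the sequence
            break
        start += step
    return ranges


if __name__ == "__main__":
    # Settings from the courthouse full-run example: 286 frames, window 16, overlap 4.
    windows = window_ranges(286, window_size=16, overlap_size=4)
    print(len(windows), windows[0], windows[1], windows[-1])
```

Under these assumptions, the 286-frame courthouse run with --window_size 16 --overlap_size 4 splits into 24 windows, the last one shorter than the rest.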

Memory Management

| Flag | Default | Purpose |
|---|---|---|
| --use_sdpa | off | Use PyTorch SDPA instead of FlashInfer. Required on GPUs with less than 12 GB of memory. |
| --offload_to_cpu | on | Offload per-frame predictions to CPU during inference. |
| --no-offload_to_cpu | | Keep predictions on GPU (needs more VRAM). |
| --num_scale_frames N | 8 | Bidirectional scale frames. Reduce to 2–4 to cut memory. |
| --kv_cache_sliding_window N | 64 | Max frames in KV cache window. Reduce to 32 to cut memory. |
| --keyframe_interval N | auto | Only cache every Nth frame. Reduces KV cache growth. |

Performance

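| Flag | Default | Purpose |
|---|---|---|
| --camera_num_iterations N | 4 | Camera-head refinement iterations. Setting N to 1 gives faster inference (skips 3 refinement passes). |
| --compile | off | Applies torch.compile to hot modules. ~5 FPS faster. Adds 30–60s warmup. |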

Input

| Flag | Purpose |
|---|---|
| --image_folder PATH | Folder of images (.jpg, .png, .JPG) |
| --video_path PATH | Video file (extracts frames at --fps rate) |
| --first_k N | Only load first N frames (quick tests) |
| --stride N | Take every Nth frame |
| --fps N | FPS when extracting from video (default 10) |
| --rotate_clockwise_90 | Rotate images 90° CW before preprocessing |
| --export_preprocessed PATH | Export preprocessed images for inspection |
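
The combined effect of --image_folder, --stride, and --first_k can be previewed with a short sketch. It assumes images are sorted by filename and that the stride is applied before the first-N cut; neither assumption is confirmed by this README, and `select_frames` is a hypothetical helper rather than a function from the package.

```python
from pathlib import Path
from typing import List, Optional

# Extensions listed for --image_folder in the table above.
IMAGE_SUFFIXES = {".jpg", ".png", ".JPG"}


def select_frames(image_folder: str, stride: int = 1, first_k: Optional[int] = None) -> List[Path]:
    """List the image files that a given stride / first_k combination would keep.

    Assumptions (not taken from the repository): files are ordered by name,
    and striding happens before truncation to the first `first_k` frames.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    frames = sorted(
        p for p in Path(image_folder).iterdir()
        if p.is_file() and p.suffix in IMAGE_SUFFIXES
    )
    frames = frames[::stride]          # keep every Nth frame
    if first_k is not None:
        frames = frames[:first_k]      # then keep only the first N of those
    return frames


if __name__ == "__main__":
    for path in select_frames("example/courthouse", stride=2, first_k=10):
        print(path.name)
```

Running a preview like this before a long job is a cheap way to check how many frames a flag combination would feed to the model.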

Visualization

| Flag | Default | Purpose |
|---|---|---|
| --port | 8080 | Viser viewer port |
| --conf_threshold | 1.5 | Min confidence for point display |
| --downsample_factor | 10 | Spatial downsampling of point cloud |
| --point_size | 0.00001 | Point size in viewer |
| --mask_sky | off | Filter sky points from point cloud (outdoor scenes) |
| --sky_mask_dir PATH | auto | Cache directory for sky masks |
| --sky_mask_visualization_dir PATH | | Save side-by-side mask visualizations |

🔑 Keyframe Interval

The KV cache stores at most 320 frames by default (the training RoPE range). When --keyframe_interval is not set:

  • num_frames ≤ 320 → every frame cached (interval = 1)
  • num_frames > 320 → auto-selected interval = ceil(num_frames / 320)

For sequences longer than ~3000 frames, switch to --mode windowed.
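
The auto-selection rule above is a ceiling division, sketched here for illustration. The function name `auto_keyframe_interval` is hypothetical; the 320-frame limit and the formula come from this section.

```python
import math

MAX_CACHED_FRAMES = 320  # default KV cache capacity quoted above


def auto_keyframe_interval(num_frames: int, max_cached: int = MAX_CACHED_FRAMES) -> int:
    """Interval used when --keyframe_interval is not set, per the rule above."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Sequences of up to max_cached frames cache every frame (interval 1);
    # longer ones cache every ceil(num_frames / max_cached)-th frame.
    return max(1, math.ceil(num_frames / max_cached))


if __name__ == "__main__":
    for n in (286, 320, 321, 3000):
        print(n, auto_keyframe_interval(n))
```

For example, the 286-frame courthouse scene caches every frame, while a 3000-frame sequence would cache every 10th frame.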


🎥 Offline Rendering

For sequences too long for the interactive viewer, use apps/batch/main.py for headless point-cloud flythrough MP4 generation. Supports multiple camera modes (follow, birdeye, static, pivot) configured via YAML presets in config/.

python apps/batch/main.py --model_path models/lingbot-map-long.pt \
    --input_folder example/ --output_folder outputs/ \
    --mode windowed --window_size 16 --overlap_size 4 \
    --config config/outdoor_large.yaml

Requires additional dependencies: open3d, kaolin, ffmpeg, CUDA extensions. See source comments for setup instructions.


📖 Citation

@article{chen2026geometric,
  title={Geometric Context Transformer for Streaming 3D Reconstruction},
  author={Chen, Lin-Zhuo and Gao, Jian and Chen, Yihang and Cheng, Ka Leong and Sun, Yipengjing and Hu, Liangxiao and Xue, Nan and Zhu, Xing and Shen, Yujun and Yao, Yao and Xu, Yinghao},
  journal={arXiv preprint arXiv:2604.14141},
  year={2026}
}

✨ Acknowledgments

This work builds upon several excellent open-source projects:
