L4P is a feed-forward foundation model designed for multiple low-level 4D vision perception tasks. Given a monocular video without camera poses, L4P jointly solves several tasks using a shared video encoder backbone and lightweight, task-specific heads. The model is currently trained for depth, optical flow, 2D/3D point tracking, dynamic motion segmentation, and camera pose estimation, and can be extended to support additional tasks.
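The overall design, a shared video encoder feeding lightweight task-specific heads, can be illustrated with a short PyTorch sketch. This is a conceptual illustration only, not the L4P architecture; all module names, layer choices, and shapes below are assumptions.

```python
# Conceptual sketch of a shared video encoder with lightweight task heads.
# NOT the actual L4P implementation; names, layers, and shapes are illustrative.
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Placeholder "video encoder": (B, 3, T, H, W) -> (B, C, T, H/8, W/8).
        self.encoder = nn.Conv3d(3, feat_dim, kernel_size=(1, 8, 8), stride=(1, 8, 8))
        # Lightweight per-task heads decoding the shared features.
        self.heads = nn.ModuleDict({
            "depth": nn.Conv2d(feat_dim, 1, 1),
            "flow": nn.Conv2d(feat_dim, 2, 1),
            "dynamic_mask": nn.Conv2d(feat_dim, 1, 1),
        })

    def forward(self, video):                        # video: (B, T, 3, H, W)
        feats = self.encoder(video.transpose(1, 2))  # (B, C, T, H/8, W/8)
        b, c, t, h, w = feats.shape
        flat = feats.transpose(1, 2).reshape(b * t, c, h, w)
        # Every head reuses the same features; only the decoders are task-specific.
        return {name: head(flat).reshape(b, t, -1, h, w)
                for name, head in self.heads.items()}

model = SharedBackboneMultiTask()
outputs = model(torch.randn(1, 8, 3, 224, 224))
print({k: tuple(v.shape) for k, v in outputs.items()})
```

The appeal of this kind of design is that the expensive video features are computed once and reused by every head, so supporting a new task mainly means adding a new lightweight head.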
- [2025/11] Paper is accepted at 3DV 2026 (Oral).
- [2025/9] We released the inference code.
The codebase is based on PyTorch Lightning and Lightning CLI.
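For readers unfamiliar with Lightning CLI, a minimal entry point looks roughly like the sketch below. It is illustrative only and is not this repository's entry point; DemoModel and DemoDataModule are placeholder names.

```python
# Minimal Lightning CLI sketch (illustrative only; not this repo's entry point).
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl
from lightning.pytorch.cli import LightningCLI

class DemoModel(pl.LightningModule):
    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

class DemoDataModule(pl.LightningDataModule):
    def train_dataloader(self):
        x, y = torch.randn(64, 8), torch.randn(64, 1)
        return DataLoader(TensorDataset(x, y), batch_size=16)

if __name__ == "__main__":
    # Exposes fit/validate/test/predict subcommands plus a YAML config interface.
    LightningCLI(DemoModel, DemoDataModule)
```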
conda create -n l4p python=3.10
conda activate l4p
pip install -r env/requirements.txt
You might need to install ffmpeg for mediapy; follow the instructions here.
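To verify that mediapy can reach ffmpeg, a short sanity check like the following can help (the output path is arbitrary):

```python
# Quick sanity check that mediapy can encode video through ffmpeg.
import numpy as np
import mediapy as media

# 10 black 64x64 RGB frames; writing them exercises mediapy's ffmpeg backend.
frames = np.zeros((10, 64, 64, 3), dtype=np.uint8)
media.write_video("/tmp/ffmpeg_check.mp4", frames, fps=10)  # fails if ffmpeg is missing
print("mediapy could encode video, so ffmpeg is available")
```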
The following assumes that Docker and the NVIDIA Container Toolkit are properly installed.
Everything needed to build the Docker image locally is provided in the env directory.
To build the Docker image for local use, run: docker build . -t l4p:local -f env/Dockerfile
This will set up everything, including additional functionality for development using Docker and VSCode.
If you run into issues with the viser library, try building the image again.
To develop in VSCode with Docker, use the provided devcontainer file: .devcontainer/devcontainer.json.
Depending on your needs, you may want to update the mount paths based on where you store your data, results, SSH, and config files.
This can be done by modifying the mounts section in .devcontainer/devcontainer.json.
Once inside the Docker container, activate the conda environment with source /workspace/miniconda3/bin/activate l4p.
We provide a demo showing several examples of running the model on all the tasks we support.
Download weights and sample data using:
cd weights
bash download.sh
cd -
cd demo/data
bash download.sh
cd -
Run the demo notebook demo/demo.ipynb, or run the Python script: cd demo; python demo.py.
If you get an OutOfMemoryError, you can set the flag limit_gpu_mem_usage=True.
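If you are unsure whether your GPU needs it, you can check free GPU memory with standard PyTorch calls before enabling the flag; the 16 GB threshold below is an arbitrary example, not a value from this repository:

```python
# Check free GPU memory to decide whether to enable limit_gpu_mem_usage.
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"GPU memory free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")
    limit_gpu_mem_usage = free_bytes < 16e9  # arbitrary threshold; tune for your GPU
else:
    limit_gpu_mem_usage = True
print("limit_gpu_mem_usage =", limit_gpu_mem_usage)
```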
Below are example visualizations from our model for depth, flow and 2D tracks.
Because we estimate camera poses (with or without input intrinsics), we can visualize depth, camera poses, and 3D tracks within a consistent reference frame.
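As a rough sketch of how per-frame depth and camera poses end up in one reference frame, the standard pinhole unprojection followed by a camera-to-world transform looks like this. It is a minimal NumPy illustration, assuming a 4x4 camera-to-world pose T_wc and intrinsics K; it is not the demo's visualization code.

```python
# Unproject a depth map to world-space 3D points using intrinsics K and a
# camera-to-world pose T_wc (4x4). Illustrative only; not the demo's code.
import numpy as np

def depth_to_world_points(depth, K, T_wc):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # homogeneous pixels (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                         # camera rays (H, W, 3)
    pts_cam = rays * depth[..., None]                       # points in the camera frame
    pts_cam_h = np.concatenate([pts_cam, np.ones_like(depth)[..., None]], axis=-1)
    pts_world = pts_cam_h @ T_wc.T                          # transform to the world frame
    return pts_world[..., :3]

# Example with dummy data at the model's 224 x 224 resolution.
depth = np.full((224, 224), 2.0)
K = np.array([[200.0, 0.0, 112.0], [0.0, 200.0, 112.0], [0.0, 0.0, 1.0]])
T_wc = np.eye(4)
print(depth_to_world_points(depth, K, T_wc).shape)  # (224, 224, 3)
```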
- Our approach is limited to 224 x 224 resolution.
- Depth, camera poses, and 3D tracks are generated by different heads in a feed-forward manner, so they might not be perfectly consistent with each other.
- Our current implementation of pose alignment between overlapping windows runs on the CPU and is therefore somewhat slow; a faster version is coming soon (see the sketch below for the general idea).
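For context on what such an alignment does: a common way to stitch per-window pose estimates is to rigidly align the camera centers of frames shared by consecutive windows (a Kabsch/Umeyama-style fit). The sketch below shows that general idea under these assumptions; it is not the repository's implementation.

```python
# Rigid (SE(3)) alignment of two overlapping pose-window trajectories via the
# Kabsch algorithm on shared camera centers. General idea only; not the repo's code.
import numpy as np

def align_rigid(src, dst):
    """Find R, t such that R @ src_i + t ~= dst_i for (N, 3) camera centers."""
    src_c, dst_c = src.mean(0), dst.mean(0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # fix an improper rotation (reflection)
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

# Toy example: window B's centers are a rotated and shifted copy of window A's overlap.
rng = np.random.default_rng(0)
overlap_a = rng.normal(size=(8, 3))
angle = np.deg2rad(10.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
overlap_b = overlap_a @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = align_rigid(overlap_b, overlap_a)     # map window B into window A's frame
print(np.abs(overlap_b @ R.T + t - overlap_a).max())  # ~0
```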
The sample results shown above are from:
- Perazzi et al., A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation, In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Gao et al., Monocular Dynamic View Synthesis: A Reality Check, In Advances in Neural Information Processing Systems (NeurIPS), 2022.
@inproceedings{badki2026l4p,
title={{L4P}: {T}owards Unified Low-Level {4D} Vision Perception},
author={Badki, Abhishek and Su, Hang and Wen, Bowen and Gallo, Orazio},
booktitle={International Conference on 3D Vision (3DV)},
year={2026}
}