JanusVLN Logo

JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

Shuang Zeng1,2, Dekang Qi1, Xinyuan Chang1, Feng Xiong1, Shichao Xie1, Xiaolong Wu1, Shiyi Liang1,2, Mu Xu1, Xing Wei2

1Amap, Alibaba Group, 2Xi’an Jiaotong University

arXiv Website Video HuggingFace HuggingFace HF Demo

janusvln.mp4

💡 Introduction

JanusVLN is a novel VLN framework and the first to feature a dual implicit memory. Inspired by the implicit scene representation in human navigation, which integrates left-brain semantic understanding with right-brain spatial cognition, JanusVLN constructs two complementary, fixed-size, compact neural memories. JanusVLN steers VLN research away from the 2D-semantics-dominant paradigm and toward 3D spatial-semantic synergy, a critical direction for developing next-generation spatial embodied agents.

JanusVLN
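The snippet below is a conceptual sketch, not the repository's implementation: it only illustrates the idea of two complementary, fixed-size implicit memories (semantic and spatial) that cache per-frame features. The retention rule (keep the earliest frames plus a sliding window of recent frames) and all class and parameter names are assumptions for illustration.

```python
# Conceptual sketch only (NOT the repository's implementation): two
# complementary, fixed-size implicit memories, one caching semantic features
# and one caching spatial-geometric features. Fixed size is maintained here
# with a simple "earliest frames + sliding window of recent frames" rule,
# which is an assumption for illustration.
from collections import deque

import torch


class ImplicitMemory:
    def __init__(self, num_initial: int = 4, num_recent: int = 8):
        self.initial = []                        # features of the earliest frames
        self.recent = deque(maxlen=num_recent)   # sliding window of recent frames
        self.num_initial = num_initial

    def update(self, feat: torch.Tensor) -> None:
        if len(self.initial) < self.num_initial:
            self.initial.append(feat)
        else:
            self.recent.append(feat)             # oldest "recent" frame is evicted

    def read(self) -> torch.Tensor:
        # Concatenate all cached per-frame features along the token dimension.
        return torch.cat(self.initial + list(self.recent), dim=0)


class DualImplicitMemory:
    """Semantic memory and spatial memory, updated and read together."""

    def __init__(self):
        self.semantic = ImplicitMemory()
        self.spatial = ImplicitMemory()

    def update(self, semantic_feat: torch.Tensor, spatial_feat: torch.Tensor) -> None:
        self.semantic.update(semantic_feat)
        self.spatial.update(spatial_feat)

    def read(self):
        return self.semantic.read(), self.spatial.read()
```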

📢 News

[2025-11-06] The weights previously uploaded for the JanusVLN_Extra model were incorrect. If you want to run inference directly, please re-download the corrected weights from JanusVLN_Extra.


🛠️ Installation

Create the required environment through the following steps:

git clone https://github.com/MIV-XJTU/JanusVLN.git && cd JanusVLN

conda create -n janusvln python=3.9 -y && conda activate janusvln

conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat

git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
cd ..

# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

pip install -r requirements.txt
# Install JanusVLN
pip install -e .
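As an optional sanity check after installation (a suggestion rather than part of the repository's scripts), you can verify that the key dependencies import correctly and that PyTorch sees the GPU:

```python
# Optional post-install sanity check (not part of the repository's scripts).
import habitat_sim
import torch

print("habitat-sim:", habitat_sim.__version__)    # expected: 0.2.4
print("torch:", torch.__version__)                # expected: 2.5.1 (+cu124)
print("CUDA available:", torch.cuda.is_available())
```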

📦 Data Preparation

1. Scene Datasets

  • For R2R, RxR: Download the MP3D scenes from the official project page, and place them under data/scene_datasets/mp3d/.
  • For ScaleVLN: Download the HM3D scenes from the official github page, and place the train split under data/scene_datasets/hm3d/.

2. VLN-CE Episodes
Download the VLN-CE episodes and extract them into the data/datasets/ directory, renaming the folders as noted below (a small rename sketch follows the list):

  • r2r (Rename R2R_VLNCE_v1-3_preprocessed/ -> r2r/)
  • rxr (Rename RxR_VLNCE_v0/ -> rxr/)
  • scalevln (Follow StreamVLN to convert a subset of the ScaleVLN dataset into the VLN-CE format.)
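For the R2R and RxR episodes, the renaming can be done manually or with a short script. The sketch below assumes the archives have already been extracted into the current directory; adjust the source paths to wherever you extracted them. (The ScaleVLN conversion follows StreamVLN and is not covered here.)

```python
# Minimal sketch: move the extracted VLN-CE episode folders into data/datasets/
# under the names expected by this repository. Source locations are assumptions.
from pathlib import Path
import shutil

datasets_dir = Path("data/datasets")
datasets_dir.mkdir(parents=True, exist_ok=True)

# (extracted folder, target name) pairs from the list above
renames = [
    ("R2R_VLNCE_v1-3_preprocessed", "r2r"),
    ("RxR_VLNCE_v0", "rxr"),
]
for src, dst in renames:
    src_path = Path(src)              # assumed extraction location
    dst_path = datasets_dir / dst
    if src_path.exists() and not dst_path.exists():
        shutil.move(str(src_path), str(dst_path))
```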

3. Collected Trajectory Data

We provide pre-collected observation-action trajectory data for training. R2R and RxR data are collected following VLN-CE, ScaleVLN data is collected following StreamVLN, and DAgger data is collected using JanusVLN_Base. Note: it is best to collect DAgger data with your own base model. Download the collected trajectory data from ModelScope and extract it into the data/trajectory_data/ and data/dagger_data/ directories.

Your final folder structure should look like this:

data/
├── datasets/
│   ├── r2r/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   └── val_seen.json.gz
│   │   └── val_unseen/
│   │       └── val_unseen.json.gz
│   ├── rxr/
│   │   ├── train/
│   │   ├── val_seen/
│   │   │   ├── val_seen_guide.json.gz
│   │   │   └── ...
│   │   └── val_unseen/
│   │       ├── val_unseen_guide.json.gz
│   │       └── ...
│   └── scalevln/
│       └── scalevln_subset_150k.json.gz
├── scene_datasets/
│   ├── hm3d/
│   │   ├── 00000-kfPV7w3FaU5/
│   │   ├── 00001-UVdNNRcVyV1/
│   │   └── ...
│   └── mp3d/
│       ├── 17DRP5sb8fy/
│       ├── 1LXtFkjw3qL/
│       └── ...
├── trajectory_data/
│   ├── R2R-CE-640x480/
│   │   └── images/   
│   ├── RxR-CE-640x480/
│   │   └── images/ 
│   └── ScaleVLN/
│       ├── images/
│       └── annotations.json
└── dagger_data/
    ├── R2R/
    │   ├── images/
    │   └── annotations.json
    └── RxR/
        ├── images/
        └── annotations.json

4. Build Datasets

Construct a base dataset that only includes R2R-CE and RxR-CE:

python create_data/create_data.py

Finally, the dataset information needs to be configured in the file src/qwen_vl/data/__init__.py.
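The exact registration format is defined by the existing entries in src/qwen_vl/data/__init__.py; the snippet below is only a hypothetical illustration of the typical pattern (the variable names, dictionary keys, and paths are placeholders, not the repository's actual schema), mapping a dataset name to its annotation file and image root:

```python
# Hypothetical sketch of registering a dataset in src/qwen_vl/data/__init__.py.
# Names, keys, and paths are placeholders; mirror the entries that already
# exist in that file and point them at the outputs of create_data/create_data.py.
R2R_CE = {
    "annotation_path": "data/<created_annotations>/r2r_ce.json",  # placeholder
    "data_path": "data/trajectory_data/R2R-CE-640x480/images",
}

data_dict = {
    "r2r_ce": R2R_CE,
    # register RxR-CE (and later ScaleVLN / DAgger data) in the same way
}
```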

🏆 Model Zoo

We provide two sets of JanusVLN model weights, distinguished by whether additional data is used:


| Model | Data | Name |
| --- | --- | --- |
| JanusVLN | R2R-CE, RxR-CE | JanusVLN_Base |
| JanusVLN | R2R-CE, RxR-CE, DAgger, ScaleVLN | JanusVLN_Extra |

🚀 Training

  1. Base Training

    Use the base data to train the base model:

    bash scripts/train.sh
  2. DAgger Collection

    Collect DAgger data using the base model:

    bash scripts/dagger.sh

    Construct extra dataset:

    python create_data/create_data.py --use_extra_data

    It is also necessary to configure the dataset information in the file src/qwen_vl/data/__init__.py.

  3. Extra Training

    Continue training the base model on the extra data:

    bash scripts/train_extra.sh

📈 Evaluation

Run inference on multiple GPUs to evaluate the model:

bash scripts/evaluation.sh

📜 Citing

If you find JanusVLN useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:

@article{zeng2025janusvln,
  title={JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation},
  author={Zeng, Shuang and Qi, Dekang and Chang, Xinyuan and Xiong, Feng and Xie, Shichao and Wu, Xiaolong and Liang, Shiyi and Xu, Mu and Wei, Xing},
  journal={arXiv preprint arXiv:2509.22548},
  year={2025}
}

🙏 Acknowledgement

Our work is primarily built on the following codebases: Qwen2.5-VL, VGGT, StreamVLN, and VG-LLM. We are sincerely grateful for their work.
