JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
Shuang Zeng1,2, Dekang Qi1, Xinyuan Chang1, Feng Xiong1, Shichao Xie1, Xiaolong Wu1, Shiyi Liang1,2, Mu Xu1, Xing Wei2
1Amap, Alibaba Group, 2Xi’an Jiaotong University
JanusVLN is a novel VLN framework and the first to feature a dual implicit memory. Inspired by the implicit scene representation in human navigation, which integrates left-brain semantic understanding with right-brain spatial cognition, JanusVLN constructs two complementary, fixed-size, compact neural memories. JanusVLN steers VLN research away from the 2D-semantics-dominant paradigm toward 3D spatial-semantic synergy, a critical direction for developing next-generation spatially aware embodied agents.
[2025-11-06] Incorrect weights were previously uploaded for the JanusVLN_Extra model. If you want to run inference directly, please re-download the correct weights from JanusVLN_Extra.
Create the required environment through the following steps:
git clone https://github.com/MIV-XJTU/JanusVLN.git && cd JanusVLN
conda create -n janusvln python=3.9 -y && conda activate janusvln
conda install habitat-sim==0.2.4 withbullet headless -c conda-forge -c aihabitat
git clone --branch v0.2.4 https://github.com/facebookresearch/habitat-lab.git
cd habitat-lab
pip install -e habitat-lab
pip install -e habitat-baselines
cd ..
# CUDA 12.4
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Install JanusVLN
pip install -e .
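Optionally, you can sanity-check the environment before moving on. The snippet below is only a suggestion, not part of the repo:

```python
# Optional sanity check (suggestion only): verify that the CUDA-enabled PyTorch build
# and habitat-sim are importable in the janusvln environment.
import torch
import habitat_sim

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("habitat-sim:", habitat_sim.__version__)
```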
1. Scene Datasets

- For R2R, RxR: Download the MP3D scenes from the official project page and place them under data/scene_datasets/mp3d/.
- For ScaleVLN: Download the HM3D scenes from the official GitHub page and place the train split under data/scene_datasets/hm3d/.
2. VLN-CE Episodes
Download the VLN-CE episodes and extract them into the data/datasets/ directory:
- r2r (Rename R2R_VLNCE_v1-3_preprocessed/ -> r2r/)
- rxr (Rename RxR_VLNCE_v0/ -> rxr/)
- scalevln (Follow StreamVLN to convert a subset of the ScaleVLN dataset into the VLN-CE format.)
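If you extracted the archives under their original folder names, a small helper like the one below applies the renames listed above; the paths are taken directly from those instructions, so adjust them if your layout differs:

```python
# Helper sketch: apply the directory renames described above (adjust paths if yours differ).
from pathlib import Path

renames = {
    "data/datasets/R2R_VLNCE_v1-3_preprocessed": "data/datasets/r2r",
    "data/datasets/RxR_VLNCE_v0": "data/datasets/rxr",
}
for src, dst in renames.items():
    src, dst = Path(src), Path(dst)
    if src.exists() and not dst.exists():
        src.rename(dst)
```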
3. Collected Trajectory Data
We provide pre-collected observation-action trajectory data for training. R2R and RxR data are collected following VLN-CE, and ScaleVLN data is collected following StreamVLN. DAgger data is collected using JanusVLN_Base; note that it is best to collect DAgger data with your own base model. Download the collected trajectory data from ModelScope and extract it into the data/trajectory_data/ and data/dagger_data/ directories.
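For context, DAgger-style collection rolls out the current policy in the simulator while labeling every visited state with an oracle (e.g., shortest-path) action, then aggregates those labeled states into the training set. The sketch below is purely conceptual; the environment, policy, and oracle interfaces are placeholders and do not correspond to scripts/dagger.sh:

```python
# Conceptual DAgger collection loop; all interfaces here are placeholders, not the repo's APIs.
def collect_dagger_episode(env, policy, oracle, max_steps=500):
    """Roll out the learner's policy but record the oracle's action for every visited state."""
    samples = []
    obs = env.reset()
    for _ in range(max_steps):
        expert_action = oracle.best_action(env.current_state())  # e.g., shortest-path follower
        samples.append((obs, expert_action))                     # train on expert labels
        obs, done = env.step(policy.act(obs))                    # but step with the learner's action
        if done:
            break
    return samples
```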
Your final folder structure should look like this:
data/
├── datasets/
│ ├── r2r/
│ │ ├── train/
│ │ ├── val_seen/
│ │ │ └── val_seen.json.gz
│ │ └── val_unseen/
│ │ └── val_unseen.json.gz
│ ├── rxr/
│ │ ├── train/
│ │ ├── val_seen/
│ │ │ ├── val_seen_guide.json.gz
│ │ │ └── ...
│ │ └── val_unseen/
│ │ ├── val_unseen_guide.json.gz
│ │ └── ...
│ └── scalevln/
│ └── scalevln_subset_150k.json.gz
├── scene_datasets/
│ ├── hm3d/
│ │ ├── 00000-kfPV7w3FaU5/
│ │ ├── 00001-UVdNNRcVyV1/
│ │ └── ...
│ └── mp3d/
│ ├── 17DRP5sb8fy/
│ ├── 1LXtFkjw3qL/
│ └── ...
├── trajectory_data/
│ ├── R2R-CE-640x480/
│ │ └── images/
│ ├── RxR-CE-640x480/
│ │ └── images/
│ └── ScaleVLN/
│ ├── images/
│ └── annotations.json
└── dagger_data/
├── R2R/
│ ├── images/
│ └── annotations.json
└── RxR/
├── images/
        └── annotations.json

4. Build Datasets
Construct a base dataset that only includes R2R-CE and RxR-CE:
python create_data/create_data.py

Finally, the dataset information needs to be configured in src/qwen_vl/data/__init__.py.
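The exact registration format is defined by the repo, so mirror the entries already present in src/qwen_vl/data/__init__.py. As a rough, hypothetical illustration only (the variable name, keys, and paths below are assumptions, not the actual interface), an entry typically maps a dataset name to its annotation file and image root:

```python
# Hypothetical sketch only -- follow the existing entries in src/qwen_vl/data/__init__.py.
# Key names and paths here are assumptions based on the folder layout above.
R2R_RXR_BASE = {
    "annotation_path": "data/trajectory_data/annotations.json",  # assumed location of the built annotations
    "data_path": "data/trajectory_data/",                        # assumed image root
}

data_dict = {
    "r2r_rxr_base": R2R_RXR_BASE,  # dataset name referenced by the training config (assumed)
}
```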
We provide two sets of JanusVLN model weights, distinguished by whether extra data is used:
| Model | Data | Name |
|---|---|---|
| JanusVLN | R2R-CE, RxR-CE | JanusVLN_Base |
| JanusVLN | R2R-CE, RxR-CE, DAgger, ScaleVLN | JanusVLN_Extra |
- Base Training

  Use the base data to train the base model:

  bash scripts/train.sh

- DAgger Collection

  Collect DAgger data using the base model:

  bash scripts/dagger.sh

  Then construct the extra dataset:

  python create_data/create_data.py --use_extra_data

  It is also necessary to configure the dataset information in src/qwen_vl/data/__init__.py.

- Extra Training

  Continue training on the extra data on top of the base model:

  bash scripts/train_extra.sh
Use multiple GPUs to run inference for evaluation:

bash scripts/evaluation.sh
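For reference, VLN-CE results are typically reported with Success Rate (SR) and Success weighted by Path Length (SPL). The sketch below shows the standard computation; the per-episode field names are illustrative and not the repo's output format:

```python
# Standard VLN-CE metrics; episode field names are illustrative, not the repo's output format.
def compute_metrics(episodes, success_dist=3.0):
    """episodes: dicts with final distance to goal, agent path length, and shortest-path length (meters)."""
    n = len(episodes)
    sr = spl = 0.0
    for ep in episodes:
        if ep["final_dist_to_goal"] <= success_dist:
            sr += 1.0
            spl += ep["shortest_path_length"] / max(ep["path_length"], ep["shortest_path_length"])
    return {"SR": sr / n, "SPL": spl / n}
```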
If you find JanusVLN useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry:

@article{zeng2025janusvln,
title={JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation},
author={Zeng, Shuang and Qi, Dekang and Chang, Xinyuan and Xiong, Feng and Xie, Shichao and Wu, Xiaolong and Liang, Shiyi and Xu, Mu and Wei, Xing},
journal={arXiv preprint arXiv:2509.22548},
year={2025}
}
Our work is primarily built on the following codebases: Qwen2.5-VL, VGGT, StreamVLN, and VG-LLM. We are sincerely grateful for their work.
