This repo tracks recent advances in Vision-and-Language Navigation research. Please check out our ACL 2022 VLN survey paper for the categorization approach and detailed discussions of tasks, methods, and future directions: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions.
A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and it has received increasing attention from the natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, and methods. Through structured analysis of current progress and challenges, we highlight the limitations of current VLN research and opportunities for future work. This paper serves as a thorough reference for the VLN research community.
# Awesome Vision-and-Language Navigation
- [R2R]: Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, CVPR 2018 [paper]
- [CHAI]: Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, EMNLP 2018 [paper]
- [LANI]: Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, EMNLP 2018 [paper]
- Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning, RSS 2018 [paper]
- [RoomNav]: Building Generalizable Agents with a Realistic and Rich 3D Environment, arXiv 2018 [paper]
- [EmbodiedQA]: Embodied Question Answering, CVPR 2018 [paper]
- [IQA]: Visual Question Answering in Interactive Environments, CVPR 2018 [paper]
- [Room-for-Room]: Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation, ACL 2019 [paper]
- [XL-R2R]: Cross-Lingual Vision-Language Navigation, arXiv 2019 [paper]
- [Touchdown]: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, CVPR 2019 [paper]
- The StreetLearn Environment and Dataset, arXiv 2019 [paper]
- Learning To Follow Directions in Street View, arXiv 2019 [paper]
- [Room-Across-Room]: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding, EMNLP 2020 [paper]
- [VLNCE]: Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments, ECCV 2020 [paper]
- [Retouchdown]: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View, Spatial Language Understanding Workshop 2020 [paper]
- [REVERIE]: Remote Embodied Visual Referring Expression in Real Indoor Environments, CVPR 2020 [paper]
- [ALFRED]: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks, CVPR 2020 [paper]
- [Landmark-RxR]: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision, NeurIPS 2021 [paper]
- Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation, ICRA 2021 [Project Page] [arXiv] [GitHub]
- [Talk2Nav]: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory, IJCV 2021 [paper]
- [Habitat-Matterport]: 1000 Large-scale 3D Environments for Embodied AI, NeurIPS 2021 [paper]
- [SOON]: Scenario Oriented Object Navigation with Graph-based Exploration, CVPR 2021 [paper]
- [ZInD]: Zillow Indoor Dataset: Annotated Floor Plans with 360° Panoramas and 3D Room Layouts, CVPR 2021 [paper]
- [VNLA]: Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention, CVPR 2019 [paper]
- [HANNA]: Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning, EMNLP 2019 [paper]
- [CEREALBAR]: Executing Instructions in Situated Collaborative Interactions, ACL 2019 [paper]
- [Just Ask]: An Interactive Learning Framework for Vision and Language Navigation, AAAI 2020 [paper]
- [Talk the Walk]: Navigating New York City through Grounded Dialogue, arXiv 2018 [paper]
- [CVDN]: Vision-and-Dialog Navigation, CoRL 2019 [paper]
- Collaborative Dialogue in Minecraft, ACL 2019 [paper]
- [RobotSlang]: The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation, CoRL 2020 [paper]
- [TEACh]: Task-driven Embodied Agents that Chat, AAAI 2022 [paper]
- [DialFRED]: Dialogue-Enabled Agents for Embodied Instruction Following, RA-L 2022 [paper]
- [Don't Copy the Teacher]: Data and Model Challenges in Embodied Dialogue, EMNLP 2022 [paper]
- [AVDN]: Aerial Vision-and-Dialog Navigation, ACL 2023 [paper]
Here we introduce papers that propose new evaluation metrics.

- On Evaluation of Embodied Navigation Agents, arXiv 2018 [paper]
- Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, CVPR 2019 [paper]
- Vision-and-Dialog Navigation, CoRL 2019 [paper]
- Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation, ACL 2019 [paper]
- General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping, arXiv 2019 [paper]
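To make the metrics concrete, the widely used Success weighted by Path Length (SPL) measure from "On Evaluation of Embodied Navigation Agents" can be sketched in a few lines. This is a minimal illustration with our own function and variable names, not the authors' reference implementation:

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length (SPL) averaged over N episodes.

    successes:        1 if the episode reached the goal, else 0
    shortest_lengths: shortest-path (geodesic) distance from start to goal
    path_lengths:     length of the path the agent actually traveled
    """
    assert len(successes) == len(shortest_lengths) == len(path_lengths)
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        # A successful episode is credited by how close its path is to optimal;
        # failed episodes contribute zero.
        total += s * l / max(p, l)
    return total / len(successes)
```

For example, an agent that succeeds optimally on one episode, fails a second, and succeeds on a third with a path twice the shortest length scores `spl([1, 0, 1], [10.0, 5.0, 8.0], [10.0, 7.0, 16.0])` = 0.5.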
- Robust Navigation with Language Pretraining and Stochastic Sampling, EMNLP 2019 [paper]
- Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments, ECCV 2020 [paper]
- Improving Vision-and-Language Navigation with Image-Text Pairs from the Web, ECCV 2020 [paper]
- Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020 [paper]
- Episodic Transformer for Vision-and-Language Navigation, ICCV 2021 [paper]
- The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation, ICCV 2021 [paper]
- A Recurrent Vision-and-Language BERT for Navigation, CVPR 2021 [paper]
- SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, NeurIPS 2021 [paper]
- Airbert: In-domain Pretraining for Vision-and-Language Navigation, ICCV 2021 [paper]
- NDH-Full: Learning and Evaluating Navigational Agents on Full-Length Dialogue, EMNLP 2021 [paper]
- Shifting the Baseline: Single Modality Performance on Visual Navigation & QA, NAACL 2019 [paper]
- Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation, ACL 2019 [paper]
- Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters, BMVC 2019 [paper]
- Diagnosing the Environment Bias in Vision-and-Language Navigation, IJCAI 2020 [paper]
- Object-and-Action Aware Model for Visual Language Navigation, ECCV 2020 [paper]
- Diagnosing Vision-and-Language Navigation: What Really Matters, arXiv 2021 [paper]
- Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression, CVPR 2021 [paper]
- Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning, IEEE CAS 2021 [paper]
- SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, ICPR 2022 [Paper] [Website] [Video]
- FILM: Following Instructions in Language with Modular Methods, ICLR 2022 [Paper] [Website] [Video] [Code]
- Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue, EMNLP 2022 [Paper] [Video]
- Chasing Ghosts: Instruction Following as Bayesian State Tracking, NeurIPS 2019 [paper]
- Language and Visual Entity Relationship Graph for Agent Navigation, NeurIPS 2020 [paper]
- Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation, NeurIPS 2020 [paper]
- Topological Planning with Transformers for Vision-and-Language Navigation, CVPR 2021 [paper]
- Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning, EMNLP 2019 [paper]
- Vision-Dialog Navigation by Exploring Cross-modal Memory, CVPR 2020 [paper]
- A Recurrent Vision-and-Language BERT for Navigation, CVPR 2021 [paper]
- Scene-Intuitive Agent for Remote Embodied Visual Grounding, CVPR 2021 [paper]
- History Aware Multimodal Transformer for Vision-and-Language Navigation, NeurIPS 2021 [paper]
- Self-Monitoring Navigation Agent via Auxiliary Progress Estimation, ICLR 2019 [paper]
- Transferable Representation Learning in Vision-and-Language Navigation, ICCV 2019 [paper]
- Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks, CVPR 2020 [paper]
- Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation, ECCV 2018 [paper]
- Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation, CVPR 2019 [paper]
- Vision-Language Navigation Policy Learning and Adaptation, TPAMI 2020 [paper]
- Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation, ACL 2019 [paper]
- General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping, arXiv 2019 [paper]
- Perceive, Transform, and Act: Multi-modal Attention Networks for Vision-and-Language Navigation, arXiv 2019 [paper]
- From Language to Goals: Inverse Reinforcement Learning for Vision-based Instruction Following, arXiv 2019 [paper]
- Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision, NeurIPS 2021 [paper]
- Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning, IEEE CAS 2021 [paper]
- Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation, CVPR 2019 [paper]
- Active Visual Information Gathering for Vision-Language Navigation, ECCV 2020 [paper]
- Pathdreamer: A World Model for Indoor Navigation, ICCV 2021 [paper]
- Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation, ECCV 2018 [paper]
- Chasing Ghosts: Instruction Following as Bayesian State Tracking, NeurIPS 2019 [paper]
- Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule, ICLR 2021 [paper]
- Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation, EMNLP Findings 2020 [paper]
- Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation, ICRA 2021 [Project Page] [arXiv] [GitHub]
- Waypoint Models for Instruction-guided Navigation in Continuous Environments, ICCV 2021 [paper]
- Pathdreamer: A World Model for Indoor Navigation, ICCV 2021 [paper]
- Neighbor-view Enhanced Model for Vision and Language Navigation, arXiv 2021 [paper]
- Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments, EMNLP 2021 [paper]
- One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, arXiv 2022 [paper]
- CVDN: Vision-and-Dialog Navigation, CoRL 2019 [paper]
- Learning When and What to Ask: A Hierarchical Reinforcement Learning Framework, EMNLP 2019 [paper]
- Just Ask: An Interactive Learning Framework for Vision and Language Navigation, AAAI 2020 [paper]
- RMM: A Recursive Mental Model for Dialog Navigation, EMNLP Findings 2020 [paper]
- Self-Motivated Communication Agent for Real-World Vision-Dialog Navigation, ICCV 2021 [paper]
- TEACh: Task-driven Embodied Agents that Chat, arXiv 2021 [paper]
- A Framework for Learning to Request Rich and Contextually Useful Information from Humans, arXiv 2021 [paper]
- Speaker-Follower Models for Vision-and-Language Navigation, NeurIPS 2018 [paper]
- Multi-modal Discriminative Model for Vision-and-Language Navigation, SpLU-RoboNLP Workshop 2019 [paper]
- Transferable Representation Learning in Vision-and-Language Navigation, ICCV 2019 [paper]
- Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout, NAACL 2019 [paper]
- Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling, ECCV 2020 [paper]
- Counterfactual Vision-and-Language Navigation: Unravelling the Unseen, NeurIPS 2020 [paper]
- Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation, EACL 2021 [paper]
- Vision-Language Navigation with Random Environmental Mixup, ICCV 2021 [paper]
- On the Evaluation of Vision-and-Language Navigation Instructions, EACL 2021 [paper]
- EnvEdit: Environment Editing for Vision-and-Language Navigation, CVPR 2022 [paper]
- AIGeN: An Adversarial Approach for Instruction Generation in VLN, CVPRW 2024 [paper]
- BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps, ACL 2020 [paper]
- Curriculum Learning for Vision-and-Language Navigation, NeurIPS 2021 [paper]
- Environment-agnostic Multitask Learning for Natural Language Grounded Navigation, ECCV 2020 [paper]
- Embodied Multimodal Multitask Learning, IJCAI 2020 [paper]
- Multi-View Learning for Vision-and-Language Navigation, arXiv 2020 [paper]
- Sub-Instruction Aware Vision-and-Language Navigation, EMNLP 2020 [paper]
- Look Wide and Interpret Twice: Improving Performance on Interactive Instruction Following Tasks, arXiv 2021 [paper]
- Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation, CVPR 2019 [paper]
- Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout, NAACL 2019 [paper]
- Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation, ACL 2019 [paper]
- Counterfactual Vision-and-Language Navigation: Unravelling the Unseen, NeurIPS 2020 [paper]
- Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling, ECCV 2020 [paper]
- Topological Planning with Transformers for Vision-and-Language Navigation, CVPR 2021 [paper]
- Rethinking the Spatial Route Prior in Vision-and-Language Navigation, arXiv 2021 [paper]
- Learning to Follow Navigational Directions, ACL 2010 [paper]
- Learning to Interpret Natural Language Navigation Instructions from Observations, AAAI 2011 [paper]
- Run Through the Streets: A New Dataset and Baseline Models for Realistic Urban Navigation, EMNLP 2019 [paper]
- Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions, AAAI 2006 [paper]
- Learning to Interpret Natural Language Navigation Instructions from Observations, AAAI 2011 [paper]
- Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight, PMLR 2020 [paper]
- Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning, ICRA 2017 [paper]
- Learning to Navigate, MULEA 2019 [paper]
- Learning to Navigate in Cities Without a Map, NeurIPS 2018 [paper]
- Deep Learning for Embodied Vision Navigation: A Survey, arXiv 2021 [paper]
- Self-Supervised Object Goal Navigation with In-Situ Finetuning, IROS 2023 [paper] [video]
@InProceedings{jing2022vln,
  title     = {Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions},
  author    = {Jing Gu and Eliana Stefani and Qi Wu and Jesse Thomason and Xin Eric Wang},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
  year      = {2022}
}