From 267cb67e862f602f06e2f46dad1a1d9d4eef678b Mon Sep 17 00:00:00 2001
From: francescotaioli
Date: Mon, 2 Dec 2024 17:22:34 +0100
Subject: [PATCH] change page format

---
 index.html | 93 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 80 insertions(+), 13 deletions(-)

diff --git a/index.html b/index.html
index b552069..acac976 100644
--- a/index.html
+++ b/index.html
@@ -98,37 +98,50 @@

-              Francesco Taioli1,2, Edoardo Zorzi2, Gianni Franchi3,
-              Alberto Castellini2,
-              Alessandro Farinelli2, Marco Cristani2, Yiming Wang4
@@ -205,7 +218,7 @@

alt="Teaser" />
-

+

              Sketched episode of the proposed Collaborative Instance Navigation (CoIN) task. The
              human user (bottom left) provides a request (
-              >, producing a refined detection description. TheInteraction Trigger uses this refined
+              >. The Interaction Trigger uses this refined
              description to decide whether to pose a question to the user (①,③,④), continue the
              navigation (②) or halt the exploration (⑤).
@@ -247,11 +260,65 @@

+
+
+
+
+
+
+
+
+

Abstract

+
+

+    Existing embodied instance goal navigation tasks, driven
+    by natural language, assume that human users provide
+    complete and nuanced instance descriptions prior to
+    navigation, which can be impractical in the real world,
+    as human instructions might be brief and ambiguous.
+

+

+    To bridge this gap, we propose a new task,
+    Collaborative Instance Navigation (CoIN), with dynamic
+    agent-human interaction during navigation to actively
+    resolve uncertainties about the target instance in
+    natural, template-free, open-ended dialogues.
+
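To make the task setup concrete, a minimal sketch of a CoIN-style episode loop follows; the `agent`, `user`, and `env` interfaces are hypothetical illustrations, not the paper's or this page's code.

```python
# Minimal sketch of a CoIN-style episode loop (hypothetical interfaces):
# the agent starts from a brief, possibly ambiguous request and may ask the
# user open-ended questions while it navigates.
def run_coin_episode(agent, user, env, max_steps: int = 500) -> bool:
    request = user.initial_request()          # brief, possibly ambiguous
    obs = env.reset()
    for _ in range(max_steps):
        decision = agent.step(obs, request)   # navigation + interaction decision
        if decision.question is not None:     # open-ended, template-free question
            request += " " + user.answer(decision.question)
        if decision.halt:                     # agent believes the target is reached
            return env.target_found()
        obs = env.step(decision.nav_action)
    return False                              # episode truncated without success
```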

+

+    To address CoIN, we propose a novel method, Agent-user
+    Interaction with UncerTainty Awareness (AIUTA), leveraging
+    the perception capability of Vision Language Models (VLMs)
+    and the capability of Large Language Models (LLMs). First,
+    upon object detection, a Self-Questioner model initiates a
+    self-dialogue to obtain a complete and accurate observation
+    description, while a novel uncertainty estimation technique
+    mitigates inaccurate VLM perception. Then, an Interaction
+    Trigger module determines whether to ask a question to the
+    user, continue or halt navigation, minimizing user input.
+
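As a rough illustration of the two components described above, here is a hedged sketch with hypothetical `vlm`/`llm` interfaces and a placeholder confidence filter; it is not the authors' implementation, and the paper's uncertainty estimation technique is only stubbed out.

```python
# Rough sketch of the AIUTA components (hypothetical `vlm`/`llm` interfaces).
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    ASK_USER = auto()   # pose a clarifying question to the user
    CONTINUE = auto()   # keep exploring
    HALT = auto()       # stop: the detected object matches the request


@dataclass
class Decision:
    action: Action
    question: str | None = None


def self_questioner(vlm, llm, observation) -> str:
    """Self-dialogue upon object detection: the LLM queries the VLM for extra
    attributes; low-confidence answers are dropped (a placeholder for the
    paper's uncertainty estimation technique)."""
    description = vlm.describe(observation)
    for attribute_question in llm.generate_attribute_questions(description):
        answer, confidence = vlm.answer(observation, attribute_question)
        if confidence >= 0.5:                  # illustrative threshold only
            description += " " + answer
    return description


def interaction_trigger(llm, description, user_request) -> Decision:
    """Decide whether to ask the user, continue navigating, or halt, based on
    how the refined description relates to the (possibly partial) request."""
    verdict = llm.compare(description, user_request)  # e.g. "match" / "mismatch" / "unsure"
    if verdict == "unsure":
        question = llm.generate_user_question(description, user_request)
        return Decision(Action.ASK_USER, question)
    if verdict == "match":
        return Decision(Action.HALT)
    return Decision(Action.CONTINUE)
```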

+
+

+    For evaluation, we introduce CoIN-Bench, a benchmark
+    supporting both real and simulated humans. AIUTA achieves
+    competitive performance in instance navigation against
+    state-of-the-art methods, demonstrating great flexibility
+    in handling user inputs.
+
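One way a simulated human could be approximated is by prompting an LLM with the full ground-truth instance description; the sketch below uses a hypothetical `llm.generate` call and is not CoIN-Bench's actual interface.

```python
# Hedged sketch of a simulated human for evaluation (hypothetical interface):
# an LLM answers the agent's open-ended questions from the full ground-truth
# instance description, revealing only what is asked.
def simulated_user_answer(llm, agent_question: str, gt_description: str) -> str:
    prompt = (
        "You want a robot to find this object for you:\n"
        f"{gt_description}\n\n"
        f"The robot asks: {agent_question}\n"
        "Reply briefly, revealing only what the robot asked about."
    )
    return llm.generate(prompt)
```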

+
+
+
+
+
+
+
+
+
-
-
+
+
-

Video

+

Method