Natural-language instance navigation becomes challenging when the initial user request does not uniquely specify the target instance. A practical agent should reduce the user's burden by asking only for the information needed to distinguish the target from similar distractors, rather than requiring a detailed description upfront.
We propose Proactive Instance Navigation with Comparative Judgment (ProCompNav), a two-stage framework that first constructs a candidate pool and then identifies the target through comparative judgment: at each round it extracts an attribute-value pair that splits the current pool, asks a binary yes/no question, and prunes all inconsistent candidates at once.
This reframes disambiguation from open-ended target description to pool-level discriminative questioning, where each question is chosen to narrow the candidate set. On CoIN-Bench, ProCompNav improves Success Rate (SR) over both interactive baselines given the same minimal input and non-interactive baselines given detailed descriptions, while substantially reducing Response Length; on TextNav, it also achieves the highest SR, suggesting that comparative judgment is broadly useful for instance-level navigation among similar distractors.
Prior work on collaborative instance navigation typically follows independent matching, scoring each candidate against accumulated facts. A natural way to reduce premature decisions is to defer the choice until several candidates are collected (pooled independent matching). ProCompNav goes further with comparative judgment: at each round it picks an attribute-value pair that splits the current pool and asks a single yes/no question to prune it.
Figure 1. Three strategies for instance navigation under an ambiguous user query. (a) Independent Matching scores each candidate independently, which can cause premature commitment to a distractor that shares attributes with the true target. (b) Pooled Independent Matching defers the decision until multiple candidates are collected, but its non-discriminative questions still fail to separate similar distractors and impose a high user burden. (c) Comparative Judgment (Ours) proactively builds a candidate pool and asks binary questions about discriminative attributes derived from candidate contrasts, accurately identifying the target with minimal user burden.
Table 1 isolates the contribution of each ingredient on CoIN-Bench. Adding a candidate pool to independent matching (row b) lifts SR but inflates user Response Length roughly 4× (from 110–130 to 460–520); replacing independent matching with comparative judgment (row c) lifts SR further while collapsing Response Length from 460–520 to 4.2–4.3 tokens.
| Decision Strategy | Pool Stage | Compare Stage | Val Seen SR↑ | Val Seen RL↓ | Val Seen NQ↓ | Val Seen Synonyms SR↑ | Val Seen Synonyms RL↓ | Val Seen Synonyms NQ↓ | Val Unseen SR↑ | Val Unseen RL↓ | Val Unseen NQ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Independent Matching [Taioli et al., 2025] | ✗ | ✗ | 10.5 | 109.5 | 1.2 | 15.3 | 129.2 | 1.2 | 8.9 | 122.8 | 1.3 |
| (b) Pooled Independent Matching | ✓ | ✗ | 17.5 | 460.2 | 3.6 | 22.0 | 519.2 | 3.7 | 13.3 | 467.7 | 3.4 |
| (c) Comparative Judgment (Ours) | ✓ | ✓ | 23.7 | 4.2 | 2.2 | 28.1 | 4.3 | 2.2 | 17.0 | 4.2 | 2.3 |
Table 1. Comparison of three disambiguation strategies on CoIN-Bench. We report Success Rate (SR), average total Response Length (RL), and average Number of Questions (NQ) per episode.
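The three metrics in Table 1 can be computed from per-episode logs roughly as follows. This is a minimal sketch, not the benchmark's evaluation code: the `Episode` record and `summarize` helper are hypothetical, one user reply per question is assumed, and response length is counted in whitespace-separated words as a stand-in because the tokenizer is not specified here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    """Hypothetical per-episode log: outcome plus the user's replies."""
    success: bool
    responses: List[str]  # assumed: exactly one reply per question asked


def summarize(episodes: List[Episode]) -> Tuple[float, float, float]:
    """Return (SR in percent, mean total RL, mean NQ) over all episodes."""
    n = len(episodes)
    sr = 100.0 * sum(e.success for e in episodes) / n
    # Total response length per episode, averaged over episodes.
    # Word count is an assumption standing in for the real token count.
    rl = sum(sum(len(r.split()) for r in e.responses) for e in episodes) / n
    nq = sum(len(e.responses) for e in episodes) / n
    return sr, rl, nq


episodes = [
    Episode(success=True, responses=["yes", "no"]),
    Episode(success=False, responses=["no", "yes", "yes"]),
]
sr, rl, nq = summarize(episodes)
```

With binary questions each reply is a single word, so RL tracks NQ closely; free-form descriptions break that link.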
ProCompNav runs in two stages. In the Pool Construction Stage, the agent explores the environment and aggregates multi-view evidence for each detected instance of the target category c into a candidate pool, storing a multi-view collage and a description per candidate. Once the pool contains N_min candidates, it transitions to the Recursive Comparison Stage (Fig. 2), which prunes the pool one binary question at a time.
Figure 2. Recursive Comparative Judgment. At iteration t, ProCompNav splits the candidate pool U_t into a core set G_c and a remainder set G_r by similarity. It then identifies a discriminative attribute a_t* that is common in G_c but not in G_r. Finally, it asks whether the target has a_t* and prunes the pool according to the user's response, yielding the next candidate pool U_{t+1}.
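The ask-and-prune loop can be illustrated with a small sketch. It is a simplification under stated assumptions, not the ProCompNav implementation: candidates are hypothetical attribute dictionaries rather than MLLM descriptions, and the similarity-based split into G_c and G_r is replaced by a simpler heuristic that picks the attribute-value pair giving the most balanced yes/no split. The names `pick_question`, `prune`, and `identify` are illustrative only.

```python
from typing import Callable, Dict, List, Optional, Tuple

Candidate = Dict[str, str]   # hypothetical: attribute name -> value
Question = Tuple[str, str]   # (attribute, value), asked as a yes/no question


def pick_question(pool: List[Candidate]) -> Optional[Question]:
    """Pick the attribute-value pair whose yes/no split is most balanced.

    Assumption: a balanced split stands in for the similarity-based
    core/remainder contrast described in Figure 2.
    """
    best: Optional[Question] = None
    best_gap: Optional[int] = None
    pairs = sorted({(a, v) for c in pool for a, v in c.items()})
    for attr, value in pairs:
        yes = sum(1 for c in pool if c.get(attr) == value)
        if yes in (0, len(pool)):
            continue  # shared by none or all: cannot split the pool
        gap = abs(2 * yes - len(pool))
        if best_gap is None or gap < best_gap:
            best, best_gap = (attr, value), gap
    return best


def prune(pool: List[Candidate], question: Question, answer_is_yes: bool) -> List[Candidate]:
    """Drop every candidate that is inconsistent with the user's answer."""
    attr, value = question
    return [c for c in pool if (c.get(attr) == value) == answer_is_yes]


def identify(pool: List[Candidate], ask: Callable[[Question], bool]) -> Tuple[List[Candidate], int]:
    """Ask binary questions until one candidate remains or none can split."""
    asked = 0
    while len(pool) > 1:
        question = pick_question(pool)
        if question is None:
            break  # remaining candidates are indistinguishable
        pool = prune(pool, question, ask(question))
        asked += 1
    return pool, asked


# Toy pool of three dressers; the simulated user answers truthfully.
pool = [
    {"color": "white", "nearby": "red box"},
    {"color": "white", "nearby": "lamp"},
    {"color": "brown", "nearby": "lamp"},
]
target = pool[0]
remaining, asked = identify(pool, lambda q: target.get(q[0]) == q[1])
```

Each answer removes every inconsistent candidate at once, so the number of questions grows with the number of distinctions needed rather than with the length of a free-form description.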
ProCompNav achieves the highest SR on all CoIN-Bench splits with only category-level input. Relative to AIUTA*, SR improves by +125.7% on Val Seen, +83.7% on Val Seen Synonyms, and +91.0% on Val Unseen (relative gains over the AIUTA* SR). It also surpasses, in SR, the training-free baselines that consume a detailed description of the target (3D-Mem*, Context-Nav) on every split, despite using only the category; its SPL is not uniformly higher than theirs (for example, 3D-Mem* reaches a higher SPL on Val Seen).
| Method | Judge | Interact | Val Seen SR↑ | Val Seen SPL↑ | Val Seen Synonyms SR↑ | Val Seen Synonyms SPL↑ | Val Unseen SR↑ | Val Unseen SPL↑ |
|---|---|---|---|---|---|---|---|---|
| Training-based | ||||||||
| Monolithic-GOAT [Khanna et al., 2024]† | – | ✗ | 6.6 | 3.1 | 13.1 | 6.5 | 0.2 | 0.1 |
| PSL [Sun et al., 2024]† | – | ✗ | 8.8 | 3.3 | 8.9 | 2.8 | 4.6 | 1.4 |
| Training-free (description input) | ||||||||
| 3D-Mem* [Yang et al., 2025] | Comp | ✗ | 15.9 | 11.0 | 18.9 | 13.6 | 5.7 | 3.9 |
| Context-Nav [Jang and Kim, 2026]‡ | Indep | ✗ | 13.5 | 6.7 | 20.3 | 10.9 | 11.3 | 5.2 |
| Training-free (category input) | ||||||||
| VLFM [Yokoyama et al., 2024]† | Indep | ✗ | 0.4 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| AIUTA [Taioli et al., 2025]† | Indep | ✓ | 7.4 | 2.9 | 14.4 | 8.0 | 6.7 | 2.3 |
| AIUTA* | Indep | ✓ | 10.5 | 4.5 | 15.3 | 8.4 | 8.9 | 4.0 |
| Pooled Independent Matching | Indep | ✓ | 17.5 | 5.1 | 22.0 | 8.1 | 13.3 | 5.1 |
| ProCompNav (Ours) | Comp | ✓ | 23.7 | 7.0 | 28.1 | 8.5 | 17.0 | 6.2 |
Table 2. Results on CoIN-Bench. Indep/Comp refer to independent/comparative judgment. *Reproduced using the same MLLM and LLM as ProCompNav. †Taken from [Taioli et al., 2025]. ‡Taken from [Jang and Kim, 2026].
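The relative SR gains over AIUTA* quoted for Table 2 follow directly from the tabulated values. The short check below reproduces them; the helper name `relative_gain` is illustrative, and the numbers are copied from the ProCompNav and AIUTA* rows.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# (ProCompNav SR, AIUTA* SR) per split, taken from Table 2.
sr = {
    "Val Seen": (23.7, 10.5),
    "Val Seen Synonyms": (28.1, 15.3),
    "Val Unseen": (17.0, 8.9),
}
gains = {split: round(relative_gain(ours, base), 1) for split, (ours, base) in sr.items()}
for split, gain in gains.items():
    print(f"{split}: +{gain}%")
```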
AIUTA often terminates at one of two extremes: either very early, suggesting premature commitment to a plausible distractor, or only at the maximum step, suggesting failure to disambiguate the target. In contrast, ProCompNav most frequently terminates after collecting candidates and rarely waits until the maximum step. The cumulative success curves show that ProCompNav reaches 108 successes by the 200–299 bin, already exceeding AIUTA's final total of 87, and continues to accumulate successes in later bins. Overall, ProCompNav ends with more than twice as many successes as AIUTA.
Figure 3. Termination-step analysis of AIUTA and ProCompNav. The x-axis shows termination steps in 100-step bins, except the max exploration step; bars (left y-axis) show number of terminated episodes, and lines (right y-axis) show cumulative number of successful episodes.
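The binning behind a plot like Figure 3 might look like the sketch below. It assumes episodes are available as hypothetical `(termination_step, success)` pairs and that the maximum exploration step is a multiple of the bin width; the value `max_step=500` is a placeholder for illustration, not the benchmark's setting.

```python
from itertools import accumulate
from typing import List, Tuple


def bin_terminations(
    episodes: List[Tuple[int, bool]], max_step: int, bin_width: int = 100
) -> Tuple[List[int], List[int]]:
    """Count terminated episodes per bin and cumulative successes.

    Bins are [0, 99], [100, 199], ...; the final slot holds episodes that
    ran until `max_step`. Assumes `max_step` is a multiple of `bin_width`.
    """
    n_bins = max_step // bin_width
    ended = [0] * (n_bins + 1)
    succeeded = [0] * (n_bins + 1)
    for step, success in episodes:
        idx = n_bins if step >= max_step else step // bin_width
        ended[idx] += 1
        succeeded[idx] += int(success)
    return ended, list(accumulate(succeeded))


# Made-up episodes, purely to exercise the function.
episodes = [(40, False), (150, True), (260, True), (270, False), (500, False), (500, True)]
ended, cumulative_success = bin_terminations(episodes, max_step=500)
```

The bar heights correspond to `ended` and the line to `cumulative_success`, so an early spike in `ended` without a matching rise in `cumulative_success` indicates premature commitment.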
Although ProCompNav is designed for interactive instance navigation, it generalises to the non-interactive Text-Goal Instance Navigation (TextNav) benchmark: instead of asking the user, it extracts attributes directly from the detailed target description provided at episode start. ProCompNav achieves the highest SR (28.5%). This suggests that its core mechanism, deferring the target decision until a candidate pool is constructed and then pruning it with discriminative attributes, is also effective in the non-interactive setting, where a detailed description is available but candidates still need to be compared against each other rather than scored independently.
| Method | Judgment | SR↑ | SPL↑ |
|---|---|---|---|
| Training-based | |||
| Modular-GOAT [Khanna et al., 2024]† | – | 17.0 | 8.8 |
| PSL [Sun et al., 2024]† | – | 16.5 | 7.5 |
| Training-free | |||
| UniGoal [Yin et al., 2025]† | Indep | 20.2 | 11.4 |
| AIUTA* [Taioli et al., 2025] | Indep | 10.9 | 2.7 |
| 3D-Mem* [Yang et al., 2025] | Comp | 14.1 | 9.6 |
| Context-Nav [Jang and Kim, 2026]‡ | Indep | 26.2 | 9.1 |
| ProCompNav (Ours) | Comp | 28.5 | 6.9 |
Table 3. Performance on TextNav. In the Judgment column, Indep/Comp indicate whether each method uses independent matching or comparative judgment for target disambiguation. Results denoted by † are taken from [Yin et al., 2025], and those denoted by ‡ are taken from [Jang and Kim, 2026]. Methods denoted by * are our reproductions using the same MLLM and LLM as ProCompNav.
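In the non-interactive setting the description plays the role of the user. A minimal sketch of that substitution is given below, assuming the description has already been parsed into a hypothetical attribute dictionary and that candidates use the same attribute vocabulary; `prune_with_description` is an illustrative name, not part of the released code.

```python
from typing import Dict, List

Candidate = Dict[str, str]  # hypothetical: attribute name -> value


def prune_with_description(pool: List[Candidate], described: Dict[str, str]) -> List[Candidate]:
    """Prune the pool using only attributes that actually split it.

    Attributes shared by every remaining candidate are skipped, and an
    attribute that no candidate matches is ignored instead of emptying the
    pool (an assumption made here to tolerate noisy descriptions).
    """
    for attr, value in sorted(described.items()):
        matching = [c for c in pool if c.get(attr) == value]
        if matching and len(matching) < len(pool):
            pool = matching
    return pool


pool = [
    {"color": "white", "material": "wood", "nearby": "red box"},
    {"color": "white", "material": "wood", "nearby": "lamp"},
    {"color": "brown", "material": "wood", "nearby": "lamp"},
]
described = {"color": "white", "material": "wood", "nearby": "red box"}
remaining = prune_with_description(pool, described)
```

Here `material` never prunes anything because all candidates share it, which is the sense in which only discriminative attributes matter even when a full description is available.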
Under the minimally specified query "Find the dresser", the independent-matching baseline asks the user to describe an attribute of the intended target without knowing whether that attribute distinguishes it from distractors; this burdens the user with a long response and can lead the robot to mistakenly match a distractor sharing the same attribute. In contrast, ProCompNav asks about discriminative attributes (e.g., "Is there a red box nearby?") in a way that elicits only short user responses, reducing verbal burden while correctly isolating the user-intended target.
Figure 4. Qualitative comparison of Independent Matching and Comparative Judgment under a minimally specified query ("Find the dresser"). Independent Matching (left) asks the user to describe an attribute of the intended target without knowing whether that attribute distinguishes it from distractors. This burdens the user with a long response and can lead the robot to mistakenly match a distractor sharing the same attribute. In contrast, Comparative Judgment (right) asks about discriminative attributes in a way that elicits only short user responses, reducing verbal burden while correctly isolating the user-intended target.
@article{kwon2026proactive,
  title={Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries},
  author={Kwon, Junhyuk and Lee, Seungjoon and Park, Hyejin and Min, Kyle and Ok, Jungseul},
  journal={arXiv preprint arXiv:2605.06223},
  year={2026}
}
This codebase is built upon VLFM and AIUTA / CoIN. We thank the authors for their excellent work and for releasing their code and benchmarks.