Understanding how humans cooperatively utilize semantic knowledge to explore unfamiliar environments and decide on navigation directions is critical for household service multi-robot systems. Previous methods primarily focused on single-robot centralized planning strategies, which severely limited exploration efficiency. Recent research has considered decentralized planning strategies for multiple robots, assigning separate planning models to each robot, but these approaches often overlook communication costs. In this work, we propose Multimodal Chain-of-Thought Co-Navigation (MCoCoNav), a modular approach that utilizes multimodal Chain-of-Thought to plan collaborative semantic navigation for multiple robots. MCoCoNav combines visual perception with Vision Language Models (VLMs) to evaluate exploration value through probabilistic scoring, thus reducing time costs and achieving stable outputs. Additionally, a global semantic map is used as a communication bridge, minimizing communication overhead while integrating observational results. Guided by scores that reflect exploration trends, robots use this map to decide whether to explore new frontier points or revisit history nodes. Experiments on HM3D_v0.2 and MP3D demonstrate the effectiveness of our approach.

Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration
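
The choice between exploring a new frontier point and revisiting a history node, as described in the abstract above, can be illustrated with a minimal sketch. This is not the MCoCoNav implementation: the `Candidate` class, the `choose_target` function, and the `revisit_margin` rule are hypothetical and introduced only for illustration, and the scores stand in for whatever exploration values a VLM would produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str     # "frontier" or "history" (hypothetical labels)
    score: float  # exploration value in [0, 1], assumed to come from a VLM


def choose_target(candidates, revisit_margin=0.1):
    """Return the next goal, or None if there are no candidates.

    Assumed rule: prefer the best frontier unless the best history node
    outscores it by more than `revisit_margin`.
    """
    frontiers = [c for c in candidates if c.kind == "frontier"]
    history = [c for c in candidates if c.kind == "history"]
    best_frontier = max(frontiers, key=lambda c: c.score, default=None)
    best_history = max(history, key=lambda c: c.score, default=None)
    if best_frontier is None:
        return best_history
    if best_history is None:
        return best_frontier
    if best_history.score > best_frontier.score + revisit_margin:
        return best_history
    return best_frontier
```

With candidates scored 0.55 (frontier) and 0.60 (history), the sketch keeps exploring the frontier, because the history node does not exceed it by the assumed margin.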

Existing semantic segmentation methods struggle when input images are degraded by raindrops on the lens or windshield. Unlike other adverse conditions such as fog and nighttime, which degrade visual quality, raindrops not only impair visual appearance but also introduce misleading occlusion, leading to significant performance drops in current models. The novelty of our approach lies in its two-stage, dual teacher-student framework. We tackle the complex problem of raindrop degradation by dividing it into two distinct challenges: degraded visual appearance and raindrop occlusion. These challenges are addressed individually in two stages, using two pairs of teacher-student networks. This division lets the networks develop specialized expertise in each aspect of raindrop degradation and then collaborate to achieve superior performance. In the first stage, one teacher-student pair focuses on learning to extract information from visually degraded areas. Building on this, the second teacher-student pair focuses specifically on raindrop occlusion. Thus, unlike existing methods, our approach decomposes raindrop-induced degradations and addresses them collaboratively. In the second stage, we introduce a mask-based recovery technique to identify and rectify areas that likely contain misleading information, further refining the predictions. This stage also encourages both pairs to expand their knowledge by exchanging their specialized expertise. Our method achieves 60.3 mIoU on Rainy WCity and 72.8 mIoU on ACDC Rainy, improvements of +4.4 mIoU and +2.3 mIoU over the existing state-of-the-art methods, respectively.

Semantic Segmentation on Raindrop Degraded Images Using Two-Stage Dual Teacher-Student Learning
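
The results above are reported in mIoU (mean intersection over union). As a reminder of what the metric measures, here is a minimal sketch over flattened label lists. The function name `mean_iou` is an assumption made for illustration; benchmark toolkits typically accumulate counts over a whole dataset and handle ignore labels, both of which are omitted here.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Pixel belongs to both the predicted and the true region.
            inter[p] += 1
            union[p] += 1
        else:
            # Pixel belongs to the union of two different classes.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)
```

For `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.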

Humans naturally rely on floor plans to navigate unfamiliar environments because they are easy to access, consistently reliable, and rich in geometric detail. However, existing visual navigation settings neglect this informative prior knowledge, limiting their efficiency and accuracy. To close this gap, we introduce a novel navigation task: Floor Plan Visual Navigation (FloNa), the first attempt to incorporate floor plans into embodied visual navigation. Although the floor plan offers informative knowledge, two additional challenges arise: (1) how to align observed images with the floor plan sketch, given their different modalities, and (2) how to avoid collisions with scene furniture that the floor plan does not depict. To address these issues, we propose FloDiff, a novel diffusion policy framework incorporating a localization module to facilitate the matching between the current observation and the floor plan. We further collect 20k navigation episodes from 117 scenes in the iGibson simulator to support training and evaluation. Extensive experiments show the effectiveness and efficiency of our proposed framework in unfamiliar scenes using floor plan knowledge.

FloNa: Floor Plan Guided Embodied Visual Navigation

As climate change reshapes global weather patterns, the increasing frequency and intensity of extreme rainfall events have amplified the safety imperatives for autonomous driving systems. During such events, rainfall can escalate from heavy to violent, as defined by the World Meteorological Organization, severely impairing images with diverse and significant degradations. Many existing semantic segmentation models perform well under light to heavy rain, but there is a notable absence of datasets addressing violent rain conditions for these models to validate and learn from. In this paper, we introduce the Extreme RainFall (ERF) dataset for semantic segmentation in both image and video tasks under violent rain conditions. Our dataset comprises 14,757 unlabeled frames and 100 labeled frames, all captured during four different violent rainfall periods. We use our dataset to evaluate the robustness of various methods against violent rainfall, focusing on four approaches: 1) image-based foundation models, 2) image-based domain generalization methods, 3) image-based domain adaptation methods, and 4) video-based methods. The results reveal that none of the existing models tested is capable of withstanding the extreme challenges posed by violent rainfall conditions. By analyzing the results, we offer insights and suggestions for developing more robust models under extreme rainfall events.

ERF: A Benchmark Dataset for Robust Semantic Segmentation Under Extreme Rainfall Conditions

Inferring the 3D structure of a scene from a single image is an ill-posed and challenging problem in the field of vision-centric autonomous driving. Existing methods usually employ neural radiance fields to produce voxelized 3D occupancy, lacking instance-level semantic reasoning and temporal photometric consistency. In this paper, we propose ViPOcc, which leverages the visual priors from vision foundation models (VFMs) for fine-grained 3D occupancy prediction. Unlike previous works that solely employ volume rendering for RGB and depth image reconstruction, we introduce a metric depth estimation branch, in which an inverse depth alignment module is proposed to bridge the domain gap in depth distribution between VFM predictions and the ground truth. The recovered metric depth is then utilized in temporal photometric alignment and spatial geometric alignment to ensure accurate and consistent 3D occupancy prediction. We also propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling, which addresses the redundant and imbalanced sampling issue that persists in previous state-of-the-art methods. Extensive experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks on the KITTI-360 and KITTI Raw datasets. Our source code will be released upon publication.

ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction
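
To give a feel for what aligning predicted depth with ground truth in inverse-depth space can involve, the sketch below fits a global scale and shift by ordinary least squares. This is a generic, commonly used alignment and is not claimed to be the module proposed in ViPOcc; the function names `fit_scale_shift` and `to_metric_depth` are hypothetical.

```python
def fit_scale_shift(pred_inv, gt_depth):
    """Least-squares fit of scale and shift so that
    scale * pred_inv + shift approximates 1 / gt_depth."""
    if len(pred_inv) != len(gt_depth) or len(pred_inv) < 2:
        raise ValueError("need at least two matched samples")
    gt_inv = [1.0 / z for z in gt_depth]
    n = len(pred_inv)
    mean_x = sum(pred_inv) / n
    mean_y = sum(gt_inv) / n
    var_x = sum((x - mean_x) ** 2 for x in pred_inv)
    if var_x == 0.0:
        raise ValueError("predictions are constant; scale is undefined")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(pred_inv, gt_inv))
    scale = cov / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


def to_metric_depth(pred_inv, scale, shift, eps=1e-6):
    """Apply the fitted alignment and invert back to depth."""
    return [1.0 / max(scale * x + shift, eps) for x in pred_inv]
```

In practice the fit would be restricted to pixels with valid ground truth (for example, sparse LiDAR returns), which this sketch leaves to the caller.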

3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem by learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and that of the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. In addition, we construct the Multi-Image and Point Affordance (MIPA) benchmark, on which our method outperforms existing state-of-the-art methods across various experimental comparisons. The code and dataset will be released upon acceptance.

Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

Multi-robot task planning and collaboration are crucial challenges in robotics. While Behavior Trees (BTs) have been established as a popular control architecture and are plannable for a single robot, the development of effective multi-robot BT planning algorithms remains challenging due to the complexity of coordinating diverse action spaces. We propose the Multi-Robot Behavior Tree Planning (MRBTP) algorithm, with theoretical guarantees of both soundness and completeness. MRBTP features cross-tree expansion to coordinate heterogeneous actions across different BTs to achieve the team's goal. For homogeneous actions, we retain backup structures among BTs to ensure robustness and prevent redundant execution through intention sharing. While MRBTP is capable of generating BTs for both homogeneous and heterogeneous robot teams, its efficiency can be further improved. We then propose an optional plugin for MRBTP when Large Language Models (LLMs) are available to reason goal-related actions for each robot. These relevant actions can be pre-planned to form long-horizon subtrees, significantly enhancing the planning speed and collaboration efficiency of MRBTP. We evaluate our algorithm in warehouse management and everyday service scenarios. Results demonstrate MRBTP's robustness and execution efficiency under varying settings, as well as the ability of the pre-trained LLM to generate effective task-specific subtrees for MRBTP.

MRBTP: Efficient Multi-Robot Behavior Tree Planning and Collaboration
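
For readers unfamiliar with Behavior Trees, the sketch below shows the standard Sequence and Fallback semantics on a single-robot toy task. It is not the MRBTP algorithm: the class names, the `build_delivery_tree` helper, and the facts `holding_box` and `box_delivered` are all hypothetical, and the `RUNNING` status used by real BT executors is omitted for brevity.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"


class Condition:
    """Leaf that succeeds when a fact holds in the shared state."""
    def __init__(self, fact):
        self.fact = fact

    def tick(self, state):
        return Status.SUCCESS if state.get(self.fact, False) else Status.FAILURE


class Action:
    """Leaf that (in this toy model) always succeeds and adds one fact."""
    def __init__(self, name, adds):
        self.name = name
        self.adds = adds

    def tick(self, state):
        state[self.adds] = True
        return Status.SUCCESS


class Sequence:
    """Ticks children in order; fails as soon as one child fails."""
    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            if child.tick(state) is Status.FAILURE:
                return Status.FAILURE
        return Status.SUCCESS


class Fallback:
    """Ticks children in order; succeeds as soon as one child succeeds."""
    def __init__(self, *children):
        self.children = children

    def tick(self, state):
        for child in self.children:
            if child.tick(state) is Status.SUCCESS:
                return Status.SUCCESS
        return Status.FAILURE


def build_delivery_tree():
    # Goal check first; otherwise try the action once its precondition holds.
    return Fallback(
        Condition("box_delivered"),
        Sequence(Condition("holding_box"), Action("deliver_box", adds="box_delivered")),
    )
```

Ticking the tree with `{"holding_box": True}` runs the action and marks the goal as achieved, while an empty state fails because the precondition is not met.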

Natural language is the most intuitive means for humans to interact with robots, making task planning based on natural language commands a longstanding area of research. Large language models (LLMs) have significantly improved task planning by enhancing understanding of language and common sense. However, current methods still face several challenges: they lack a deep understanding of physical environments, their performance relies heavily on prompt examples, LLMs are oversized and not customized for specific tasks, and the planning costs remain high. 
To overcome these issues, we introduce the GNN-Transformer Task Planner (GTTP), designed to predict task-level actions by leveraging the semantic environment and incorporating historical state data. The GTTP architecture is scalable through the use of GNN layers, while transformer layers facilitate understanding task progression. In addition, our model uses a text encoder to embed environments, allowing it to be trained on simulated datasets and applied directly in real-world scenarios. We also propose an automated data generation method that includes semantic augmentation, planning verification, and instruction generation via LLM. This method enables the collection of 14k instruction-annotated tasks in the VirtualHome environment with minimal human effort. The model has been validated across diverse scenes containing up to 715 objects, achieving significantly higher success rates compared to baseline models. It has also been successfully deployed on a physical mobile manipulator, demonstrating its practical applicability and effectiveness. Our datasets and code will be made publicly available.

GNN-Transformer Task Planning Enhanced with Semantic-Driven Data Augmentation

Learning a perception and reasoning module for robotic assistants to plan the steps of complex tasks from natural language instructions often requires large amounts of free-form language annotation, especially for short high-level instructions. To reduce the cost of annotation, large language models (LLMs) are used as planners with little data. However, when elaborating the steps, even state-of-the-art LLM-based planners rely mostly on linguistic common sense and often neglect the status of the environment at command reception, resulting in inappropriate plans. To generate plans grounded in the environment, we propose FLARE (Few-shot Language with environmental Adaptive Replanning Embodied agent), which improves task planning using both language command and environmental perception. As language instructions often contain ambiguities or incorrect expressions, we additionally propose to correct the mistakes using visual cues from the agent. Thanks to the visual cues, the proposed scheme needs only a few language pairs and significantly outperforms state-of-the-art approaches, more than doubling the success rate in unseen environments of the ALFRED benchmark (16.42% to 40.88%).

Multi-Modal Grounded Planning and Efficient Replanning for Learning Embodied Agents with a Few Examples

Image-guided object assembly represents a burgeoning research topic in computer vision. This paper introduces a novel task: translating multi-view images of a structural 3D model (for example, one constructed with building blocks drawn from a 3D-object library) into a detailed sequence of assembly instructions executable by a robotic arm. Fed with multi-view images of the target 3D model for replication, the model designed for this task must address several sub-tasks, including recognizing individual components used in constructing the 3D model, estimating the geometric pose of each component, and deducing a feasible assembly order adhering to physical rules. Establishing accurate 2D-3D correspondence between multi-view images and 3D objects is technically challenging. To tackle this, we propose an end-to-end model known as the Neural Assembler. This model learns an object graph where each vertex represents recognized components from the images, and the edges specify the topology of the 3D model, enabling the derivation of an assembly plan. We establish benchmarks for this task and conduct comprehensive empirical evaluations of Neural Assembler and alternative solutions. Our experiments clearly demonstrate the superiority of Neural Assembler.

Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images

Enabling humanoid robots to perform long-horizon mobile manipulation planning in real-world environments based on embodied perception and comprehension abilities has been a longstanding challenge. 
With the recent rise of large language models (LLMs), there has been a notable increase in the development of LLM-based planners. These approaches either utilize human-provided textual representations of the real world or heavily depend on prompt engineering to extract such representations, lacking the capability to quantitatively understand the environment, such as determining the feasibility of manipulating objects.
To address these limitations, we present the Instruction-Augmented Long-Horizon Planning (IALP) system, a novel framework that employs LLMs to generate feasible and optimal actions based on real-time sensor feedback, including grounded knowledge of the environment, in a closed-loop interaction. Distinct from prior works, our approach augments user instructions into PDDL problems by leveraging both the abstract reasoning capabilities of LLMs and grounding mechanisms.
In experiments on various real-world long-horizon tasks, each consisting of seven distinct manipulation skills, the IALP system solves these tasks efficiently with an average success rate exceeding 80%.
Our proposed method can operate as a high-level planner for open-world long-horizon tasks, equipping robots with substantial autonomy in unstructured environments through the utilization of multi-modal sensor inputs.

Instruction-Augmented Long-Horizon Planning: Embedding Grounding Mechanisms in Embodied Mobile Manipulation

Humans navigate unfamiliar environments effectively using episodic simulation and episodic memory. Building an imagination-based memory, analogous to episodic simulation and episodic memory, enhances embodied agents' comprehension of the interrelationship between environments and objects, thereby improving their vision-and-language navigation (VLN) performance. However, existing agents lack such a mechanism. We propose the first architecture to help robots build a maintainable and fine-grained imaginative memory system. Specifically, the agent maintains a reality-imagination hybrid global memory during navigation and expands the memory map through imaginative mechanisms and navigation actions. Correspondingly, we design a series of pre-training tasks to help the agent acquire fine-grained imaginative abilities. Our agents improve the best success rate (SR) by 7% while simultaneously generating high-fidelity imagery with low Pearson correlation coefficients.

Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation