# ⚙️ Multimodal-Embodied-AI

This repository collects papers, benchmarks, and datasets at the intersection of multimodal learning, embodied AI, and robotics.

It is continuously updated 🔥🔥. Contributions and suggestions from the community are very welcome.

## 📋 Table of Contents

- 📄 Papers
  - Perception
  - Reasoning
  - Planning
  - Control (Manipulation / Navigation)
- 📊 Benchmarks and Datasets
  - Perception
  - Reasoning
  - Planning
  - Control (Manipulation / Navigation)

## 📄 Papers

### Perception

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning | NeurIPS 2025 | Page | Github |
| From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D | NeurIPS 2025 | Page | Github |
| RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion | NeurIPS 2025 | Page | Github |
| AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies | NeurIPS 2025 | Page | Github |
| EmbodiedSAM: Online Segment Any 3D Thing in Real Time | ICLR 2025 | Page | Github |
| GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision | ICLR 2025 | - | Github |
| RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics | CVPR 2025 | Page | Github |
| Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities | ICLR 2025 | Page | Github |
| Pre-training Auto-regressive Robotic Models with 4D Representations | ICML 2025 | Page | Github |
| Hearing the Slide: Acoustic-Guided Constraint Learning for Fast Non-Prehensile Transport | CASE 2025 | Page | - |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | arXiv 2025 | Page | Github |
| Igniting VLMs toward the Embodied Space | arXiv 2025 | Page | Github |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL 2024 | Page | Github |
| SonicBoom: Contact Localization Using Array of Microphones | RA-L 2024 | Page | Github |
| Embodied Uncertainty-Aware Object Segmentation | IROS 2024 | Page | Github |
| RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models | ICML 2024 | Page | Github |
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | CVPR 2024 | Page | Github |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | CVPR 2024 | Page | Github |
| What's Left? Concept Grounding with Logic-Enhanced Foundation Models | NeurIPS 2023 | Page | Github |
| PACO: Parts and Attributes of Common Objects | CVPR 2023 | - | Github |
| Interactron: Embodied Adaptive Object Detection | CVPR 2022 | - | Github |
| Visuo-Acoustic Hand Pose and Contact Estimation | arXiv 2025 | - | - |
| Multimodal Perception for Goal-oriented Navigation: A Survey | arXiv 2025 | - | - |

### Reasoning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| Gemini Robotics: Bringing AI into the Physical World | Technical report | - | - |
| MolmoAct: Action Reasoning Models that can Reason in Space | Technical report | Page | Github |
| ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | NeurIPS 2025 | Page | - |
| Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning | Technical report | Page | Github |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | NeurIPS 2025 | Page | Github |
| RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | NeurIPS 2025 | Page | Github |
| Magma: A Foundation Model for Multimodal AI Agents | CVPR 2025 | Page | Github |
| RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete | CVPR 2025 | Page | Github |
| RoboBrain 2.0 Technical Report | Technical report | Page | Github |
| Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning | arXiv 2025 | Page | Github |
| Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation | arXiv 2025 | Page | Github |
| PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability | CVPR 2025 | - | Github |
| ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | ICRA 2025 | Page | Github |
| InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning | ACL 2025 | Page | Github |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | ICRA 2024 | Page | Github |
| SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | NeurIPS 2024 | Page | Github |
| Multi-modal Situated Reasoning in 3D Scenes | NeurIPS 2024 | Page | Github |
| EQA-MX: Embodied Question Answering using Multimodal Expression | ICLR 2024 | - | Github |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS 2023 | Page | Github |
| Inner Monologue: Embodied Reasoning through Planning with Language Models | CoRL 2022 | Page | - |
| Robotic Control via Embodied Chain-of-Thought Reasoning | arXiv 2024 | Page | Github |
| Training Strategies for Efficient Embodied Reasoning | arXiv 2025 | Page | - |

### Planning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | ICML 2025 | Page | Github |
| Embodied large language models enable robots to complete complex tasks in unpredictable environments | Nature Machine Intelligence 2025 | - | - |
| DELTA: Decomposed Efficient Long-Term Robot Task Planning using Large Language Models | ICRA 2025 | Page | Github |
| MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents | ICLR 2025 | - | - |
| Learning for Long-Horizon Planning via Neuro-Symbolic Abductive Imitation | KDD 2025 | Page | Github |
| Multimodal LLM Guided Exploration and Active Mapping using Fisher Information | ICCV 2025 | - | - |
| Open-World Planning via Lifted Regression with LLM-Inferred Affordances for Embodied Agents | ACL 2025 | - | - |
| Structured Preference Optimization for Vision-Language Long-Horizon Task Planning | EMNLP 2025 | - | - |
| Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation | RSS Workshop 2025 | Page | - |
| Multi-Modal Grounded Planning and Efficient Replanning For Learning Embodied Agents with A Few Examples | AAAI 2025 | Page | Github |
| Safe Planner: Empowering Safety Awareness in Large Pre-trained Models for Robot Task Planning | AAAI 2025 | Page | - |
| Pre-emptive Action Revision by Environmental Feedback for Embodied Instruction Following Agents | CoRL 2024 | Page | Github |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | CVPR 2024 | Page | Github |
| RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation | CVPR 2024 | - | - |
| Multimodal Procedural Planning via Dual Text-Image Prompting | EMNLP 2024 | - | - |
| Embodied Agent Interface: A Single Line to Evaluate LLMs for Embodied Decision Making | NeurIPS 2024 | Page | Github |
| Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following | NeurIPS 2024 | - | - |
| Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld | CVPR 2024 | - | Github |
| Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks | ICLR 2024 | Page | Github |
| SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments | AAAI 2024 | Page | Github |
| Learning Adaptive Planning Representations with Natural Language Guidance | ICLR 2024 | Page | Github |
| What Planning Problem Can A Relational Neural Network Solve | NeurIPS 2023 | Page | Github |
| Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents | NeurIPS 2023 | Page | Github |
| SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning | CoRL 2023 | Page | - |
| Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control | NeurIPS 2023 | Page | - |
| Do Embodied Agents Dream of Pixelated Sheep? Embodied Decision Making using Language Guided World Modelling | ICML 2023 | Page | Github |
| Learning Neuro-Symbolic Skills for Bilevel Planning | CoRL 2022 | - | Github |
| Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning | IROS 2022 | - | Github |
| LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models | ICCV 2023 | Page | Github |
| Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents | ICCV 2023 | Page | Github |
| OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-Aware Reasoning | arXiv 2025 | Page | - |
| Preference-Based Long-Horizon Robotic Stacking with Multimodal Large Language Models | arXiv 2025 | - | - |
| Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation | arXiv 2025 | - | - |
| Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning | arXiv 2023 | Page | - |
| Embodied Task Planning with Large Language Models | arXiv 2023 | Page | Github |

### Control

#### Manipulation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success | RSS 2026 | Page | Github |
| Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy | AAAI 2026 | - | - |
| ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | NeurIPS 2025 | Page | - |
| BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | NeurIPS 2025 | Page | Github |
| What Can RL Bring to VLA Generalization? An Empirical Study | NeurIPS 2025 | Page | Github |
| DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control | CoRL 2025 | Page | Github |
| GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data | CoRL 2025 | Page | Github |
| RoboBERT: An End-to-end Multimodal Robotic Manipulation Model | CoRL 2025 | Page | Github |
| Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation | CoRL 2025 | Page | - |
| Learning to Act Anywhere with Task-centric Latent Actions | RSS 2025 | - | Github |
| ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy | RSS 2025 | Page | Github |
| Gemini Robotics: Bringing AI into the Physical World | Technical report | - | - |
| π_0: A Vision-Language-Action Flow Model for General Robot Control | RSS 2025 | Page | Github |
| π_0.5: a Vision-Language-Action Model with Open-World Generalization | CoRL 2025 | Page | Github |
| π*_0.6: a VLA that Learns from Experience | arXiv 2025 | Page | Github |
| Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | ICCV 2025 | Page | Github |
| RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation | ICLR 2025 | Page | Github |
| CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | CVPR 2025 | Page | - |
| RoboGround: Robotic Manipulation with Grounded Vision-Language Priors | CVPR 2025 | Page | Github |
| Magma: A Foundation Model for Multimodal AI Agents | CVPR 2025 | Page | Github |
| OpenVLA: An Open-Source Vision-Language-Action Model | CoRL 2024 | Page | Github |
| Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation | ICLR 2024 | Page | Github |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | Technical report | Page | - |
| GR-3 Technical Report | Technical report | Page | - |
| MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | RSS 2024 | Page | Github |
| RT-1: Robotics Transformer for Real-World Control at Scale | RSS 2023 | Page | Github |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Technical report | Page | Github |
| Octo: An Open-Source Generalist Robot Policy | RSS 2024 | Page | Github |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ICRA 2024 | Page | Github |
| VIMA: General Robot Manipulation with Multimodal Prompts | ICML 2023 | Page | Github |
| Policy Blending and Recombination for Multimodal Contact-Rich Tasks | RA-L 2021 | - | - |

#### Navigation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | RSS 2025 | Page | Github |
| Multimodal Spatial Language Maps for Robot Navigation and Manipulation | IJRR 2025 | Page | Github |
| ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion | RA-L 2025 | Page | Github |
| Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild | CoRL 2025 | Page | Github |
| GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation | CoRL 2025 | Page | Github |
| RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction | ICCV 2025 | Page | Github |
| FLAME: Learning to Navigate with Multimodal LLM in Urban Environments | AAAI 2025 | Page | Github |
| WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation | IROS 2025 | Page | Github |
| SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation | IROS 2025 | Page | - |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | RSS 2024 | Page | Github |
| LLaDA: Driving Everywhere with Large Language Model Policy Adaptation | CVPR 2024 | Page | Github |
| Adaptive Zone-aware Hierarchical Planner | CVPR 2024 | - | Github |

## 📊 Benchmarks and Datasets

### Perception

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D | NeurIPS 2025 | Page | Github |
| Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments | CVPR 2025 | - | Github |
| RCP-Bench: Robust Collaborative Perception Framework | CVPR 2025 | - | Github |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models | CVPR 2025 | Page | Github |
| TartanGround: A Large-Scale Dataset for Ground Robot Perception and Navigation | IROS 2025 | Page | Github |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | ISER 2025 | - | Github |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | CVPR 2024 | Page | Github |
| MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception | CVPR 2024 | Page | Github |
| MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations | NeurIPS 2024 | Page | Github |
| Tiny Robotics Dataset and Benchmark for Continual Object Detection | arXiv 2024 | - | Github |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | NeurIPS 2023 | Page | Github |
| Robo3D: Towards Robust and Reliable 3D Perception against Corruptions | ICCV 2023 | Page | Github |
| Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving | CVPR 2023 | - | Github |
| OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | arXiv 2025 | Page | Github |
| A Unified Perception Benchmark for Capacitive Proximity Sensing Towards Safe Human-Robot Collaboration (HRC) | ICRA 2021 | - | - |
| JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments | TPAMI 2019 | Page | - |
| BOP: Benchmark for 6D Object Pose Estimation | ECCV 2018 | Page | - |

### Reasoning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | NeurIPS 2025 | Page | Github |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | NeurIPS 2025 | Page | Github |
| NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | ICCV 2025 | Page | Github |
| Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering | ICCV 2025 | Page | Github |
| Embodied Reasoning QA Evaluation Dataset | Technical report | Page | Github |
| ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | ICRA 2025 | Page | Github |
| SpatialBot: Precise Spatial Understanding with Vision Language Models | ICRA 2025 | - | Github |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | CVPR 2025 | Page | Github |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | ICLR 2025 | Page | Github |
| OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | arXiv 2025 | Page | Github |
| MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | arXiv 2025 | Page | Github |
| ViLaSR: Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | arXiv 2025 | - | Github |
| SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | NeurIPS 2024 | Page | Github |
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | CVPR 2024 | Page | Github |
| EQA-MX: Embodied Question Answering using Multimodal Expression | ICLR 2024 | - | Github |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | ICRA 2024 | Page | Github |
| EgoTaskQA: Understanding Human Tasks in Egocentric Videos | NeurIPS 2022 | Page | Github |

### Planning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | ICML 2025 | Page | Github |
| PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks | ICLR 2025 | Page | Github |
| OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis | NeurIPS 2025 | - | Github |
| ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models | IROS 2025 | - | Github |
| EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents | arXiv 2025 | Page | Github |
| WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning | arXiv 2025 | Page | Github |
| Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | NeurIPS 2024 | Page | Github |
| Large Language Models as Generalizable Policies for Embodied Tasks | ICLR 2024 | Page | Github |
| LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | ICLR 2024 | Page | Github |
| HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments | ICLR 2024 | Page | Github |
| MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning | arXiv 2024 | Page | Github |
| EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios | arXiv 2024 | Page | Github |
| SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents | arXiv 2024 | - | Github |
| EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | arXiv 2023 | Page | Github |
| Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots | arXiv 2023 | Page | Github |
| Habitat 2.0: Training Home Assistants to Rearrange their Habitat | NeurIPS 2021 | Page | Github |
| ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | CVPR 2020 | Page | Github |
| Habitat: A Platform for Embodied AI Research | ICCV 2019 | Page | Github |

### Control

#### Manipulation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation | RSS 2025 | Page | Github |
| ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI | RSS 2025 | Page | Github |
| RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning | RSS 2025 | Page | Github |
| Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation | RSS 2025 | Page | - |
| ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | CoRL 2025 | Page | Github |
| ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning | CoRL 2025 | Page | - |
| AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems | IROS 2025 | Page | Github |
| GemBench: Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy | ICRA 2025 | Page | Github |
| RoboCerebra: A Large-scale Benchmark for Long-Horizon Robotic Manipulation Evaluation | NeurIPS 2025 | Page | Github |
| VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks | ICCV 2025 | Page | Github |
| DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover | ICCV 2025 | Page | Github |
| ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks | ICLR 2025 | Page | Github |
| GENESIS: A generative world for general-purpose robotics & embodied AI learning | - | Page | Github |
| RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios | NeurIPS 2024 | - | Github |
| Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations | ICLR 2024 | Page | Github |
| FetchBench: A Simulation Benchmark for Robot Fetching | CoRL 2024 | - | Github |
| DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes | CoRL 2024 | Page | Github |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ICRA 2024 | Page | Github |
| RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot | ICRA 2024 | Page | Github |
| SceneReplica: Benchmarking Real-World Robot Manipulation by Creating Replicable Scenes | ICRA 2024 | Page | Github |
| Grasp-Anything: Large-scale Grasp Dataset from Foundation Models | ICRA 2024 | Page | Github |
| RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots | RSS 2024 | Page | Github |
| SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups | CoRL 2024 | Page | Github |
| HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation | arXiv 2024 | Page | Github |
| HomeRobot: Open-Vocabulary Mobile Manipulation | CoRL 2023 | Page | Github |
| DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation | ICRA 2023 | Page | Github |
| LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS 2023 | Page | Github |
| RoboHive: A Unified Framework for Robot Learning | NeurIPS 2023 | Page | Github |
| ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes | ICCV 2023 | Page | Github |
| ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills | ICLR 2023 | Page | Github |
| DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics | ICLR 2023 | Page | Github |
| VIMA: General Robot Manipulation with Multimodal Prompts | ICML 2023 | Page | Github |
| Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments | RA-L 2023 | Page | Github |
| CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks | RA-L 2022 | Page | Github |
| BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation | CoRL 2022 | Page | Github |
| Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets | RSS 2022 | Page | Github |
| What Matters in Learning from Offline Human Demonstrations for Robot Manipulation | CoRL 2021 | Page | Github |
| PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics | ICLR 2021 | Page | Github |
| ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations | NeurIPS 2021 | - | Github |
| DexYCB: A Benchmark for Capturing Hand Grasping of Objects | CVPR 2021 | Page | Github |
| RLBench: The Robot Learning Benchmark & Learning Environment | RA-L 2020 | Page | Github |
| Benchmarking In-Hand Manipulation | RA-L 2020 | Page | - |
| GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping | CVPR 2020 | Page | Github |
| robosuite: A Modular Simulation Framework and Benchmark for Robot Learning | arXiv 2020 | Page | Github |

#### Navigation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | ICCV 2025 | Page | Github |
| Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method | CVPR 2025 | Page | Github |
| OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | CVPR 2025 | Page | Github |
| CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | WACV 2025 | Page | - |
| FlightBench: Benchmarking Learning-based Methods for Ego-vision-based Quadrotors Navigation | RA-L 2025 | Page | Github |
| Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People | RA-L 2025 | - | - |
| GND: Global Navigation Dataset with Multi-Modal Perception and Multi-Category Traversability in Outdoor Campus Environments | ICRA 2025 | Page | Github |
| Language Prompt for Autonomous Driving | AAAI 2025 | - | Github |
| Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions | NeurIPS 2024 | Page | Github |
| HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation | IROS 2024 | Page | Github |
| Generalized Predictive Model for Autonomous Driving | CVPR 2024 | - | Github |
| GOAT-Bench: A Benchmark for Multi-modal Lifelong Navigation | CVPR 2024 | Page | Github |
| Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning | WACV 2024 | Page | - |
| BenchNav: Simulation Platform for Benchmarking Off-road Navigation Algorithms with Probabilistic Traversability | arXiv 2024 | - | Github |
| Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology | arXiv 2024 | Page | Github |
| Toward Human-Like Social Robot Navigation: A Large-Scale, Multi-Modal, Social Human Navigation Dataset | IROS 2023 | Page | Github |
| Benchmarking Visual Localization for Autonomous Navigation | WACV 2023 | Page | Github |
| GOAT: GO to Any Thing | arXiv 2023 | Page | Github |
| RobustNav: Towards Benchmarking Robustness in Embodied Navigation | ICCV 2021 | Page | Github |
| MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation | NeurIPS 2020 | Page | Github |
| The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation | CoRL 2020 | Page | Github |
| Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments | ECCV 2020 | Page | Github |
| Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding | EMNLP 2020 | Page | Github |
| Explainable Object-induced Action Decision for Autonomous Vehicles | CVPR 2020 | Page | Github |
| nuScenes: A multimodal dataset for autonomous driving | CVPR 2020 | Page | - |
| The ApolloScape Open Dataset for Autonomous Driving and its Application | TPAMI 2019 | Page | Github |
| Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments | CVPR 2018 | Page | - |
