# ⚙️ Multimodal-Embodied-AI

This repository collects papers, benchmarks, and datasets at the intersection of multimodal learning, embodied AI, and robotics.

It is continuously updated 🔥🔥. Contributions and suggestions from the community are very welcome.

## 📋 Table of Contents

- 📄 Papers
  - Perception
  - Reasoning
  - Planning
  - Control (Manipulation / Navigation)
- 📊 Benchmarks and Datasets
  - Perception
  - Reasoning
  - Planning
  - Control (Manipulation / Navigation)

## 📄 Papers

### Perception

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning | NeurIPS 2025 | Page | Github |
| From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D | NeurIPS 2025 | Page | Github |
| RaySt3R: Predicting Novel Depth Maps for Zero-Shot Object Completion | NeurIPS 2025 | Page | Github |
| AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies | NeurIPS 2025 | Page | Github |
| EmbodiedSAM: Online Segment Any 3D Thing in Real Time | ICLR 2025 | Page | Github |
| GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision | ICLR 2025 | - | Github |
| RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics | CVPR 2025 | Page | Github |
| Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities | ICLR 2025 | Page | Github |
| Pre-training Auto-regressive Robotic Models with 4D Representations | ICML 2025 | Page | Github |
| Hearing the Slide: Acoustic-Guided Constraint Learning for Fast Non-Prehensile Transport | CASE 2025 | Page | - |
| PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | arXiv 2025 | Page | Github |
| Igniting VLMs toward the Embodied Space | arXiv 2025 | Page | Github |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL 2024 | Page | Github |
| SonicBoom: Contact Localization Using Array of Microphones | RA-L 2024 | Page | Github |
| Embodied Uncertainty-Aware Object Segmentation | IROS 2024 | Page | Github |
| RoboMP2: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models | ICML 2024 | Page | Github |
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | CVPR 2024 | Page | Github |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | CVPR 2024 | Page | Github |
| What's Left? Concept Grounding with Logic-Enhanced Foundation Models | NeurIPS 2023 | Page | Github |
| PACO: Parts and Attributes of Common Objects | CVPR 2023 | - | Github |
| Interactron: Embodied Adaptive Object Detection | CVPR 2022 | - | Github |
| Visuo-Acoustic Hand Pose and Contact Estimation | arXiv 2025 | - | - |
| Multimodal Perception for Goal-oriented Navigation: A Survey | arXiv 2025 | - | - |

### Reasoning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| Gemini Robotics: Bringing AI into the Physical World | Technical report | - | - |
| MolmoAct: Action Reasoning Models that can Reason in Space | Technical report | Page | Github |
| ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge | NeurIPS 2025 | Page | - |
| Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning | Technical report | Page | Github |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | NeurIPS 2025 | Page | Github |
| RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | NeurIPS 2025 | Page | Github |
| Magma: A Foundation Model for Multimodal AI Agents | CVPR 2025 | Page | Github |
| RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete | CVPR 2025 | Page | Github |
| RoboBrain 2.0 Technical Report | Technical report | Page | Github |
| Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning | arXiv 2025 | Page | Github |
| Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation | arXiv 2025 | Page | Github |
| PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability | CVPR 2025 | - | Github |
| ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | ICRA 2025 | Page | Github |
| InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning | ACL 2025 | Page | Github |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | ICRA 2024 | Page | Github |
| SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | NeurIPS 2024 | Page | Github |
| Multi-modal Situated Reasoning in 3D Scenes | NeurIPS 2024 | Page | Github |
| EQA-MX: Embodied Question Answering using Multimodal Expression | ICLR 2024 | - | Github |
| EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | NeurIPS 2023 | Page | Github |
| Inner Monologue: Embodied Reasoning through Planning with Language Models | CoRL 2022 | Page | - |
| Robotic Control via Embodied Chain-of-Thought Reasoning | arXiv 2024 | Page | Github |
| Training Strategies for Efficient Embodied Reasoning | arXiv 2025 | Page | - |

### Planning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | ICML 2025 | Page | Github |
| Embodied large language models enable robots to complete complex tasks in unpredictable environments | Nature Machine Intelligence 2025 | - | - |
| DELTA: Decomposed Efficient Long-Term Robot Task Planning using Large Language Models | ICRA 2025 | Page | Github |
| MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents | ICLR 2025 | - | - |
| Learning for Long-Horizon Planning via Neuro-Symbolic Abductive Imitation | KDD 2025 | Page | Github |
| Multimodal LLM Guided Exploration and Active Mapping using Fisher Information | ICCV 2025 | - | - |
| Open-World Planning via Lifted Regression with LLM-Inferred Affordances for Embodied Agents | ACL 2025 | - | - |
| Structured Preference Optimization for Vision-Language Long-Horizon Task Planning | EMNLP 2025 | - | - |
| Hierarchical Vision-Language Planning for Multi-Step Humanoid Manipulation | RSS Workshop 2025 | Page | - |
| Multi-Modal Grounded Planning and Efficient Replanning For Learning Embodied Agents with A Few Examples | AAAI 2025 | Page | Github |
| Safe Planner: Empowering Safety Awareness in Large Pre-trained Models for Robot Task Planning | AAAI 2025 | Page | - |
| Pre-emptive Action Revision by Environmental Feedback for Embodied Instruction Following Agents | CoRL 2024 | Page | Github |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | CVPR 2024 | Page | Github |
| RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation | CVPR 2024 | - | - |
| Multimodal Procedural Planning via Dual Text-Image Prompting | EMNLP 2024 | - | - |
| Embodied Agent Interface: A Single Line to Evaluate LLMs for Embodied Decision Making | NeurIPS 2024 | Page | Github |
| Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following | NeurIPS 2024 | - | - |
| Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld | CVPR 2024 | - | Github |
| Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks | ICLR 2024 | Page | Github |
| SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments | AAAI 2024 | Page | Github |
| Learning Adaptive Planning Representations with Natural Language Guidance | ICLR 2024 | Page | Github |
| What Planning Problem Can A Relational Neural Network Solve | NeurIPS 2023 | Page | Github |
| Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents | NeurIPS 2023 | Page | Github |
| SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning | CoRL 2023 | Page | - |
| Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control | NeurIPS 2023 | Page | - |
| Do Embodied Agents Dream of Pixelated Sheep? Embodied Decision Making using Language Guided World Modelling | ICML 2023 | Page | Github |
| Learning Neuro-Symbolic Skills for Bilevel Planning | CoRL 2022 | - | Github |
| Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning | IROS 2022 | - | Github |
| LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models | ICCV 2023 | Page | Github |
| Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents | ICCV 2023 | Page | Github |
| OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-Aware Reasoning | arXiv 2025 | Page | - |
| Preference-Based Long-Horizon Robotic Stacking with Multimodal Large Language Models | arXiv 2025 | - | - |
| Reinforced Embodied Planning with Verifiable Reward for Real-World Robotic Manipulation | arXiv 2025 | - | - |
| Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning | arXiv 2023 | Page | - |
| Embodied Task Planning with Large Language Models | arXiv 2023 | Page | Github |

### Control

#### Manipulation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success | RSS 2026 | Page | Github |
| Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy | AAAI 2026 | - | - |
| ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning | NeurIPS 2025 | Page | - |
| BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models | NeurIPS 2025 | Page | Github |
| What Can RL Bring to VLA Generalization? An Empirical Study | NeurIPS 2025 | Page | Github |
| DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control | CoRL 2025 | Page | Github |
| GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data | CoRL 2025 | Page | Github |
| RoboBERT: An End-to-end Multimodal Robotic Manipulation Model | CoRL 2025 | Page | Github |
| Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation | CoRL 2025 | Page | - |
| Learning to Act Anywhere with Task-centric Latent Actions | RSS 2025 | - | Github |
| ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy | RSS 2025 | Page | Github |
| Gemini Robotics: Bringing AI into the Physical World | Technical report | - | - |
| π_0: A Vision-Language-Action Flow Model for General Robot Control | RSS 2025 | Page | Github |
| π_0.5: a Vision-Language-Action Model with Open-World Generalization | CoRL 2025 | Page | Github |
| π*_0.6: a VLA that Learns from Experience | arXiv 2025 | Page | Github |
| Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy | ICCV 2025 | Page | Github |
| RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation | ICLR 2025 | Page | Github |
| CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models | CVPR 2025 | Page | - |
| RoboGround: Robotic Manipulation with Grounded Vision-Language Priors | CVPR 2025 | Page | Github |
| Magma: A Foundation Model for Multimodal AI Agents | CVPR 2025 | Page | Github |
| OpenVLA: An Open-Source Vision-Language-Action Model | CoRL 2024 | Page | Github |
| Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation | ICLR 2024 | Page | Github |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | Technical report | Page | - |
| GR-3 Technical Report | Technical report | Page | - |
| MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting | RSS 2024 | Page | Github |
| RT-1: Robotics Transformer for Real-World Control at Scale | RSS 2023 | Page | Github |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | Technical report | Page | Github |
| Octo: An Open-Source Generalist Robot Policy | RSS 2024 | Page | Github |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ICRA 2024 | Page | Github |
| VIMA: General Robot Manipulation with Multimodal Prompts | ICML 2023 | Page | Github |
| Policy Blending and Recombination for Multimodal Contact-Rich Tasks | RA-L 2021 | - | - |

#### Navigation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | RSS 2025 | Page | Github |
| Multimodal Spatial Language Maps for Robot Navigation and Manipulation | IJRR 2025 | Page | Github |
| ApexNav: An Adaptive Exploration Strategy for Zero-Shot Object Navigation with Target-centric Semantic Fusion | RA-L 2025 | Page | Github |
| Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild | CoRL 2025 | Page | Github |
| GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation | CoRL 2025 | Page | Github |
| RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction | ICCV 2025 | Page | Github |
| FLAME: Learning to Navigate with Multimodal LLM in Urban Environments | AAAI 2025 | Page | Github |
| WMNav: Integrating Vision-Language Models into World Models for Object Goal Navigation | IROS 2025 | Page | Github |
| SmartWay: Enhanced Waypoint Prediction and Backtracking for Zero-Shot Vision-and-Language Navigation | IROS 2025 | Page | - |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | RSS 2024 | Page | Github |
| LLaDA: Driving Everywhere with Large Language Model Policy Adaptation | CVPR 2024 | Page | Github |
| Adaptive Zone-aware Hierarchical Planner | CVPR 2024 | - | Github |

## 📊 Benchmarks and Datasets

### Perception

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D | NeurIPS 2025 | Page | Github |
| Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments | CVPR 2025 | - | Github |
| RCP-Bench: Robust Collaborative Perception Framework | CVPR 2025 | - | Github |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models | CVPR 2025 | Page | Github |
| TartanGround: A Large-Scale Dataset for Ground Robot Perception and Navigation | IROS 2025 | Page | Github |
| HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction | ISER 2025 | - | Github |
| EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | CVPR 2024 | Page | Github |
| MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception | CVPR 2024 | Page | Github |
| MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations | NeurIPS 2024 | Page | Github |
| Tiny Robotics Dataset and Benchmark for Continual Object Detection | arXiv 2024 | - | Github |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | NeurIPS 2023 | Page | Github |
| Robo3D: Towards Robust and Reliable 3D Perception against Corruptions | ICCV 2023 | Page | Github |
| Benchmarking Robustness of 3D Object Detection to Common Corruptions in Autonomous Driving | CVPR 2023 | - | Github |
| OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding | arXiv 2025 | Page | Github |
| A Unified Perception Benchmark for Capacitive Proximity Sensing Towards Safe Human-Robot Collaboration (HRC) | ICRA 2021 | - | - |
| JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments | TPAMI 2019 | Page | - |
| BOP: Benchmark for 6D Object Pose Estimation | ECCV 2018 | Page | - |

### Reasoning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | NeurIPS 2025 | Page | Github |
| Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets | NeurIPS 2025 | Page | Github |
| NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | ICCV 2025 | Page | Github |
| Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering | ICCV 2025 | Page | Github |
| Embodied Reasoning QA Evaluation Dataset | Technical report | Page | Github |
| ReMEmbR: Building and Reasoning Over Long-Horizon Spatio-Temporal Memory for Robot Navigation | ICRA 2025 | Page | Github |
| SpatialBot: Precise Spatial Understanding with Vision Language Models | ICRA 2025 | - | Github |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | CVPR 2025 | Page | Github |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | ICLR 2025 | Page | Github |
| OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | arXiv 2025 | Page | Github |
| MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence | arXiv 2025 | Page | Github |
| ViLaSR: Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing | arXiv 2025 | - | Github |
| SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models | NeurIPS 2024 | Page | Github |
| OpenEQA: Embodied Question Answering in the Era of Foundation Models | CVPR 2024 | Page | Github |
| EQA-MX: Embodied Question Answering using Multimodal Expression | ICLR 2024 | - | Github |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | ICRA 2024 | Page | Github |
| EgoTaskQA: Understanding Human Tasks in Egocentric Videos | NeurIPS 2022 | Page | Github |

### Planning

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | ICML 2025 | Page | Github |
| PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks | ICLR 2025 | Page | Github |
| OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis | NeurIPS 2025 | - | Github |
| ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models | IROS 2025 | - | Github |
| EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents | arXiv 2025 | Page | Github |
| WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning | arXiv 2025 | Page | Github |
| Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making | NeurIPS 2024 | Page | Github |
| Large Language Models as Generalizable Policies for Embodied Tasks | ICLR 2024 | Page | Github |
| LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents | ICLR 2024 | Page | Github |
| HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments | ICLR 2024 | Page | Github |
| MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning | arXiv 2024 | Page | Github |
| EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios | arXiv 2024 | Page | Github |
| SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents | arXiv 2024 | - | Github |
| EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning | arXiv 2023 | Page | Github |
| Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots | arXiv 2023 | Page | Github |
| Habitat 2.0: Training Home Assistants to Rearrange their Habitat | NeurIPS 2021 | Page | Github |
| ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks | CVPR 2020 | Page | Github |
| Habitat: A Platform for Embodied AI Research | ICCV 2019 | Page | Github |

### Control

#### Manipulation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation | RSS 2025 | Page | Github |
| ManiSkill3: GPU Parallelized Robotics Simulation and Rendering for Generalizable Embodied AI | RSS 2025 | Page | Github |
| RoboVerse: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning | RSS 2025 | Page | Github |
| Dex1B: Learning with 1B Demonstrations for Dexterous Manipulation | RSS 2025 | Page | - |
| ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation | CoRL 2025 | Page | Github |
| ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning | CoRL 2025 | Page | - |
| AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems | IROS 2025 | Page | Github |
| GemBench: Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy | ICRA 2025 | Page | Github |
| RoboCerebra: A Large-scale Benchmark for Long-Horizon Robotic Manipulation Evaluation | NeurIPS 2025 | Page | Github |
| VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks | ICCV 2025 | Page | Github |
| DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover | ICCV 2025 | Page | Github |
| ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks | ICLR 2025 | Page | Github |
| GENESIS: A generative world for general-purpose robotics & embodied AI learning | - | Page | Github |
| RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios | NeurIPS 2024 | - | Github |
| Towards Diverse Behaviors: A Benchmark for Imitation Learning with Human Demonstrations | ICLR 2024 | Page | Github |
| FetchBench: A Simulation Benchmark for Robot Fetching | CoRL 2024 | - | Github |
| DexGraspNet 2.0: Learning Generative Dexterous Grasping in Large-scale Synthetic Cluttered Scenes | CoRL 2024 | Page | Github |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | ICRA 2024 | Page | Github |
| RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot | ICRA 2024 | Page | Github |
| SceneReplica: Benchmarking Real-World Robot Manipulation by Creating Replicable Scenes | ICRA 2024 | Page | Github |
| Grasp-Anything: Large-scale Grasp Dataset from Foundation Models | ICRA 2024 | Page | Github |
| RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots | RSS 2024 | Page | Github |
| SimplerEnv: Simulated Manipulation Policy Evaluation Environments for Real Robot Setups | CoRL 2024 | Page | Github |
| HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation | arXiv 2024 | Page | Github |
| HomeRobot: Open-Vocabulary Mobile Manipulation | CoRL 2023 | Page | Github |
| DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation | ICRA 2023 | Page | Github |
| LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning | NeurIPS 2023 | Page | Github |
| RoboHive: A Unified Framework for Robot Learning | NeurIPS 2023 | Page | Github |
| ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes | ICCV 2023 | Page | Github |
| ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills | ICLR 2023 | Page | Github |
| DaXBench: Benchmarking Deformable Object Manipulation with Differentiable Physics | ICLR 2023 | Page | Github |
| VIMA: General Robot Manipulation with Multimodal Prompts | ICML 2023 | Page | Github |
| Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments | RA-L 2023 | Page | Github |
| CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks | RA-L 2022 | Page | Github |
| BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation | CoRL 2022 | Page | Github |
| Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets | RSS 2022 | Page | Github |
| What Matters in Learning from Offline Human Demonstrations for Robot Manipulation | CoRL 2021 | Page | Github |
| PlasticineLab: A Soft-Body Manipulation Benchmark with Differentiable Physics | ICLR 2021 | Page | Github |
| ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations | NeurIPS 2021 | - | Github |
| DexYCB: A Benchmark for Capturing Hand Grasping of Objects | CVPR 2021 | Page | Github |
| RLBench: The Robot Learning Benchmark & Learning Environment | RA-L 2020 | Page | Github |
| Benchmarking In-Hand Manipulation | RA-L 2020 | Page | - |
| GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping | CVPR 2020 | Page | Github |
| robosuite: A Modular Simulation Framework and Benchmark for Robot Learning | arXiv 2020 | Page | Github |

#### Navigation

| Title | Venue | Website | Code |
| --- | --- | --- | --- |
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | ICCV 2025 | Page | Github |
| Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method | CVPR 2025 | Page | Github |
| OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning | CVPR 2025 | Page | Github |
| CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving | WACV 2025 | Page | - |
| FlightBench: Benchmarking Learning-based Methods for Ego-vision-based Quadrotors Navigation | RA-L 2025 | Page | Github |
| Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People | RA-L 2025 | - | - |
| GND: Global Navigation Dataset with Multi-Modal Perception and Multi-Category Traversability in Outdoor Campus Environments | ICRA 2025 | Page | Github |
| Language Prompt for Autonomous Driving | AAAI 2025 | - | Github |
| Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions | NeurIPS 2024 | Page | Github |
| HM3D-OVON: A Dataset and Benchmark for Open-Vocabulary Object Goal Navigation | IROS 2024 | Page | Github |
| Generalized Predictive Model for Autonomous Driving | CVPR 2024 | - | Github |
| GOAT-Bench: A Benchmark for Multi-modal Lifelong Navigation | CVPR 2024 | Page | Github |
| Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning | WACV 2024 | Page | - |
| BenchNav: Simulation Platform for Benchmarking Off-road Navigation Algorithms with Probabilistic Traversability | arXiv 2024 | - | Github |
| Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology | arXiv 2024 | Page | Github |
| Toward Human-Like Social Robot Navigation: A Large-Scale, Multi-Modal, Social Human Navigation Dataset | IROS 2023 | Page | Github |
| Benchmarking Visual Localization for Autonomous Navigation | WACV 2023 | Page | Github |
| GOAT: GO to Any Thing | arXiv 2023 | Page | Github |
| RobustNav: Towards Benchmarking Robustness in Embodied Navigation | ICCV 2021 | Page | Github |
| MultiON: Benchmarking Semantic Map Memory using Multi-Object Navigation | NeurIPS 2020 | Page | Github |
| The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation | CoRL 2020 | Page | Github |
| Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments | ECCV 2020 | Page | Github |
| Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding | EMNLP 2020 | Page | Github |
| Explainable Object-induced Action Decision for Autonomous Vehicles | CVPR 2020 | Page | Github |
| nuScenes: A multimodal dataset for autonomous driving | CVPR 2020 | Page | - |
| The ApolloScape Open Dataset for Autonomous Driving and its Application | TPAMI 2019 | Page | Github |
| Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments | CVPR 2018 | Page | - |
