We introduce a novel task, Vision-and-Language Driving (VLD), which aims to enable vehicles to autonomously navigate traffic environments by following long-horizon natural-language instructions from humans.
The paper for this work is under review.
This repository provides a dataset for the VLD task. We collect data with CARLA 0.9.15 using the Leaderboard 2.0 framework in Town 12, and design driving routes by configuring start points, intermediate waypoints, and end points.
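As a rough illustration, a route in the Leaderboard 2.0 framework is typically defined in an XML file as an ordered list of keypoints. The sketch below follows that general format; the coordinates are placeholders, not actual values from the dataset:

```xml
<routes>
  <!-- One route through Town 12: the first and last positions are the
       start and end points, intermediate positions are waypoints.
       All coordinates here are placeholders. -->
  <route id="0" town="Town12">
    <waypoints>
      <position x="0.0" y="0.0" z="0.0"/>
      <position x="100.0" y="0.0" z="0.0"/>
      <position x="100.0" y="50.0" z="0.0"/>
    </waypoints>
  </route>
</routes>
```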
The vehicle is equipped with multiple sensors: four RGB cameras, four semantic segmentation cameras, four depth cameras, and one LiDAR. Our sensor placement is adapted from the DriveLM placement scheme and is shown in the figure below. The specifications of the sensors are listed in the table below.
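The rig above (4 RGB + 4 semantic + 4 depth cameras + 1 LiDAR) can be sketched as a CARLA-style sensor specification. The blueprint IDs below are standard CARLA sensor blueprints; the camera view names and mounting transforms are placeholders, not the actual values from the paper (see the figure and table for those):

```python
# Sketch of the VLD sensor rig as a list of CARLA-style sensor specs.
# Blueprint IDs are standard CARLA sensors; transforms and view names
# are placeholders, not the real placement used in the dataset.

CAMERA_VIEWS = ["front", "left", "right", "back"]  # assumed view layout

CAMERA_BLUEPRINTS = {
    "rgb": "sensor.camera.rgb",
    "semantic": "sensor.camera.semantic_segmentation",
    "depth": "sensor.camera.depth",
}


def build_sensor_rig():
    """Return sensor specs: 4 RGB + 4 semantic + 4 depth cameras + 1 LiDAR."""
    rig = []
    for name, blueprint in CAMERA_BLUEPRINTS.items():
        for view in CAMERA_VIEWS:
            rig.append({
                "id": f"{name}_{view}",
                "type": blueprint,
                # Placeholder mounting transform (meters / degrees).
                "transform": {"x": 0.0, "y": 0.0, "z": 2.0, "yaw": 0.0},
            })
    rig.append({
        "id": "lidar_top",
        "type": "sensor.lidar.ray_cast",
        "transform": {"x": 0.0, "y": 0.0, "z": 2.5, "yaw": 0.0},
    })
    return rig
```

Each spec could then be fed to CARLA's blueprint library to spawn the sensor and attach it to the ego vehicle.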
Please use the password provided in the manuscript to access the download pages and to unzip the archives.
The full VLD dataset: download link
The preprocessed raw data: download link
Note: The preprocessed raw data contains additional information collected from the CARLA simulator. You may build your own datasets for other tasks from this raw material.
A mini sample data with a single route: download link
The following video is an RGB recording of a vehicle's driving route selected from the dataset, with the instruction:
Go straight along the current road, turn left at the T-junction after passing a blue-purple kiosk, then go straight, turn right at the T-junction after passing a white plastic table, then go straight, turn left at the T-junction at the end of the road, then go straight, turn left at the crossroads after passing a Coca-Cola vending machine, then go straight, turn right at the T-junction at the end of the road, then go straight and stop near the mailbox on the right side of the road.
A.collected.route.in.the.VLD.dataset.mp4
For more details, please refer to the paper.
The prediction results:
