A comprehensive framework for collecting and processing autonomous driving data using CARLA simulator, built on state-of-the-art rule-based driving policies and advanced data augmentation techniques.
This repository provides a complete pipeline for:
- Data Collection: Autonomous driving scenarios using PDM-Lite, a rule-based privileged expert system
- Data Processing: Format conversion (pre-DriveFusion, DriveFusion formats), dataset cleaning, and validation
- Language Generation: Vision-Language Model (VLM) annotation with VQA and scene descriptions
- Multi-Modal Dataset Creation: Camera images, LiDAR point clouds, semantic segmentation, depth estimation, and measurements
- Rule-Based Autopilot: PDM-Lite achieves state-of-the-art performance on CARLA Leaderboard 2.0
- Multi-Town Support: Collection across CARLA towns, including old_towns, Town 12, and Town 13
- Multi-Camera System: Front, front-left, front-right, back-left, back-right camera perspectives
- Advanced Sensors: LiDAR, semantic segmentation, depth estimation, bounding box generation
- Weather Augmentation: Dynamic weather and lighting conditions for robust data
- Batch Processing: SLURM cluster support for large-scale data generation
- Format Flexibility: Multiple output formats for different downstream tasks
carla-data-collection/
├── carla_data_collection/
│ ├── carla_data_generation/ # Core data collection pipeline
│ │ ├── team_code/ # Autopilot and data agent implementations
│ │ ├── scenario_runner/ # CARLA scenario execution (from CARLA)
│ │ └── leaderboard/ # CARLA Leaderboard 2.0 (modified)
│ ├── drivefusion_formatters/ # Format converters for DriveFusion format
│ ├── clean_dataset/ # Data cleaning utilities
│ ├── load_measurements/ # Measurement file processing
│ ├── language_generation/ # VQA and language annotation
│ └── format_testers/ # Format validation tools
├── Inference-v0.2/ # Inference and model deployment
├── constants.py # Project-wide configuration
└── requirements.txt # Python dependencies
- Python 3.8+ (3.10 or earlier recommended)
- CARLA 0.9.15 simulator
- GPU recommended (NVIDIA with CUDA support)
- Sufficient disk space for datasets (depends on collection size)
Clone the repository:

```bash
git clone https://github.com/DriveFusion/carla-data-collection.git
cd carla-data-collection
```
Set up CARLA (automated setup available):

```bash
cd carla_data_collection/carla_data_generation
chmod +x setup_carla.sh
./setup_carla.sh
```
Create the Python environment:

```bash
conda env create -f carla_data_collection/carla_data_generation/environment.yml
conda activate carla-datacol
```
Install dependencies:

```bash
pip install -r requirements.txt
```
Configure environment variables (add to `.bashrc` or `.zshrc`):

```bash
export CARLA_ROOT=/path/to/CARLA
export WORK_DIR=/path/to/carla-data-collection
export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI
export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI/carla
export SCENARIO_RUNNER_ROOT=${WORK_DIR}/carla_data_collection/carla_data_generation/scenario_runner
export LEADERBOARD_ROOT=${WORK_DIR}/carla_data_collection/carla_data_generation/leaderboard
```
Update the configuration (if needed):
- Modify `constants.py` to match your dataset paths
- Adjust `carla_data_collection/team_code/config.py` for autopilot parameters
Local Collection (single machine):

```bash
cd carla_data_collection/carla_data_generation
bash run_pdm_lite_local.sh
```

Cluster Collection (SLURM):

```bash
python collect_dataset_slurm.py
```

Set environment variables in the script:
- `DATAGEN=1`: Enable data collection mode
- `DEBUG_CHALLENGE=1`: Display agent visualizations
- `PTH_ROUTE`: Path to route file
- `PTH_LOG`: Output directory for results
Format Conversion (to DriveFusion format):

```bash
python carla_data_collection/drivefusion_formatters/drivefusion_formatter.py
```

Dataset Cleaning:

```bash
python carla_data_collection/clean_dataset/clean.py
```

Measurements Generation:

```bash
python carla_data_collection/load_measurements/measurements_generator.py
```

Generate VQA annotations:

```bash
python carla_data_collection/language_generation/language_labels/drivelm/carla_vqa_generator_main.py \
    --data-root /path/to/dataset \
    --output-dir /path/to/output
```

A rule-based driving policy achieving state-of-the-art CARLA Leaderboard 2.0 performance:
- Route Planner: Navigation using pre-computed routes
- Lateral Controller: PID-based steering control
- Longitudinal Controller: Linear regression-based speed control
- Kinematic Bicycle Model: Vehicle dynamics simulation
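For illustration, the core update of a kinematic bicycle model can be sketched as follows. The wheelbase, time step, and mid-wheelbase reference point below are illustrative choices, not PDM-Lite's actual parameters:

```python
import math

def bicycle_step(x, y, yaw, speed, steer, accel, dt=0.05, wheelbase=2.9):
    """One kinematic-bicycle integration step (illustrative parameters).

    x, y  : position of the reference point in metres
    yaw   : heading in radians
    speed : longitudinal speed in m/s
    steer : front-wheel steering angle in radians
    accel : net longitudinal acceleration in m/s^2
    """
    # Slip angle of the velocity vector, taking the reference point
    # at the middle of the wheelbase.
    beta = math.atan(0.5 * math.tan(steer))
    x += speed * math.cos(yaw + beta) * dt
    y += speed * math.sin(yaw + beta) * dt
    yaw += (speed / (0.5 * wheelbase)) * math.sin(beta) * dt
    speed = max(0.0, speed + accel * dt)
    return x, y, yaw, speed
```

Rolling this step forward over a short horizon yields predicted ego trajectories that a rule-based planner can score.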
Performance Metrics (CARLA Leaderboard 2.0):
| Dataset | Success Rate | Infraction Score | Driving Score |
|---|---|---|---|
| DevTest (10 seeds) | 100.0% | 0.41/0.59 | 40.8/58.5 |
| Validation (3 seeds) | 91.3% | 0.41 | 36.3 |
| Training (1 seed) | 98.8% | 0.49 | 48.5 |
| Bench2Drive (3 seeds) | 98.8% | 0.98 | 97.0 |
Extends autopilot with comprehensive data collection:
- Multi-camera recording (RGB + augmented)
- LiDAR point cloud generation
- Semantic segmentation maps
- Depth estimation
- Bounding box extraction with vehicle attributes
- Measurements logging (speed, steering, acceleration, etc.)
Pre-DriveFusion Format (pre_drivefusion_format.json.gz):
- Raw sensor data with measurements
- Compressed JSON for efficient storage
DriveFusion Format (drivefusion_format.json.gz):
- Standardized structure for end-to-end models
- Includes VQA annotations and scene descriptions
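Both formats are gzip-compressed JSON, so they can be read and written with the Python standard library alone. A minimal round-trip helper; the field names in the example record are hypothetical, not the actual DriveFusion schema:

```python
import gzip
import json

def save_record(path, record):
    """Write a record as gzip-compressed JSON (.json.gz)."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(record, f)

def load_record(path):
    """Read a gzip-compressed JSON record back into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical example record -- not the real DriveFusion schema.
example = {
    "speed": 5.2,    # m/s
    "steer": -0.03,  # normalized [-1, 1]
    "scene_description": "ego approaches a signalized intersection",
}
```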
This project builds upon several open-source repositories and community contributions:
- CARLA Simulator: Open-source autonomous driving simulator
- CARLA Leaderboard 2.0: Autonomous driving competition platform (MIT License)
- Scenario Runner: CARLA scenario execution framework (MIT License)
- carla_garage / Transfuser++: State-of-the-art CARLA agent architectures
- Inspired lateral and longitudinal control implementations
- Vehicle dynamics modeling approaches
- DriveLM: Original DriveLM framework with PDM-Lite implementation
- Core rule-based autopilot implementation
- Data collection and augmentation strategies
- VQA generation methodology
- nuScenes: Multi-modal autonomous driving dataset format
- Dataset format conversion reference
- Evaluation metrics inspiration
Key Python packages:
- Deep Learning: `torch`, `torchvision`
- Data Processing: `numpy`, `opencv-python`, `pillow`, `scipy`, `shapely`
- Utilities: `carla`, `tqdm`, `ujson`, `requests`, `python-dotenv`
- Visualization: `matplotlib`
- XML Processing: `lxml`
- LLM Support: `openai`
See requirements.txt for complete list with versions.
Global dataset and path configuration:

```python
SYSTEM_ROOT = "/mnt/mydrive"  # Root directory for all data
TRAIN_DATASET = os.path.join(DATASET_PATH, "drivefusion_train")
TEST_DATASET = os.path.join(DATASET_PATH, "drivefusion_test")
```

Autopilot parameters:
- Camera positions and FOV
- Control gains and thresholds
- Sensor configurations
- Dynamics model parameters
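As a sketch, such a parameter block could be grouped in a dataclass. All names and default values below are hypothetical, not the actual fields of `team_code/config.py`:

```python
from dataclasses import dataclass

@dataclass
class AutopilotConfig:
    """Illustrative parameter grouping; names and values are hypothetical."""
    # Camera positions and FOV
    camera_width: int = 1024
    camera_height: int = 512
    camera_fov_deg: float = 110.0
    # Control gains and thresholds
    steer_kp: float = 1.25
    steer_ki: float = 0.2
    steer_kd: float = 0.3
    brake_speed_threshold: float = 0.4  # m/s
    # Dynamics model parameters
    wheelbase_m: float = 2.9
```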
Contains the modified PDM-Lite implementation with data collection capabilities:
- original/: Reference PDM-Lite code
- team_code/: Modified autopilot and data agent
- leaderboard/: Custom leaderboard evaluator with logging
- scenario_runner/: Modified scenario runner with data collection support
Inference and model evaluation components:
- VLM autopilot implementation
- Local QWen VLM integration
- Results monitoring and validation
Randomly sampled weather conditions during collection:
- Clear/cloudy skies
- Rain, fog, wetness
- Dynamic time of day
- Street lighting
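One way such randomization can be implemented is by sampling parameter values per route. The sampling ranges below are illustrative guesses, not the pipeline's actual distribution; the keys map onto `carla.WeatherParameters` fields:

```python
import random

def sample_weather(rng=None):
    """Sample randomized weather; ranges are illustrative, not the
    actual distribution used during collection."""
    rng = rng or random.Random()
    return {
        "cloudiness": rng.uniform(0.0, 100.0),
        "precipitation": rng.uniform(0.0, 80.0),
        "precipitation_deposits": rng.uniform(0.0, 80.0),  # puddles / wet road
        "fog_density": rng.uniform(0.0, 50.0),
        "wetness": rng.uniform(0.0, 100.0),
        "sun_altitude_angle": rng.uniform(-20.0, 90.0),    # below 0 is night
    }

# With a running simulator this would be applied roughly as:
#   world.set_weather(carla.WeatherParameters(**sample_weather()))
```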
- Format testers for output validation
- Measurements verification
- Dataset integrity checks
- SLURM job submission and management
- Automatic retry on failures
- Dynamic port allocation
- Progress monitoring
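Dynamic port allocation can be done by probing for a free TCP port before launching each CARLA server. A minimal sketch; the actual SLURM scripts may allocate ports differently:

```python
import socket

def find_free_port(start=2000, end=3000):
    """Return the first TCP port in [start, end) that can be bound locally.

    Note: a CARLA server also uses the ports just above its RPC port,
    so concurrent servers should be spaced a few ports apart.
    """
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
            except OSError:
                continue  # port in use; try the next one
            return port
    raise RuntimeError(f"no free port in [{start}, {end})")
```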
CARLA Connection Issues:
- Ensure CARLA is running: `./CarlaUE4.sh`
- Check that port 2000 is available
- Verify the `CARLA_ROOT` environment variable
Out of Memory:
- Reduce camera resolution in `config.py`
- Disable semantic segmentation/depth if not needed
- Process data in smaller batches
Dataset Format Issues:
- Validate with the format testers: `python -m carla_data_collection.format_testers.format_tester`
- Check file permissions in the output directory
If you use this data collection framework in your research, please cite the DriveLM paper:
```bibtex
@inproceedings{sima2025drivelm,
  title={DriveLM: Driving with Graph Visual Question Answering},
  author={Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and
          Xie, Chengen and Beißwenger, Jens and Luo, Ping and Geiger, Andreas and Li, Hongyang},
  booktitle={European Conference on Computer Vision},
  pages={256--274},
  year={2025}
}
```

For more information about DriveLM, visit the OpenDriveLab repository.
This project is licensed under the Apache License 2.0. See the LICENSE file for full details.
| Component | License | Details |
|---|---|---|
| Main Code | Apache 2.0 | Framework code and implementations |
| Language Data | CC BY-NC-SA 4.0 | VQA annotations and scene descriptions |
| Leaderboard Module | MIT | From CARLA Leaderboard 2.0 |
| Scenario Runner | MIT | From CARLA Scenario Runner |
| Third-party Datasets | Various | nuScenes and other datasets retain their licenses |
You are free to:
- ✅ Use the software for any purpose (commercial or non-commercial)
- ✅ Distribute copies or modified versions
- ✅ Modify the software
- ✅ Use the software privately
Under the conditions that:
- ⚠️ You include a copy of the license and copyright notice
- ⚠️ You provide a summary of changes made to the code
- ⚠️ You acknowledge the original authors
For the complete legal terms, refer to the LICENSE file in the repository root.
We welcome contributions! Areas of interest:
- Additional CARLA scenarios and towns
- New sensor modalities
- Format converters for other frameworks
- Documentation improvements
- Bug fixes and optimizations
See individual module READMEs for specific contribution guidelines.
- Issues: GitHub issues for bug reports and feature requests
- Documentation: See `carla_data_collection/carla_data_generation/README.md` for detailed PDM-Lite documentation
- Research: For questions about the DriveLM paper, see OpenDriveLab
This project stands on the shoulders of excellent open-source communities. Special thanks to:
- CARLA simulator creators and maintainers
- CARLA Leaderboard 2.0 organizers
- Autonomous vision research community
- All contributors to referenced projects