DriveFusion/carla-data-collection

CARLA Data Collection Framework

A framework for collecting and processing autonomous driving data with the CARLA simulator, built on the rule-based PDM-Lite driving policy and data augmentation such as randomized weather and augmented camera views.



Overview

This repository provides a complete pipeline for:

  • Data Collection: Autonomous driving scenarios using PDM-Lite, a rule-based privileged expert system
  • Data Processing: Format conversion (pre-DriveFusion, DriveFusion formats), dataset cleaning, and validation
  • Language Generation: Vision-Language Model (VLM) annotation with VQA and scene descriptions
  • Multi-Modal Dataset Creation: Camera images, LiDAR point clouds, semantic segmentation, depth estimation, and measurements

Key Features

  • Rule-Based Autopilot: PDM-Lite achieves state-of-the-art performance on CARLA Leaderboard 2.0
  • Multi-Town Support: Covers multiple CARLA towns, including old_towns, Town 12, and Town 13
  • Multi-Camera System: Front, front-left, front-right, back-left, back-right camera perspectives
  • Advanced Sensors: LiDAR, semantic segmentation, depth estimation, bounding box generation
  • Weather Augmentation: Dynamic weather and lighting conditions for robust data
  • Batch Processing: SLURM cluster support for large-scale data generation
  • Format Flexibility: Multiple output formats for different downstream tasks

Project Structure

carla-data-collection/
├── carla_data_collection/
│   ├── carla_data_generation/    # Core data collection pipeline
│   │   ├── team_code/            # Autopilot and data agent implementations
│   │   ├── scenario_runner/       # CARLA scenario execution (from CARLA)
│   │   └── leaderboard/           # CARLA Leaderboard 2.0 (modified)
│   ├── drivefusion_formatters/    # Format converters for DriveFusion format
│   ├── clean_dataset/             # Data cleaning utilities
│   ├── load_measurements/         # Measurement file processing
│   ├── language_generation/       # VQA and language annotation
│   └── format_testers/            # Format validation tools
├── Inference-v0.2/                # Inference and model deployment
├── constants.py                   # Project-wide configuration
└── requirements.txt               # Python dependencies

Getting Started

Prerequisites

  • Python 3.8+ (<=3.10 recommended)
  • CARLA 0.9.15 simulator
  • GPU recommended (NVIDIA with CUDA support)
  • Sufficient disk space for datasets (depends on collection size)

Installation

  1. Clone the repository:

    git clone https://github.com/DriveFusion/carla-data-collection.git
    cd carla-data-collection
  2. Set up CARLA (automated setup available):

    cd carla_data_collection/carla_data_generation
    chmod +x setup_carla.sh
    ./setup_carla.sh
  3. Create Python environment:

    conda env create -f carla_data_collection/carla_data_generation/environment.yml
    conda activate carla-datacol
  4. Install dependencies:

    pip install -r requirements.txt
  5. Configure environment variables (add to .bashrc or .zshrc):

    export CARLA_ROOT=/path/to/CARLA
    export WORK_DIR=/path/to/carla-data-collection
    export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI
    export PYTHONPATH=$PYTHONPATH:${CARLA_ROOT}/PythonAPI/carla
    export SCENARIO_RUNNER_ROOT=${WORK_DIR}/carla_data_collection/carla_data_generation/scenario_runner
    export LEADERBOARD_ROOT=${WORK_DIR}/carla_data_collection/carla_data_generation/leaderboard
  6. Update configuration (if needed):

    • Modify constants.py to match your dataset paths
    • Adjust carla_data_collection/carla_data_generation/team_code/config.py for autopilot parameters
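
Before the first run it can help to confirm that the variables from step 5 are actually visible to Python. The following is a minimal sketch, not part of the repository: the helper name missing_env_vars is hypothetical, and the list of required names is simply copied from the export lines above.

```python
import os
from typing import Iterable, List, Mapping

# Variable names copied from the export lines in step 5.
REQUIRED_VARS = ("CARLA_ROOT", "WORK_DIR", "SCENARIO_RUNNER_ROOT", "LEADERBOARD_ROOT")


def missing_env_vars(env: Mapping[str, str], names: Iterable[str] = REQUIRED_VARS) -> List[str]:
    """Return the names that are unset or empty in env (hypothetical helper)."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(os.environ)
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

If the script reports missing names, re-open the shell (or source .bashrc / .zshrc) so the exports take effect.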

Usage

Data Collection

Local Collection (single machine):

cd carla_data_collection/carla_data_generation
bash run_pdm_lite_local.sh

Cluster Collection (SLURM):

python collect_dataset_slurm.py

Set environment variables in the script:

  • DATAGEN=1: Enable data collection mode
  • DEBUG_CHALLENGE=1: Display agent visualizations
  • PTH_ROUTE: Path to route file
  • PTH_LOG: Output directory for results
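
As an alternative to editing the script by hand, the same variables could be assembled in Python and passed to a subprocess. This is only a sketch under assumptions: build_collection_env is a hypothetical helper, it assumes the launch script reads these variables from its environment rather than overwriting them, and it assumes DEBUG_CHALLENGE=0 disables visualizations.

```python
import os
from typing import Dict, Mapping, Optional


def build_collection_env(
    route_file: str,
    log_dir: str,
    debug: bool = False,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of base_env with the data-collection variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATAGEN"] = "1"                            # enable data collection mode
    env["DEBUG_CHALLENGE"] = "1" if debug else "0"  # assumption: "0" turns visualizations off
    env["PTH_ROUTE"] = route_file                   # path to the route file
    env["PTH_LOG"] = log_dir                        # output directory for results
    return env

# Possible use with the local launch script (paths are placeholders):
#   import subprocess
#   env = build_collection_env("/path/to/route.xml", "/path/to/logs")
#   subprocess.run(["bash", "run_pdm_lite_local.sh"], env=env, check=True)
```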

Data Processing

Format Conversion (to DriveFusion format):

python carla_data_collection/drivefusion_formatters/drivefusion_formatter.py

Dataset Cleaning:

python carla_data_collection/clean_dataset/clean.py

Measurements Generation:

python carla_data_collection/load_measurements/measurements_generator.py

Language Annotation

Generate VQA annotations:

python carla_data_collection/language_generation/language_labels/drivelm/carla_vqa_generator_main.py \
    --data-root /path/to/dataset \
    --output-dir /path/to/output

Core Components

PDM-Lite Autopilot

Rule-based driving policy achieving state-of-the-art CARLA Leaderboard 2.0 performance:

  • Route Planner: Navigation using pre-computed routes
  • Lateral Controller: PID-based steering control
  • Longitudinal Controller: Linear regression-based speed control
  • Kinematic Bicycle Model: Vehicle dynamics simulation
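
To make two of these components concrete, here is a textbook sketch of a PID controller and of one integration step of the kinematic bicycle model. It is not the PDM-Lite code: the class and function names, the gains, the wheelbase, and the forward-Euler integration are all illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook PID controller with output clipping (illustrative, not PDM-Lite's)."""
    kp: float
    ki: float
    kd: float
    limit: float = 1.0        # CARLA steering commands lie in [-1, 1]
    _integral: float = 0.0
    _prev_error: float = 0.0

    def step(self, error: float, dt: float) -> float:
        """Update the controller with the current error and return the clipped output."""
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        out = self.kp * error + self.ki * self._integral + self.kd * derivative
        return max(-self.limit, min(self.limit, out))


def bicycle_step(x, y, yaw, v, steer_angle, accel, wheelbase, dt):
    """One forward-Euler step of the kinematic bicycle model (rear-axle reference).

    steer_angle is the front wheel angle in radians, yaw is the heading in radians.
    """
    x_next = x + v * math.cos(yaw) * dt
    y_next = y + v * math.sin(yaw) * dt
    yaw_next = yaw + v / wheelbase * math.tan(steer_angle) * dt
    v_next = v + accel * dt
    return x_next, y_next, yaw_next, v_next
```

In a rule-based planner, a model of this kind is typically rolled forward over a short horizon to predict where the ego vehicle will be, while the PID turns a heading or lateral error into a steering command.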

Performance Metrics (CARLA Leaderboard 2.0):

Dataset                 Success Rate   Infraction Score   Driving Score
DevTest (10 seeds)      100.0%         0.41/0.59          40.8/58.5
Validation (3 seeds)    91.3%          0.41               36.3
Training (1 seed)       98.8%          0.49               48.5
Bench2Drive (3 seeds)   98.8%          0.98               97.0

Data Agent

Extends autopilot with comprehensive data collection:

  • Multi-camera recording (RGB + augmented)
  • LiDAR point cloud generation
  • Semantic segmentation maps
  • Depth estimation
  • Bounding box extraction with vehicle attributes
  • Measurements logging (speed, steering, acceleration, etc.)

Format Specifications

Pre-DriveFusion Format (pre_drivefusion_format.json.gz):

  • Raw sensor data with measurements
  • Compressed JSON for efficient storage

DriveFusion Format (drivefusion_format.json.gz):

  • Standardized structure for end-to-end models
  • Includes VQA annotations and scene descriptions
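
Both formats are gzip-compressed JSON, so they can be read and written with the Python standard library alone. The sketch below shows a round trip; the helper names are hypothetical, the record schema is not specified here, and the standard json module is used in place of the ujson dependency.

```python
import gzip
import json
from pathlib import Path
from typing import Any


def save_json_gz(obj: Any, path: Path) -> None:
    """Write obj as UTF-8 JSON compressed with gzip."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(obj, f)


def load_json_gz(path: Path) -> Any:
    """Read a gzip-compressed JSON file back into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Example (placeholder path):
#   data = load_json_gz(Path("/path/to/drivefusion_format.json.gz"))
```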

External Sources & Attribution

This project builds upon several open-source repositories and community contributions:

Core Frameworks & Tools

Reference Implementations

  • carla_garage / Transfuser++: State-of-the-art CARLA agent architectures
    • Inspired lateral and longitudinal control implementations
    • Vehicle dynamics modeling approaches

Research & Methods

  • DriveLM: Original DriveLM framework with PDM-Lite implementation
    • Core rule-based autopilot implementation
    • Data collection and augmentation strategies
    • VQA generation methodology

Data Standards

  • nuScenes: Multi-modal autonomous driving dataset format
    • Dataset format conversion reference
    • Evaluation metrics inspiration

Dependencies

Key Python packages:

  • Deep Learning: torch, torchvision
  • Data Processing: numpy, opencv-python, pillow, scipy, shapely
  • Utilities: carla, tqdm, ujson, requests, python-dotenv
  • Visualization: matplotlib
  • XML Processing: lxml
  • LLM Support: openai

See requirements.txt for complete list with versions.

Configuration

constants.py

Global dataset and path configuration:

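import os

SYSTEM_ROOT = "/mnt/mydrive"  # Root directory for all data
# DATASET_PATH must be defined before these lines; its definition is not shown in this excerpt.
TRAIN_DATASET = os.path.join(DATASET_PATH, "drivefusion_train")  # Training split
TEST_DATASET = os.path.join(DATASET_PATH, "drivefusion_test")    # Test split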

config.py (team_code)

Autopilot parameters:

  • Camera positions and FOV
  • Control gains and thresholds
  • Sensor configurations
  • Dynamics model parameters

Included Folders

carla_data_generation/

Contains the modified PDM-Lite implementation with data collection capabilities:

  • original/: Reference PDM-Lite code
  • team_code/: Modified autopilot and data agent
  • leaderboard/: Custom leaderboard evaluator with logging
  • scenario_runner/: Modified scenario runner with data collection support

Inference-v0.2/

Inference and model evaluation components:

  • VLM autopilot implementation
  • Local Qwen VLM integration
  • Results monitoring and validation

Advanced Features

Weather Augmentation

Randomly sampled weather conditions during collection:

  • Clear/cloudy skies
  • Rain, fog, wetness
  • Dynamic time of day
  • Street lighting
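
Random weather sampling of this kind can be sketched as drawing each parameter uniformly from a range. The field names below match attributes of carla.WeatherParameters, but the ranges and the sample_weather helper are illustrative assumptions, not the values this repository uses; street lighting is not covered by the sketch.

```python
import random
from typing import Dict, Optional, Tuple

# Field names follow carla.WeatherParameters; ranges are illustrative assumptions.
WEATHER_RANGES: Dict[str, Tuple[float, float]] = {
    "cloudiness": (0.0, 100.0),
    "precipitation": (0.0, 100.0),
    "precipitation_deposits": (0.0, 100.0),
    "wetness": (0.0, 100.0),
    "fog_density": (0.0, 100.0),
    "sun_azimuth_angle": (0.0, 360.0),
    "sun_altitude_angle": (-90.0, 90.0),  # negative values put the sun below the horizon
}


def sample_weather(rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Draw one weather configuration; pass a seeded rng for reproducibility."""
    if rng is None:
        rng = random.Random()
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in WEATHER_RANGES.items()}

# With a running simulator, a sample could be applied roughly like this:
#   world.set_weather(carla.WeatherParameters(**sample_weather()))
```

Seeding the generator per route makes a collection run repeatable, which is useful when a failed route has to be re-collected under the same conditions.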

Data Validation

  • Format testers for output validation
  • Measurements verification
  • Dataset integrity checks

Batch Processing

  • SLURM job submission and management
  • Automatic retry on failures
  • Dynamic port allocation
  • Progress monitoring
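
Dynamic port allocation and retry-on-failure can be illustrated with the standard library alone. This is a generic sketch, not the logic in collect_dataset_slurm.py; find_free_port and run_with_retry are hypothetical names.

```python
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def run_with_retry(job: Callable[[], T], max_attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call job until it succeeds or max_attempts is reached, then re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(delay_s)
    raise ValueError("max_attempts must be at least 1")
```

Note that a port returned by find_free_port can be taken by another process before the simulator binds it, so launching the simulator is itself a good candidate for run_with_retry.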

Troubleshooting

CARLA Connection Issues:

  • Ensure CARLA is running: ./CarlaUE4.sh
  • Check port 2000 is available
  • Verify CARLA_ROOT environment variable
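
A quick way to test the port is a plain TCP connection attempt. The helper below is a hypothetical sketch that assumes the simulator listens on localhost; it only tells you whether something accepts connections on the port, not whether that something is CARLA.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 2000, timeout_s: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

If this returns True before CARLA has been started, another process is occupying the port; if it returns False after starting CARLA, the simulator is not reachable yet.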

Out of Memory:

  • Reduce camera resolution in config.py
  • Disable semantic segmentation/depth if not needed
  • Process data in smaller batches

Dataset Format Issues:

  • Validate with format testers: python -m carla_data_collection.format_testers.format_tester
  • Check file permissions in output directory

Citation

If you use this data collection framework in your research, please cite the DriveLM paper:

@inproceedings{sima2025drivelm,
  title={DriveLM: Driving with Graph Visual Question Answering},
  author={Sima, Chonghao and Renz, Katrin and Chitta, Kashyap and Chen, Li and Zhang, Hanxue and 
          Xie, Chengen and Beißwenger, Jens and Luo, Ping and Geiger, Andreas and Li, Hongyang},
  booktitle={European Conference on Computer Vision},
  pages={256--274},
  year={2025}
}

For more information about DriveLM, visit the OpenDriveLab repository.

License

The framework code in this project is licensed under the Apache License 2.0; some components carry other licenses, as listed in the License Breakdown. See the LICENSE file for full details.

License Breakdown

Component              License           Details
Main Code              Apache 2.0        Framework code and implementations
Language Data          CC BY-NC-SA 4.0   VQA annotations and scene descriptions
Leaderboard Module     MIT               From CARLA Leaderboard 2.0
Scenario Runner        MIT               From CARLA Scenario Runner
Third-party Datasets   Various           nuScenes and other datasets retain their licenses

Apache License 2.0 Summary

You are free to:

  • ✅ Use the software for any purpose (commercial or non-commercial)
  • ✅ Distribute copies or modified versions
  • ✅ Modify the software
  • ✅ Use the software privately

Under the conditions that:

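  • ⚠️ You include a copy of the license with any distribution
  • ⚠️ You mark modified files with a prominent notice stating that you changed them
  • ⚠️ You retain the existing copyright and attribution notices, including the contents of a NOTICE file if one is provided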

For the complete legal terms, refer to the LICENSE file in the repository root.

Contributing

We welcome contributions! Areas of interest:

  • Additional CARLA scenarios and towns
  • New sensor modalities
  • Format converters for other frameworks
  • Documentation improvements
  • Bug fixes and optimizations

See individual module READMEs for specific contribution guidelines.

Contact & Support

  • Issues: GitHub issues for bug reports and feature requests
  • Documentation: See carla_data_collection/carla_data_generation/README.md for detailed PDM-Lite documentation
  • Research: For questions about the DriveLM paper, see OpenDriveLab

Acknowledgments

This project stands on the shoulders of excellent open-source communities. Special thanks to:

  • CARLA simulator creators and maintainers
  • CARLA Leaderboard 2.0 organizers
  • Autonomous vision research community
  • All contributors to referenced projects

About

Autonomous-driving data pipeline for the DriveFusion project, built on the CARLA simulator, generating cleaned multi-modal sensor data and VQA annotations for training vision-language-action models.
