A robust and efficient tool for importing Excel (.xls and .xlsx) data into DuckDB databases, with comprehensive validation and optimization.
- Universal Compatibility: Process .xls and .xlsx files regardless of sheet layout or structure
- Multi-sheet Support: Convert each worksheet to a separate database table
- Intelligent Type Inference: Automatically detect and map data types
- Safe Mode: Ensure data integrity with conservative type conversion
- Large File Handling: Process large Excel files with reservoir sampling
- Batch Processing: Efficiently import data in configurable batches
- Comprehensive Validation: Verify data integrity after import
- Performance Optimization: Automatically optimize tables with indexes
- Detailed Logging: Track the entire process with comprehensive logs
- Parallel Processing: Utilize multi-threading for faster Excel analysis
- Memory-Aware: Dynamically adjust batch sizes based on available system memory
- Robust Error Handling: Gracefully handle various error scenarios with fallback mechanisms
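To illustrate the reservoir sampling mentioned under Large File Handling, here is a minimal sketch of the classic single-pass technique (Algorithm R). It is not taken from the project's code: the function name `reservoir_sample` and its parameters are assumptions made for this example, and the tool's own sampler may differ.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random in a single pass.

    Hypothetical helper for illustration only. Memory use is bounded by k,
    no matter how many rows the input yields.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Sample 100 rows from a stream of one million without holding them all in memory.
sample = reservoir_sample(range(1_000_000), k=100, seed=42)
```

Because each row is seen once and at most `k` rows are retained, a sampler like this fits well with reading a large worksheet in chunks.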
- Python 3.7 or higher
- uv - Fast Python package installer and resolver
If you don't have uv installed, you can install it using:
# On macOS using Homebrew
brew install uv
# On macOS/Linux using curl
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows using PowerShell
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# Using pip
pip install uv

# Clone the repository
git clone https://github.com/yourusername/e2duck.git
cd e2duck
# Create a virtual environment using uv
uv venv
# Activate the virtual environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate
# Install dependencies using uv
uv pip install -r requirements.txt
# For development (includes testing tools)
uv pip install -r requirements-dev.txt

To verify that everything is installed correctly:

python -c "import pandas; import duckdb; import psutil; print('Installation successful!')"

python run.py path/to/excel_file.xlsx path/to/output.duckdb [options]

Available options:

- --sample-size SIZE: Number of rows to sample for validation (default: 100)
- --no-safe-mode: Disable safe mode (default: enabled)
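The command-line interface above could be parsed along the following lines. This is only a sketch using the standard library's `argparse`; how `run.py` actually handles its arguments is not shown here, so the function name `build_parser` and the attribute names `excel_path`, `db_path`, `sample_size`, and `safe_mode` are assumptions chosen to mirror the documented options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options of run.py."""
    parser = argparse.ArgumentParser(
        prog="run.py",
        description="Import an Excel file into a DuckDB database.",
    )
    parser.add_argument("excel_path", help="Path to the Excel file")
    parser.add_argument("db_path", help="Path to the DuckDB database")
    parser.add_argument(
        "--sample-size",
        type=int,
        default=100,
        help="Number of rows to sample for validation (default: 100)",
    )
    # store_false makes safe_mode default to True and flip to False when the flag is given.
    parser.add_argument(
        "--no-safe-mode",
        dest="safe_mode",
        action="store_false",
        help="Disable safe mode (default: enabled)",
    )
    return parser


args = build_parser().parse_args(["data.xlsx", "out.duckdb", "--sample-size", "500"])
```

Modelling `--no-safe-mode` as a `store_false` flag keeps the conservative behaviour as the default, which matches the documented "default: enabled".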
from e2duck.e2duck import ExcelToDuckDB
# Initialize the converter
converter = ExcelToDuckDB(
    excel_path="path/to/your/file.xlsx",
    db_path="path/to/output.duckdb",
    sample_size=100,  # Number of rows to sample for validation
    safe_mode=True  # Use TEXT type for all columns to ensure data integrity
)
# Run the full import process
result = converter.run()
# Check the result
if result['success']:
    print("Import successful!")
    # Access validation results
    validation_results = result.get('validation_results', [])
    for validation in validation_results:
        print(f"Sheet '{validation['sheet']}': {validation['overall_status']}")
else:
    print(f"Import failed: {result.get('error', 'Unknown error')}")

The tool automatically adjusts batch sizes based on available system memory:
# Let the system determine optimal batch size based on memory
result = converter.import_data()
# Or specify a custom batch size
result = converter.import_data(batch_size=10000)

For multi-sheet Excel files, sheets are processed in parallel:
# Analysis phase uses parallel processing for multiple sheets
sheets_info = converter.analyze_excel()

- Connection: Establish connection to DuckDB with optimized settings
- Analysis: Scan Excel structure and extract sample data using parallel processing
- Table Creation: Create DuckDB tables based on Excel structure
- Data Import: Transfer data using memory-aware batch processing
- Validation: Verify data integrity and completeness
- Optimization: Add indexes and optimize table structure
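As a rough illustration of the Table Creation step, the sketch below builds a `CREATE TABLE` statement from a sheet's column names. In safe mode every column becomes `TEXT`; otherwise it uses whatever types were inferred. The helpers `quote_identifier` and `build_create_table`, and the shape of the `inferred_types` mapping, are assumptions for this example and do not describe the tool's internals.

```python
from typing import Dict, List


def quote_identifier(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_create_table(
    table: str,
    columns: List[str],
    inferred_types: Dict[str, str],
    safe_mode: bool = True,
) -> str:
    """Build a CREATE TABLE statement (hypothetical helper).

    With safe_mode=True every column is TEXT, so no value can fail a cast.
    Otherwise the inferred type is used, falling back to TEXT when unknown.
    """
    definitions = []
    for column in columns:
        column_type = "TEXT" if safe_mode else inferred_types.get(column, "TEXT")
        definitions.append(f"{quote_identifier(column)} {column_type}")
    return f"CREATE TABLE IF NOT EXISTS {quote_identifier(table)} ({', '.join(definitions)})"


inferred = {"id": "BIGINT", "price": "DOUBLE"}
safe_sql = build_create_table("Sheet1", ["id", "price", "note"], inferred)
typed_sql = build_create_table("Sheet1", ["id", "price", "note"], inferred, safe_mode=False)
```

Quoting identifiers matters here because Excel sheet and column names often contain spaces or punctuation that would otherwise break the statement.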
| Parameter | Description | Default |
|---|---|---|
| excel_path | Path to the Excel file | Required |
| db_path | Path to the DuckDB database | Required |
| sample_size | Number of rows to sample for validation | 100 |
| safe_mode | Use TEXT type for all columns | True |
| batch_size | Number of rows to process in each batch | Dynamic based on memory |
- Memory Usage: The tool uses batch processing and reservoir sampling to minimize memory usage
- Dynamic Batch Sizing: Automatically adjusts batch size based on available system memory
- Processing Speed: Configurable batch size to balance between speed and memory usage
- Database Optimization: Automatic index creation for frequently queried columns
- Parallel Processing: Multi-threaded Excel analysis and DuckDB's parallel processing capabilities
- Excel Engine Selection: Attempts to use the most efficient Excel engine for each file format
- File Format: .xlsx files process faster than .xls files due to better engine support
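One way to think about dynamic batch sizing is to spend a fixed fraction of the currently available memory on a batch and clamp the result to sensible bounds. The sketch below follows that idea. The function names, the 10% fraction, and the row limits are all assumptions for illustration, not the values the tool uses.

```python
def choose_batch_size(
    available_bytes: int,
    bytes_per_row: int,
    memory_fraction: float = 0.1,
    min_rows: int = 1_000,
    max_rows: int = 100_000,
) -> int:
    """Pick a batch size that fits a fraction of available memory (hypothetical helper)."""
    budget = int(available_bytes * memory_fraction)
    rows = budget // max(bytes_per_row, 1)  # guard against a zero row estimate
    # Clamp so batches are neither tiny (slow) nor huge (memory pressure).
    return max(min_rows, min(rows, max_rows))


def available_memory_bytes() -> int:
    """Read available system memory via psutil, imported lazily."""
    import psutil  # third-party; already listed among the project's dependencies

    return psutil.virtual_memory().available


# Example with fixed numbers: 50 MB available, roughly 1 KB per row.
batch_size = choose_batch_size(available_bytes=50_000_000, bytes_per_row=1_000)
```

In practice the two pieces would be combined, for example `choose_batch_size(available_memory_bytes(), estimated_row_bytes)`, where `estimated_row_bytes` is a per-row size estimate taken from the sampled data.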
The project includes a test suite (currently 62% code coverage) to help ensure functionality and reliability:
# Run all tests
pytest tests/
# Run specific test file
pytest tests/test_initialization.py
# Run tests with coverage report
pytest --cov=e2duck tests/
# Generate detailed HTML coverage report
pytest --cov=e2duck --cov-report=html tests/

The tool provides comprehensive error handling with:
- Detailed error messages with specific error types
- Standardized logging with appropriate log levels
- Transaction-based imports to prevent partial data imports
- Multiple fallback mechanisms for data import (COPY, bulk INSERT, row-by-row INSERT)
- Validation to ensure data integrity
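The fallback chain listed above (COPY, then bulk INSERT, then row-by-row INSERT) can be pictured as trying strategies in order until one succeeds. The sketch below shows that pattern in a database-agnostic way. The function `import_with_fallbacks` and the stub strategies are assumptions for this example. A real importer would run each attempt inside a transaction and roll it back on failure, so that no partial data is left behind.

```python
import logging
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("e2duck.sketch")

# A strategy is a (name, callable) pair; the callable raises on failure.
Strategy = Tuple[str, Callable[[Sequence[tuple]], None]]


def import_with_fallbacks(rows: Sequence[tuple], strategies: List[Strategy]) -> str:
    """Try each import strategy in order and return the name of the first that works."""
    errors: List[str] = []
    for name, strategy in strategies:
        try:
            strategy(rows)
            logger.info("Imported %d rows using %s", len(rows), name)
            return name
        except Exception as exc:  # broad on purpose: any failure triggers the next fallback
            logger.warning("Strategy %s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All import strategies failed: " + "; ".join(errors))


# Stub strategies standing in for COPY, bulk INSERT, and row-by-row INSERT.
def failing_copy(rows: Sequence[tuple]) -> None:
    raise IOError("temporary file could not be written")


stored: List[tuple] = []


def bulk_insert(rows: Sequence[tuple]) -> None:
    stored.extend(rows)


used = import_with_fallbacks(
    [(1, "a"), (2, "b")],
    [("copy", failing_copy), ("bulk_insert", bulk_insert)],
)
```

Ordering the strategies from fastest to slowest means the slower paths are only paid for when the faster ones cannot handle the data.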
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
