e2duck: Excel to DuckDB Importer

A robust and efficient tool for importing Excel(.xls and .xlsx) data to DuckDB databases with comprehensive validation and optimization.

Features

Universal Compatibility: Process any Excel file regardless of format or structure
Multi-sheet Support: Convert each worksheet to a separate database table
Intelligent Type Inference: Automatically detect and map data types
Safe Mode: Ensure data integrity with conservative type conversion
Large File Handling: Process large Excel files with reservoir sampling
Batch Processing: Efficiently import data in configurable batches
Comprehensive Validation: Verify data integrity after import
Performance Optimization: Automatically optimize tables with indexes
Detailed Logging: Track the entire process with comprehensive logs
Parallel Processing: Utilize multi-threading for faster Excel analysis
Memory-Aware: Dynamically adjust batch sizes based on available system memory
Robust Error Handling: Gracefully handle various error scenarios with fallback mechanisms

Installation

Prerequisites

Python 3.7 or higher
uv - Fast Python package installer and resolver

Installing uv

If you don't have uv installed, you can install it using:

# On macOS using Homebrew
brew install uv

# On macOS/Linux using curl
curl -LsSf https://astral.sh/uv/install.sh | sh

# On Windows using PowerShell
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Using pip
pip install uv

Setting up the project

# Clone the repository
git clone https://github.com/yourusername/e2duck.git
cd e2duck

# Create a virtual environment using uv
uv venv

# Activate the virtual environment
# On macOS/Linux:
source .venv/bin/activate
# On Windows:
.venv\Scripts\activate

# Install dependencies using uv
uv pip install -r requirements.txt

# For development (includes testing tools)
uv pip install -r requirements-dev.txt

Verifying Installation

To verify that everything is installed correctly:

python -c "import pandas; import duckdb; import psutil; print('Installation successful!')"

Usage

Command Line Interface

python run.py path/to/excel_file.xlsx path/to/output.duckdb [options]

Available options:

--sample-size SIZE: Number of rows to sample for validation (default: 100)
--no-safe-mode: Disable safe mode (default: enabled)

As a Module

from e2duck.e2duck import ExcelToDuckDB

# Initialize the converter
converter = ExcelToDuckDB(
    excel_path="path/to/your/file.xlsx",
    db_path="path/to/output.duckdb",
    sample_size=100,  # Number of rows to sample for validation
    safe_mode=True    # Use TEXT type for all columns to ensure data integrity
)

# Run the full import process
result = converter.run()

# Check the result
if result['success']:
    print("Import successful!")
    
    # Access validation results
    validation_results = result.get('validation_results', [])
    for validation in validation_results:
        print(f"Sheet '{validation['sheet']}': {validation['overall_status']}")
else:
    print(f"Import failed: {result.get('error', 'Unknown error')}")

Advanced Usage

Dynamic Batch Size

The tool automatically adjusts batch sizes based on available system memory:

# Let the system determine optimal batch size based on memory
result = converter.import_data()

# Or specify a custom batch size
result = converter.import_data(batch_size=10000)

Parallel Sheet Processing

For multi-sheet Excel files, sheets are processed in parallel:

# Analysis phase uses parallel processing for multiple sheets
sheets_info = converter.analyze_excel()

Workflow

Connection: Establish connection to DuckDB with optimized settings
Analysis: Scan Excel structure and extract sample data using parallel processing
Table Creation: Create DuckDB tables based on Excel structure
Data Import: Transfer data using memory-aware batch processing
Validation: Verify data integrity and completeness
Optimization: Add indexes and optimize table structure

Configuration Options

Parameter	Description	Default
`excel_path`	Path to the Excel file	Required
`db_path`	Path to the DuckDB database	Required
`sample_size`	Number of rows to sample for validation	100
`safe_mode`	Use TEXT type for all columns	True
`batch_size`	Number of rows to process in each batch	Dynamic based on memory

Performance Considerations

Memory Usage: The tool uses batch processing and reservoir sampling to minimize memory usage
Dynamic Batch Sizing: Automatically adjusts batch size based on available system memory
Processing Speed: Configurable batch size to balance between speed and memory usage
Database Optimization: Automatic index creation for frequently queried columns
Parallel Processing: Multi-threaded Excel analysis and DuckDB's parallel processing capabilities
Excel Engine Selection: Attempts to use the most efficient Excel engine for each file format
File Format: .xlsx files process faster than .xls files due to better engine support

Testing

The project includes comprehensive test coverage (62%) to ensure functionality and reliability:

# Run all tests
pytest tests/

# Run specific test file
pytest tests/test_initialization.py

# Run tests with coverage report
pytest --cov=e2duck tests/

# Generate detailed HTML coverage report
pytest --cov=e2duck --cov-report=html tests/

Error Handling

The tool provides comprehensive error handling with:

Detailed error messages with specific error types
Standardized logging with appropriate log levels
Transaction-based imports to prevent partial data imports
Multiple fallback mechanisms for data import (COPY, bulk INSERT, row-by-row INSERT)
Validation to ensure data integrity

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

e2duck: Excel to DuckDB Importer

Features

Installation

Prerequisites

Installing uv

Setting up the project

Verifying Installation

Usage

Command Line Interface

As a Module

Advanced Usage

Dynamic Batch Size

Parallel Sheet Processing

Workflow

Configuration Options

Performance Considerations

Testing

Error Handling

License

Contributing

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
e2duck		e2duck
tests		tests
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
logo_e2duck.png		logo_e2duck.png
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

e2duck: Excel to DuckDB Importer

Features

Installation

Prerequisites

Installing uv

Setting up the project

Verifying Installation

Usage

Command Line Interface

As a Module

Advanced Usage

Dynamic Batch Size

Parallel Sheet Processing

Workflow

Configuration Options

Performance Considerations

Testing

Error Handling

License

Contributing

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages