- Educational Mission
- Why This Repository?
- Learning Approach
- Architecture
- Core Stack
- Lab Structure
- Sample Database
- Quick Start
- Requirements
- Configuration
- Documentation
- Related Practice Repositories
- Vendor Independence
- Contributing
- Community and Learning
- License
A comprehensive, vendor-independent DuckDB learning environment designed for developers, data engineers, and analysts who want to master modern in-process SQL analytics and lakehouse architecture through hands-on practice.
15 comprehensive labs with 120+ exercises covering DuckDB fundamentals through production deployment, following a structured DuckDB learning curriculum. Completely free and open source. Built for learners, by learners.
This educational resource fills the gap between theoretical knowledge and practical skills in DuckDB, lakehouse architecture, and modern analytics technologies:
- Learn by Doing: Progressive hands-on labs build real skills
- Vendor Independent: Master concepts that apply across all platforms
- Lakehouse Focus: Learn modern data lakehouse architecture patterns
- Production Patterns: Learn ETL, data quality, and production operations
- Multi-Language Experience: Work with Python, SQL, and command-line interfaces
- Community Driven: Built and improved by the analytics community
Our labs are designed to build knowledge progressively:
- Beginner (Labs 0-2): Foundation, introduction, and basic operations
- Intermediate (Labs 3-6): Advanced features, data exploration, and optimization
- Advanced (Labs 7-10): Cloud integration, pipelines, applications, and client APIs
Each lab includes:
- Clear Learning Objectives: Know what you'll achieve
- Step-by-Step Instructions: Guided exercises
- Real-World Scenarios: Practical use cases
- Solution Notebooks: Reference implementations
- Conceptual Guides: Deep-dive explanations
Gain experience with different interfaces:
- Python API: Programmatic access with duckdb package
- SQL Shell: Interactive SQL command-line interface
- Jupyter Notebooks: Interactive analysis environment
- CLI Tools: Command-line utilities for data processing
┌───────────────────────────────────────────────────────────────┐
│                     DuckDB Code Practice                      │
│                Lakehouse Learning Environment                 │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │             Lakehouse Architecture Layers              │   │
│  │  - Bronze: Raw data ingestion                          │   │
│  │  - Silver: Cleaned & validated                         │   │
│  │  - Gold: Business-ready aggregates                     │   │
│  └────────────────────────────────────────────────────────┘   │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                   DuckDB Core Engine                   │   │
│  │  - In-process OLAP database                            │   │
│  │  - Columnar storage format                             │   │
│  │  - SQL query engine                                    │   │
│  └────────────────────────────────────────────────────────┘   │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                   Client Interfaces                    │   │
│  │  - Python API (duckdb package)                         │   │
│  │  - SQL Shell (duckdb command)                          │   │
│  │  - Jupyter Integration                                 │   │
│  │  - CLI Tools                                           │   │
│  └────────────────────────────────────────────────────────┘   │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │               Data Formats & Extensions                │   │
│  │  - Parquet files (lakehouse standard)                  │   │
│  │  - Apache Arrow (zero-copy)                            │   │
│  │  - CSV/JSON (interchange)                              │   │
│  │  - Extension ecosystem                                 │   │
│  └────────────────────────────────────────────────────────┘   │
│                                                               │
│  ┌────────────────────────────────────────────────────────┐   │
│  │                 Production Operations                  │   │
│  │  - ETL pipelines                                       │   │
│  │  - Data quality frameworks                             │   │
│  │  - Monitoring & alerting                               │   │
│  │  - Backup & recovery                                   │   │
│  └────────────────────────────────────────────────────────┘   │
│                                                               │
└───────────────────────────────────────────────────────────────┘
- DuckDB: In-process SQL OLAP database
- Columnar storage for analytical queries
- Zero-copy integration with Arrow
- Parquet: Columnar storage format
- Apache Arrow: In-memory columnar format
- CSV/JSON: Common data interchange formats
- Python: duckdb package for programmatic access
- SQL Shell: Interactive command-line interface
- Jupyter: Notebook integration for interactive analysis
- CLI Tools: Command-line utilities for data processing
- httpfs: HTTP filesystem support for remote data
- parquet: Advanced Parquet functionality
- json: Enhanced JSON support
- spatial: Geospatial data processing
| Level | Labs | Time per Lab | What It Covers |
|---|---|---|---|
| Beginner | Labs 0-2 | 30-60 min | Basic setup, introduction, SQL operations, fundamental concepts |
| Intermediate | Labs 3-6 | 45-75 min | Advanced features, data exploration, optimization patterns |
| Advanced | Labs 7-10 | 60-120 min | Cloud integration, pipelines, applications, client APIs |
- Generate and load realistic business data
- Explore sample database schema and relationships
- Practice queries on sample data
- Prerequisite for all subsequent labs
- Install DuckDB and dependencies
- Test database connectivity
- Validate Python API setup
- Explore different interfaces
- Understand what DuckDB is and its characteristics
- Learn when to use DuckDB vs. other databases
- Explore DuckDB's place in the data ecosystem
- Understand the complete data processing flow
- Create databases and tables using DDL
- Insert, update, and delete data using DML
- Execute SQL queries and understand results
- Work with different data types and functions
- Practice DuckDB-specific SQL extensions
- Window functions and analytical queries
- Advanced aggregation and grouping sets
- Complex subqueries and CTEs
- PIVOT operations and ASOF joins
- LATERAL joins and table functions
- FILTER, QUALIFY, and HAVING clauses
- Deep dive into Python API for programmatic access
- Seamless pandas and NumPy integration
- User-defined functions (UDFs) in Python
- Apache Arrow and Polars interoperability
- Building data processing pipelines
- Parquet file operations for lakehouse storage
- Apache Arrow integration for zero-copy operations
- CSV/JSON processing and conversion
- Data format optimization strategies
- Query data files directly without creating tables
- Automatic file type and schema inference
- Shred nested JSON structures
- Convert between data formats (CSV to Parquet)
- Query Parquet files directly
- Access SQLite and other databases
- Work with Excel files
- Query execution plan analysis with EXPLAIN
- Index strategies and performance tuning
- Memory and thread configuration optimization
- Loading and querying large datasets (Stack Overflow, NYC Taxi)
- Export data to Parquet for performance
- S3 integration and cloud data access
- HTTP filesystem for remote data lake access
- Spatial data processing and analysis
- Advanced JSON operations for semi-structured data
- Custom functions and UDFs for business logic
- Introduction to MotherDuck and its architecture
- Set up and configure MotherDuck account
- Connect to MotherDuck using CLI and token authentication
- Upload and manage databases in the cloud
- Share databases with collaborators
- Configure S3 secrets and load data from S3
- Optimize data ingestion and usage
- Query data with AI assistance
- Explore MotherDuck integrations
- ETL pipeline implementation with error handling
- Data quality frameworks and validation
- Slowly Changing Dimensions (SCD) implementation
- Batch processing workflows
- Data ingestion with dlt (data load tool)
- Set up and configure dlt pipelines
- Explore pipeline metadata and monitoring
- Data transformation with dbt (data build tool)
- Set up dbt projects with DuckDB
- Define sources, models, and transformations
- Test transformations and pipelines
- Orchestrate data pipelines with Dagster
- Define assets and dependencies
- Run and monitor Dagster pipelines
- Upload processed data to MotherDuck
- External database and system integration
- Production deployment strategies (Docker/Kubernetes)
- Monitoring, alerting, and health checks
- Backup, recovery, and disaster procedures
- Security implementation and access control
- Build custom data apps with Streamlit
- Use Streamlit components for enhanced functionality
- Visualize data using plotly
- Deploy Streamlit apps on Community Cloud
- Build BI dashboards with Apache Superset
- Create datasets from SQL queries
- Export and import Superset dashboards
- Integrate DuckDB with both tools
- Overview of officially supported languages
- Concurrency considerations and best practices
- Importing large amounts of data efficiently
- Using DuckDB from Java via JDBC Driver
- Multi-threaded access patterns
- Data processing from Java
- Additional connection options and configuration
- Cross-language API comparison
The environment includes a comprehensive sample database with realistic business data for hands-on learning:
- sample_customers (1,000 records): Customer dimension with segmentation
- sample_products (200 records): Product catalog with categories
- sample_orders (5,000 records): Order fact table with status tracking
- sample_transactions (10,000 records): Transaction details with payment methods
- sample_events (20,000 records): Web events for user engagement analysis
# Generate and load sample data
python3 scripts/generate_sample_data.py
python3 scripts/load_sample_data.py

- Sample Database Guide - Complete schema and usage documentation
- Lab 0: Sample Database Setup - Step-by-step loading and exploration
Follow our recommended learning path:
- Start with Fundamentals: Read DuckDB Fundamentals wiki page
- Set Up Environment: Follow Getting Started Guide
- Begin Lab 0: Load sample data with Lab 0
- Progress Through Labs: Follow the Learning Path
cd duckdb-code-practice
pip install -r requirements.txt
python3 scripts/setup.py

cd duckdb-code-practice
docker-compose up -d

- Python 3.8+
- pip (Python package manager)
- 4GB RAM minimum (8GB recommended)
- 2GB disk space minimum
# Install dependencies
pip install duckdb pandas jupyter
# Optional extensions (httpfs, spatial) are installed from within DuckDB itself,
# not via pip:
#   INSTALL httpfs; LOAD httpfs;
#   INSTALL spatial; LOAD spatial;

# Configure DuckDB settings
import duckdb
con = duckdb.connect()
con.execute("SET memory_limit='4GB'")
con.execute("SET threads=4")

Wiki Guides (Comprehensive learning materials):
- Wiki Home - Main wiki page with all guides
- Getting Started Guide - Complete setup and first steps
- DuckDB Fundamentals - Core concepts and architecture
- Lab Guides - Detailed lab walkthroughs
- Learning Path - Recommended learning sequence
- Best Practices - Production-ready patterns
- Troubleshooting - Common issues and solutions
- Setup Guide - Detailed setup instructions for Python and Docker
- Architecture Overview - System architecture and component details
- Lakehouse Architecture - Lakehouse concepts and DuckDB integration
- Operations Guide - Production operations and readiness
- Lab Guide - Complete lab sequence and learning path
- Troubleshooting - Common issues and solutions
- GitHub Pages Setup - Documentation deployment guide
- Wiki Setup - Wiki contribution and maintenance guide
Deep-dive tutorials explaining the "Why" behind the "How":
- Lakehouse Architecture - Understanding lakehouse patterns and DuckDB's role
- DuckDB Architecture - Understanding DuckDB's architecture and design
- Operations & Production Readiness - Production operations and best practices
- Lab 0: Sample Database Setup - Generate and load sample data
- Lab 1: Environment Setup - Component verification and first DuckDB query
- Lab 1A: Introduction to DuckDB - DuckDB fundamentals and ecosystem
- Lab 2: Basic DuckDB Operations - Tables, queries, data types (Chapter 3)
- Lab 3: Advanced Features - Window functions, advanced SQL (Chapter 4)
- Lab 4: DuckDB + Python Integration - Python API, pandas, UDFs (Chapter 6)
- Lab 5: Data Format Operations - Parquet, Arrow, formats
- Lab 5A: Exploring Data Without Persistence - Direct file queries, JSON shredding (Chapter 5)
- Lab 6: Performance & Optimization - Query optimization, large datasets (Chapter 10)
- Lab 7: Extensions & Advanced Features - HTTP filesystem, spatial
- Lab 7: DuckDB in the Cloud with MotherDuck - Cloud integration, S3, AI (Chapter 7)
- Lab 8: Real-World Use Cases and Patterns - ETL, SCD, production patterns
- Lab 8A: Building Data Pipelines - dlt, dbt, Dagster (Chapter 8)
- Lab 9: Integration and Production Readiness - Production deployment, monitoring
- Lab 9: Building and Deploying Data Apps - Streamlit, Superset (Chapter 9)
- Lab 10: Client APIs for DuckDB - Multi-language APIs, JDBC (Appendix)
Interactive Jupyter notebooks for hands-on learning:
- Lab Notebooks - Student notebooks with exercises (coming soon)
- Solution Helper - How to use the solution helper when stuck
- Solution Helper - Python helper for accessing solutions and hints (coming soon)
- Setup Script - Environment validation and setup
- Generate Sample Data - Generate realistic business data
- Load Sample Data - Load sample data into DuckDB
Continue your learning journey with these related repositories:
- DSPy Code Practice - Declarative LLM programming
- LLM Fine-Tuning Practice - Model fine-tuning techniques
- Apache Spark Code Practice - Big data processing
- Apache Iceberg Code Practice - Lakehouse architecture
- Apache Beam Code Practice - Data pipelines
- Scala Data Analysis Practice - Functional programming
- Awesome My Notes - Comprehensive technical notes and learning resources
This environment uses only open-source tools:
- DuckDB (MIT)
- Python packages (various open source licenses)
- Jupyter (BSD)
- Pandas (BSD)
- Apache Arrow (Apache 2.0)
No proprietary cloud services or consoles required.
This is a practice environment for learning. Feel free to extend labs, add examples, or improve the setup process.
Disclaimer: This is an independent educational resource for learning DuckDB and modern analytics concepts. It is not affiliated with, endorsed by, or sponsored by DuckDB or any vendor.
This repository is an open educational resource built for the data analytics community. We believe in learning together and sharing knowledge.
- Comprehensive Wiki: Detailed guides and tutorials for all skill levels
- GitHub Discussions: Ask questions and share insights with fellow learners
- Issue Tracking: Report bugs and suggest improvements
- Pull Requests: Contribute labs, fixes, and enhancements
- Star the Repo: Show your support and help others discover this resource
We welcome contributions that improve the educational value:
- New Labs: Suggest new lab topics and exercises
- Better Explanations: Improve clarity of existing content
- Additional Examples: Add more practical examples
- Translation: Help translate content for global learners
- Bug Fixes: Report and fix issues in labs or documentation
See CONTRIBUTING.md for detailed contribution guidelines.
- Official DuckDB Documentation: https://duckdb.org/docs/
- DuckDB Blog: Latest updates and articles
- Conference Talks: Learn from industry experts
Apache License 2.0