
DuckDB Code Practice


🎯 Educational Mission

A comprehensive, vendor-independent DuckDB learning environment designed for developers, data engineers, and analysts who want to master modern in-process SQL analytics and lakehouse architecture through hands-on practice.

15 labs with 120+ exercises, covering DuckDB fundamentals through production deployment and aligned with a complete DuckDB learning curriculum. Completely free and open source. Built for learners, by learners.

πŸŽ“ Why This Repository?

This educational resource fills the gap between theoretical knowledge and practical skills in DuckDB, lakehouse architecture, and modern analytics technologies:

  • Learn by Doing: Progressive hands-on labs build real skills
  • Vendor Independent: Master concepts that apply across all platforms
  • Lakehouse Focus: Learn modern data lakehouse architecture patterns
  • Production Patterns: Learn ETL, data quality, and production operations
  • Multi-Language Experience: Work with Python, SQL, and command-line interfaces
  • Community Driven: Built and improved by the analytics community

πŸŽ“ Learning Approach

Progressive Complexity

Our labs are designed to build knowledge progressively:

  • Beginner (Labs 0-2): Foundation, introduction, and basic operations
  • Intermediate (Labs 3-6): Advanced features, data exploration, and optimization
  • Advanced (Labs 7-10): Cloud integration, pipelines, applications, and client APIs

Hands-On Learning

Each lab includes:

  • Clear Learning Objectives: Know what you'll achieve
  • Step-by-Step Instructions: Guided exercises
  • Real-World Scenarios: Practical use cases
  • Solution Notebooks: Reference implementations
  • Conceptual Guides: Deep-dive explanations

Multi-Language Experience

Gain experience with different interfaces:

  • Python API: Programmatic access with the duckdb package
  • SQL Shell: Interactive SQL command-line interface
  • Jupyter Notebooks: Interactive analysis environment
  • CLI Tools: Command-line utilities for data processing

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   DuckDB Code Practice                     β”‚
β”‚                   Lakehouse Learning Environment            β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚         Lakehouse Architecture Layers               β”‚  β”‚
β”‚  β”‚         - Bronze: Raw data ingestion               β”‚  β”‚
β”‚  β”‚         - Silver: Cleaned & validated              β”‚  β”‚
β”‚  β”‚         - Gold: Business-ready aggregates          β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              ↓                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚         DuckDB Core Engine                           β”‚  β”‚
β”‚  β”‚         - In-memory OLAP database                   β”‚  β”‚
β”‚  β”‚         - Columnar storage format                   β”‚  β”‚
β”‚  β”‚         - SQL query engine                          β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              ↓                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚         Client Interfaces                           β”‚  β”‚
β”‚  β”‚         - Python API (duckdb package)              β”‚  β”‚
β”‚  β”‚         - SQL Shell (duckdb command)               β”‚  β”‚
β”‚  β”‚         - Jupyter Integration                      β”‚  β”‚
β”‚  β”‚         - CLI Tools                                 β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              ↓                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚         Data Formats & Extensions                    β”‚  β”‚
β”‚  β”‚         - Parquet files (lakehouse standard)        β”‚  β”‚
β”‚  β”‚         - Apache Arrow (zero-copy)                 β”‚  β”‚
β”‚  β”‚         - CSV/JSON (interchange)                   β”‚  β”‚
β”‚  β”‚         - Extension ecosystem                       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              ↓                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚         Production Operations                       β”‚  β”‚
β”‚  β”‚         - ETL pipelines                             β”‚  β”‚
β”‚  β”‚         - Data quality frameworks                   β”‚  β”‚
β”‚  β”‚         - Monitoring & alerting                     β”‚  β”‚
β”‚  β”‚         - Backup & recovery                        β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Core Stack

Database Engine

  • DuckDB: In-process SQL OLAP database
  • Columnar storage for analytical queries
  • Zero-copy integration with Arrow

Data Formats

  • Parquet: Columnar storage format
  • Apache Arrow: In-memory columnar format
  • CSV/JSON: Common data interchange formats

Client Interfaces

  • Python: duckdb package for programmatic access
  • SQL Shell: Interactive command-line interface
  • Jupyter: Notebook integration for interactive analysis
  • CLI Tools: Command-line utilities for data processing

Extensions

  • httpfs: HTTP filesystem support for remote data
  • parquet: Advanced Parquet functionality
  • json: Enhanced JSON support
  • spatial: Geospatial data processing

πŸŽ“ Lab Structure

Lab Difficulty & Time Estimates

Level         Labs       Time per Lab  What It Tests
Beginner      Labs 0-2   30-60 min     Basic setup, introduction, SQL operations, fundamental concepts
Intermediate  Labs 3-6   45-75 min     Advanced features, data exploration, optimization patterns
Advanced      Labs 7-10  60-120 min    Cloud integration, pipelines, applications, client APIs

Lab 0: Sample Database Setup

  • Generate and load realistic business data
  • Explore sample database schema and relationships
  • Practice queries on sample data
  • Prerequisite for all subsequent labs

Lab 1: Environment Setup

  • Install DuckDB and dependencies
  • Test database connectivity
  • Validate Python API setup
  • Explore different interfaces

Lab 1A: Introduction to DuckDB (Chapter 1)

  • Understand what DuckDB is and its characteristics
  • Learn when to use DuckDB vs. other databases
  • Explore DuckDB's place in the data ecosystem
  • Understand the complete data processing flow
  • Practice DuckDB-specific SQL extensions

Lab 2: Basic DuckDB Operations (Chapter 3)

  • Create databases and tables using DDL
  • Insert, update, and delete data using DML
  • Execute SQL queries and understand results
  • Work with different data types and functions
  • Practice DuckDB-specific SQL extensions

Lab 3: Advanced Features (Chapter 4)

  • Window functions and analytical queries
  • Advanced aggregation and grouping sets
  • Complex subqueries and CTEs
  • PIVOT operations and ASOF joins
  • LATERAL joins and table functions
  • FILTER, QUALIFY, and HAVING clauses

Lab 4: DuckDB + Python Integration (Chapter 6)

  • Deep dive into Python API for programmatic access
  • Seamless pandas and NumPy integration
  • User-defined functions (UDFs) in Python
  • Apache Arrow and Polars interoperability
  • Building data processing pipelines

Lab 5: Data Format Operations

  • Parquet file operations for lakehouse storage
  • Apache Arrow integration for zero-copy operations
  • CSV/JSON processing and conversion
  • Data format optimization strategies

Lab 5A: Exploring Data Without Persistence (Chapter 5)

  • Query data files directly without creating tables
  • Automatic file type and schema inference
  • Shred nested JSON structures
  • Convert between data formats (CSV to Parquet)
  • Query Parquet files directly
  • Access SQLite and other databases
  • Work with Excel files

Lab 6: Performance & Optimization (Chapter 10)

  • Query execution plan analysis with EXPLAIN
  • Index strategies and performance tuning
  • Memory and thread configuration optimization
  • Loading and querying large datasets (Stack Overflow, NYC Taxi)
  • Export data to Parquet for performance
  • S3 integration and cloud data access

Lab 7: Extensions & Advanced Features

  • HTTP filesystem for remote data lake access
  • Spatial data processing and analysis
  • Advanced JSON operations for semi-structured data
  • Custom functions and UDFs for business logic

Lab 7A: DuckDB in the Cloud with MotherDuck (Chapter 7)

  • Introduction to MotherDuck and its architecture
  • Set up and configure MotherDuck account
  • Connect to MotherDuck using CLI and token authentication
  • Upload and manage databases in the cloud
  • Share databases with collaborators
  • Configure S3 secrets and load data from S3
  • Optimize data ingestion and usage
  • Query data with AI assistance
  • Explore MotherDuck integrations

Lab 8: Real-World Use Cases and Patterns

  • ETL pipeline implementation with error handling
  • Data quality frameworks and validation
  • Slowly Changing Dimensions (SCD) implementation
  • Batch processing workflows

Lab 8A: Building Data Pipelines (Chapter 8)

  • Data ingestion with dlt (Data Loading Tool)
  • Set up and configure dlt pipelines
  • Explore pipeline metadata and monitoring
  • Data transformation with dbt (data build tool)
  • Set up dbt projects with DuckDB
  • Define sources, models, and transformations
  • Test transformations and pipelines
  • Orchestrate data pipelines with Dagster
  • Define assets and dependencies
  • Run and monitor Dagster pipelines
  • Upload processed data to MotherDuck

Lab 9: Integration and Production Readiness

  • External database and system integration
  • Production deployment strategies (Docker/Kubernetes)
  • Monitoring, alerting, and health checks
  • Backup, recovery, and disaster procedures
  • Security implementation and access control

Lab 9A: Building and Deploying Data Apps (Chapter 9)

  • Build custom data apps with Streamlit
  • Use Streamlit components for enhanced functionality
  • Visualize data using plotly
  • Deploy Streamlit apps on Community Cloud
  • Build BI dashboards with Apache Superset
  • Create datasets from SQL queries
  • Export and import Superset dashboards
  • Integrate DuckDB with both tools

Lab 10: Client APIs for DuckDB (Appendix)

  • Overview of officially supported languages
  • Concurrency considerations and best practices
  • Importing large amounts of data efficiently
  • Using DuckDB from Java via JDBC Driver
  • Multi-threaded access patterns
  • Data processing from Java
  • Additional connection options and configuration
  • Cross-language API comparison

💾 Sample Database

The environment includes a comprehensive sample database with realistic business data for hands-on learning:

Sample Tables

  • sample_customers (1,000 records): Customer dimension with segmentation
  • sample_products (200 records): Product catalog with categories
  • sample_orders (5,000 records): Order fact table with status tracking
  • sample_transactions (10,000 records): Transaction details with payment methods
  • sample_events (20,000 records): Web events for user engagement analysis

Loading Sample Data

# Generate and load sample data
python3 scripts/generate_sample_data.py
python3 scripts/load_sample_data.py

Sample Data Documentation

🚀 Quick Start

πŸŽ“ New to DuckDB?

Follow our recommended learning path:

  1. Start with Fundamentals: Read DuckDB Fundamentals wiki page
  2. Set Up Environment: Follow Getting Started Guide
  3. Begin Lab 0: Load sample data with Lab 0
  4. Progress Through Labs: Follow the Learning Path

📋 Setup Options

Option 1: Python Environment (Recommended)

cd duckdb-code-practice
pip install -r requirements.txt
python3 scripts/setup.py

Option 2: Docker Environment

cd duckdb-code-practice
docker-compose up -d

📋 Requirements

  • Python 3.8+
  • pip (Python package manager)
  • 4GB RAM minimum (8GB recommended)
  • 2GB disk space minimum

🔧 Configuration

Python Environment Setup

# Install dependencies
pip install duckdb pandas jupyter

Note: DuckDB extensions such as httpfs and spatial are not separate pip packages; install and load them from within a DuckDB session instead:

INSTALL httpfs;
LOAD httpfs;

DuckDB Configuration

# Configure DuckDB settings
import duckdb
con = duckdb.connect()
con.execute("SET memory_limit='4GB'")
con.execute("SET threads=4")

📚 Documentation

πŸŽ“ Educational Resources

Wiki Guides (Comprehensive learning materials):

Core Documentation

πŸŽ“ Conceptual Guides (Tutorials)

Deep-dive tutorials explaining the "Why" behind the "How":

Lab Materials

💡 Jupyter Notebooks

Interactive Jupyter notebooks for hands-on learning:

🤖 Automation Scripts

🔗 Related Practice Repositories

Continue your learning journey with these related repositories:

AI/ML Practice

Data Engineering Practice

Programming Practice

Resource Hub

🆘 Vendor Independence

This environment uses only permissively licensed, open-source tools:

  • DuckDB (MIT)
  • Python packages (various open source licenses)
  • Jupyter (BSD)
  • Pandas (BSD)
  • Apache Arrow (Apache 2.0)

No proprietary cloud services or consoles required.

🤝 Contributing

This is a practice environment for learning. Feel free to extend labs, add examples, or improve the setup process.

Disclaimer: This is an independent educational resource for learning DuckDB and modern analytics concepts. It is not affiliated with, endorsed by, or sponsored by DuckDB or any vendor.

👥 Community and Learning

This repository is an open educational resource built for the data analytics community. We believe in learning together and sharing knowledge.

🤝 Learning Together

  • 📖 Comprehensive Wiki: Detailed guides and tutorials for all skill levels
  • 💬 GitHub Discussions: Ask questions and share insights with fellow learners
  • 🐛 Issue Tracking: Report bugs and suggest improvements
  • 🔄 Pull Requests: Contribute labs, fixes, and enhancements
  • ⭐ Star the Repo: Show your support and help others discover this resource

πŸŽ“ Contributing to Learning

We welcome contributions that improve the educational value:

  • New Labs: Suggest new lab topics and exercises
  • Better Explanations: Improve clarity of existing content
  • Additional Examples: Add more practical examples
  • Translation: Help translate content for global learners
  • Bug Fixes: Report and fix issues in labs or documentation

See CONTRIBUTING.md for detailed contribution guidelines.

📚 Additional Learning Resources

  • Official DuckDB Documentation: https://duckdb.org/docs/
  • DuckDB Blog: Latest updates and articles
  • Conference Talks: Learn from industry experts

📄 License

Apache License 2.0
