104 exercises + 5 production-grade pipeline labs. All on Databricks Free Edition.
Clone once, import into Databricks, pick a folder. Exercises fail loud until your code is right; labs ship with synthetic data so you build production-style pipelines, not toy ones.
New (18 April 2026): 5 full-scale pipeline labs + 1 benchmark deep-dive just landed. If you starred this repo for the exercises, they're still here - now alongside end-to-end project work.
Jakub Lasak - Databricks Data Engineer. Helping you interview, execute, and think like a senior.
- LinkedIn (13.5K followers) - Databricks projects and tips
- Substack - Newsletter for data engineers
- DataEngineer.wiki - Cheat sheets, learning paths, cert guides
Prepping for interviews? Writing code is one half of the battle - knowing the questions that actually come up is the other. I maintain Databricks Interview Cheat Sheets by seniority level (junior / mid / senior / bundle).
Fluency comes from reps, not reading. Three structured paths:
- `exercises/` - focused reps on a single concept. LeetCode-style, 5-30 min each.
- `pipeline-labs/` - end-to-end medallion pipelines on a business scenario. 2-3 hours each.
- `deep-dives/` - measure the impact of a technique with numbers. 1-2 hours each.
| | Exercises | Pipeline Labs | Deep-Dives |
|---|---|---|---|
| Format | Single notebook, one TODO per exercise | Multi-notebook guided project | Single-topic deep investigation |
| Time | 5-30 min per exercise | 2-3 hours per lab | 1-2 hours |
| Scope | One concept (MERGE, window functions, ...) | End-to-end project (ingestion -> bronze -> silver -> gold) | One topic measured in depth |
| Narrative | None. "Given table X, write..." | Business scenario. "You're building a streaming pipeline for..." | Benchmark-driven. "Apply technique, measure the delta." |
| Order | Pick any, skip around | Sequential notebooks that build on each other | Sequential; each step layers on the last |
| Goal | Drill a skill until it's automatic | See how concepts fit in a real project | Prove what a technique actually buys you |
| Topic | Notebooks | Exercises | Description |
|---|---|---|---|
| Delta Lake | 6 | 51 | MERGE operations, time travel, schema enforcement, OPTIMIZE, liquid clustering, change data feed |
| ELT | 7 | 53 | Spark SQL joins, window functions, PySpark transformations, Auto Loader, batch ingestion, medallion architecture, complex data types |
Total: 13 notebooks, 104 exercises
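To give a feel for the exercise style, here is a hypothetical TODO in the spirit of the Delta Lake notebooks (table and column names are illustrative, not taken from the repo):

```sql
-- Exercise (illustrative): upsert daily updates into a Delta table.
-- TODO: complete the MERGE so existing customers are updated and new ones inserted.
MERGE INTO customers AS target
USING customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.email = source.email, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (source.customer_id, source.email, source.updated_at)
```

Each exercise follows this shape: a prepared table, a single TODO, and an assertion cell that fails loudly until the statement is right.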
More exercise topics coming - next up: Streaming, Unity Catalog, Performance, and DLT.
Multi-notebook, end-to-end medallion pipelines with a business scenario. Each runs 2-3 hours and ships with a synthetic data generator.
| Lab | What You Build | Focus |
|---|---|---|
| Apparel Retail 360 (DLT) | End-to-end retail analytics pipeline on Delta Live Tables with a full medallion architecture. | DLT, Medallion, SCD Type 2, Streaming, Data Quality Expectations |
| Fintech Transaction Monitoring | Real-time fraud-monitoring pipeline for a payment processor handling 500K+ transactions/day. | Structured Streaming, Rescued Data, Watermarked Dedup, Stream-Static Joins, Liquid Clustering |
| DE Associate Certification Prep | Production-grade pipeline covering every exam domain of the Databricks Data Engineer Associate cert. | Auto Loader, COPY INTO, Medallion, SCD2, Jobs, Unity Catalog |
| PySpark Developer Cert Prep | E-commerce analytics pipeline covering every domain of the Spark Developer Associate cert. | DataFrame API, Structured Streaming, Data Skew, Performance Tuning |
Single-topic labs that measure the impact of a technique with numbers, not intuition.
| Lab | What You Build | Focus |
|---|---|---|
| 6 Delta Optimization Techniques | Iteratively apply and measure core Delta performance levers on a synthetic 50M-row dataset. | Partitioning, Z-Order, OPTIMIZE, Auto Optimize, Liquid Clustering, VACUUM |
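For orientation, these are the kinds of Databricks SQL commands the deep-dive applies and measures (table and column names here are placeholders, not the lab's actual schema):

```sql
-- Compact small files and co-locate rows on a frequently filtered column
OPTIMIZE sales ZORDER BY (customer_id);

-- Or switch the table to liquid clustering instead of static partitioning
ALTER TABLE sales CLUSTER BY (customer_id);
OPTIMIZE sales;

-- Reclaim data files older than the default 7-day retention window
VACUUM sales;
```

The lab runs the same queries before and after each lever, so you see the scan-time and file-count deltas rather than taking the technique on faith.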
- Sign up for Databricks Free Edition (free, no credit card)
- Clone or import this repo into Databricks (Workspace -> Create -> Git folder)
- Navigate to the folder you want, open its README, follow the instructions
Everything runs on Free Edition: serverless compute, Unity Catalog, Delta Lake. No cloud account, no cluster config.
- New to Databricks? Start with DE Associate Cert Prep - broadest fundamentals.
- Want quick reps on a specific concept? Delta Lake exercises or ELT exercises - drill one concept at a time.
- Comfortable with batch, new to streaming? Apparel DLT, then Fintech Monitoring.
- Preparing for a cert? DE Associate or Spark Developer Associate.
- Already shipping pipelines, want to go deeper on performance? Delta Optimization Techniques.
New exercises and labs ship regularly. Follow on LinkedIn or subscribe to the Substack newsletter to be notified when new content drops.
Found a bug? Have a suggestion? Open an issue.
Disclaimer: This is an independent educational resource created by Jakub Lasak. Not affiliated with, endorsed by, or sponsored by Databricks, Inc. "Databricks" and "Delta Lake" are trademarks of their respective owners.