Independent educational resource. Not affiliated with or endorsed by Databricks, Inc.
Databricks has hundreds of features, configs, and services. Here are the 100 that actually matter.
Go through each category. Be honest: can you explain each concept to a colleague without looking anything up? If not, that's your gap.
Easy = what you should know after month one. Medium = solid mid-level knowledge. Hard = senior-level depth. All concepts reflect 2026 Databricks best practices.
Built by dataengineer.wiki - cheat sheets, exercises, labs, and career tools for Databricks data engineers.
Go through every concept, count how many you can explain without looking anything up, and check your score below.
| Your Score | What It Means |
|---|---|
| 80-100 | Senior-level. You can architect and debug production systems. Focus on the Hard concepts you missed. |
| 50-79 | Solid mid-level. You know the daily work but have gaps in internals and advanced patterns. |
| 30-49 | Building foundations. Focus on Easy and Medium concepts in Categories 1, 3, and 4 first. |
| Under 30 | Early stage. Start with all Easy concepts. Come back in 3 months and re-score. |
Share your score. Send this to a colleague who's prepping for Databricks interviews or certification.
One concept per day. 100 days. Post what you learned on LinkedIn with #100DaysOfDatabricks.
- Fork this repo
- One concept per day, starting at #1
- Post your takeaway on LinkedIn
- Check it off and move on
Day 1: ACID Transactions. Day 100: Testing Patterns. By the end, you can explain every concept a senior Databricks engineer needs to know.
| # | Category | Concepts | Focus |
|---|---|---|---|
| 1 | Delta Lake | #1 - #10 | Storage layer fundamentals |
| 2 | Spark Execution | #11 - #20 | How the engine works |
| 3 | SQL & DataFrames | #21 - #30 | Daily transformation work |
| 4 | Data Ingestion | #31 - #40 | Getting data in |
| 5 | Streaming | #41 - #50 | Real-time processing |
| 6 | Performance & Cost | #51 - #60 | Making it fast and cheap |
| 7 | Unity Catalog & Governance | #61 - #70 | Security and access control |
| 8 | Lakeflow Declarative Pipelines | #71 - #80 | Managed pipeline engine (formerly DLT) |
| 9 | Workflows, CI/CD & Operations | #81 - #90 | Running it in production |
| 10 | Architecture & Advanced Patterns | #91 - #100 | Senior-level decisions |
The storage layer that makes Databricks a lakehouse. Every table you create, query, or maintain runs on Delta.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 1 | ACID Transactions | Easy | How Delta Lake guarantees atomicity, consistency, isolation, and durability on top of cloud object storage. Optimistic concurrency control. Why this matters vs. raw Parquet. |
| 2 | Transaction Log (_delta_log) | Easy | How the JSON-based transaction log tracks every change. What happens during reads (snapshot isolation). Log checkpoints every 100 commits. How the log enables time travel. |
| 3 | Time Travel & RESTORE | Easy | Querying previous versions with VERSION AS OF and TIMESTAMP AS OF. Restoring tables with RESTORE. When time travel stops working (after VACUUM). Retention configuration. |
| 4 | Schema Enforcement vs. Evolution | Medium | How mergeSchema and overwriteSchema work. What happens when a write doesn't match the schema. When to use enforcement vs. evolution in production. |
| 5 | Liquid Clustering | Medium | The recommended layout strategy for all new tables. How CLUSTER BY works. Replaces legacy partitioning and Z-ORDER. Clustering triggered on writes that exceed size thresholds. Why OPTIMIZE is still recommended for comprehensive clustering. |
| 6 | OPTIMIZE & File Compaction | Medium | Why small files kill read performance (causes: frequent appends, streaming micro-batches). How OPTIMIZE rewrites files and triggers Liquid Clustering. Auto Compaction as a lighter-weight alternative. When to rely on Predictive Optimization (#53) vs. running manually. |
| 7 | VACUUM & Storage Lifecycle | Medium | How VACUUM removes unreferenced files. The 7-day default retention and why lowering it breaks time travel. Interaction with concurrent readers. How Predictive Optimization automates VACUUM on managed tables. |
| 8 | Change Data Feed (CDF) | Medium | How to enable and query table_changes(). The _change_type column (insert, update_preimage, update_postimage, delete). Use cases: incremental downstream processing, audit trails, replication. |
| 9 | Deletion Vectors | Medium | Soft-delete markers that avoid rewriting files on DELETE, UPDATE, and MERGE. Enabled by default on new tables (rolling out). Physical deletion happens at OPTIMIZE/compaction. Compatibility considerations for external readers. |
| 10 | Delta Table Properties & Configuration | Hard | Key table properties every DE manages: delta.enableChangeDataFeed, delta.enableDeletionVectors, delta.columnMapping.mode, delta.minReaderVersion/minWriterVersion, delta.checkpointInterval. How these affect compatibility, performance, and feature availability. Checking and setting with ALTER TABLE SET TBLPROPERTIES. |
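Most of the Delta concepts above are one-liners in SQL. A quick sketch of #3 and #10 — the table name (events) and version numbers are illustrative:

```sql
-- Time travel (#3): query earlier snapshots of the table
SELECT COUNT(*) FROM events VERSION AS OF 12;
SELECT * FROM events TIMESTAMP AS OF '2026-01-01';

-- Roll the table back to a known-good version
RESTORE TABLE events TO VERSION AS OF 12;

-- Table properties (#10): inspect, then enable Change Data Feed (#8)
SHOW TBLPROPERTIES events;
ALTER TABLE events SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
```

Note that RESTORE writes a new commit — the "bad" versions stay in history until VACUUM removes their files.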
Spark is the compute engine. Senior engineers don't just write transformations; they understand how Spark executes them and why it matters.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 11 | Lazy Evaluation & Actions | Easy | Why transformations build a plan but don't execute. Which operations are actions (collect, count, write, show) vs. transformations (filter, select, join). Why this enables optimization. |
| 12 | Photon Engine | Easy | Native vectorized C++ query engine for SQL and DataFrame operations. Which operations benefit most (scans, filters, aggregations, joins). How to verify Photon is active in query plans. Automatically enabled on serverless compute. |
| 13 | Catalyst Optimizer & Query Plans | Medium | How Catalyst transforms logical plans into physical plans. Reading EXPLAIN output. Recognizing when the optimizer makes good vs. poor decisions. |
| 14 | Adaptive Query Execution (AQE) | Medium | How AQE adjusts plans at runtime based on actual data statistics. Three key optimizations: coalescing shuffle partitions, converting sort-merge joins to broadcast, optimizing skew joins. Enabled by default. |
| 15 | Shuffle Operations | Medium | What triggers a shuffle (groupBy, join, repartition, distinct). Why shuffles are expensive (serialization + network + disk I/O). Reading shuffle metrics in Spark UI. |
| 16 | Join Strategies | Medium | The three main strategies: broadcast hash join (small table fits in memory), sort-merge join (default for large-large), shuffle hash join (rare). The 10MB broadcast threshold (higher with Photon). Using broadcast() hints. How AQE dynamically converts at runtime. |
| 17 | Reading the Spark UI | Medium | The tabs that matter: Jobs, Stages, Tasks, SQL. What to look for: task duration variance, spill metrics, shuffle read/write sizes, scan statistics. Diagnosing bottlenecks from the UI. |
| 18 | Partitioning & Parallelism | Medium | How spark.sql.shuffle.partitions controls parallelism after shuffles. When to repartition vs. coalesce. The relationship between partition count, core count, and task parallelism. How to diagnose over/under-partitioning from Spark UI task metrics. |
| 19 | Data Skew: Detection & Mitigation | Hard | Recognizing skew in Spark UI (one task takes 10x longer). Why skew kills parallelism. Salting technique for skewed keys. AQE's automatic skew join optimization. When manual salting is still needed. |
| 20 | Memory Management & Spill | Hard | The unified memory model (execution + storage pool). What causes spill to disk. How to diagnose spill from Spark UI. Tuning spark.sql.shuffle.partitions and executor memory. |
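Two of these you can try straight from a SQL cell. A sketch of #13 and #16 — the table names (fact, dim) are illustrative:

```sql
-- Read the plan Catalyst produces (#13): look for the chosen join strategy
EXPLAIN FORMATTED
SELECT f.id, d.name
FROM fact f JOIN dim d ON f.dim_id = d.id;

-- Force a broadcast hash join when you know dim is small (#16);
-- without the hint, AQE may still convert at runtime anyway
SELECT /*+ BROADCAST(d) */ f.id, d.name
FROM fact f JOIN dim d ON f.dim_id = d.id;
```

Compare the EXPLAIN output with and without the hint — seeing BroadcastHashJoin replace SortMergeJoin in the physical plan is the fastest way to internalize the difference.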
The daily work. Transforming data, writing queries, building ELT logic.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 21 | DataFrame API: Core Operations | Easy | select, filter, withColumn, groupBy, agg, join. Method chaining patterns. When to use DataFrame API vs. Spark SQL (same execution plan either way). |
| 22 | Temp Views & Global Temp Views | Easy | Scope: temp views live in the SparkSession, global temp views in global_temp and survive across notebooks on the same cluster. Neither persists across restarts. |
| 23 | Window Functions | Medium | ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, running totals with SUM OVER. Partition and ordering clauses. How these translate to physical execution (shuffles on partition keys). |
| 24 | Complex Types: ARRAY, STRUCT, MAP | Medium | Creating and querying nested structures. Exploding arrays with explode() and posexplode(). Struct field access with dot notation. Flattening deeply nested JSON. |
| 25 | Higher-Order Functions | Medium | transform(), filter(), exists(), aggregate() for array manipulation without exploding. When these outperform explode + groupBy. Real patterns: cleaning arrays, conditional filtering. |
| 26 | CTEs, PIVOT & Subqueries | Medium | Writing CTEs for readable multi-step queries. PIVOT for wide-format reshaping. Correlated vs. uncorrelated subqueries. How these map to physical plans. |
| 27 | UDFs vs. Native Functions | Medium | Why Python UDFs break Catalyst and force row-by-row execution. The 10-100x performance gap. Pandas UDFs (vectorized) as a middle ground. When a UDF is genuinely the right choice. |
| 28 | VARIANT Type | Medium | Storing semi-structured data (JSON) in a native binary format. Schema-on-read without parsing overhead. Querying with : path syntax. When VARIANT outperforms STRING or STRUCT columns for JSON data. |
| 29 | MERGE INTO & Upsert Patterns | Hard | The MERGE syntax for insert/update/delete in one statement. Clause ordering (MATCHED before NOT MATCHED). Write amplification and why MERGE rewrites entire files. Optimization: match on Liquid Clustering keys. How deletion vectors reduce cost. |
| 30 | Table Constraints & Generated Columns | Medium | CHECK constraints, NOT NULL, PRIMARY KEY (informational), FOREIGN KEY (informational). Generated columns with expressions (e.g., date GENERATED ALWAYS AS CAST(timestamp AS DATE)). How constraints enforce data quality at the write layer. |
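A hedged sketch of #29 and #25 in SQL — table and column names (customers, updates, tags) are illustrative:

```sql
-- Upsert pattern (#29): MATCHED clauses evaluate before NOT MATCHED
MERGE INTO customers AS t
USING updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Higher-order function (#25): clean an array without exploding it
SELECT filter(tags, x -> x IS NOT NULL AND x <> '') AS clean_tags
FROM customers;
```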
Getting data into the lakehouse reliably, incrementally, and at scale.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 31 | COPY INTO vs. Auto Loader | Easy | COPY INTO for ad-hoc/batch loads (idempotent, simpler). Auto Loader for continuous/streaming ingestion (exactly-once, scalable, schema evolution). Why Auto Loader is preferred for production. |
| 32 | File Formats in the Lakehouse | Easy | Why Parquet is the default (columnar, compressed, schema-embedded). Delta stores Parquet under the hood. When to use JSON, CSV, or Avro for ingestion sources. Schema-on-read vs. schema-on-write. |
| 33 | Volumes | Easy | Governed access to non-tabular files (CSVs, images, ML models) within Unity Catalog. Managed vs. external volumes. Using Volumes as landing zones and staging areas. |
| 34 | Auto Loader: File Discovery | Medium | File notification mode (cloud events, efficient at scale, requires setup) vs. directory listing mode (simpler, slower with millions of files). When to use each. Why Auto Loader is the default for production ingestion. |
| 35 | Auto Loader: Schema Evolution & Rescued Data | Medium | How schema inference works from the first batch. cloudFiles.schemaEvolutionMode options (addNewColumns, rescue, failOnNewColumns, none). The _rescued_data column for capturing mismatches instead of failing. |
| 36 | Multi-Hop Medallion Architecture | Medium | Bronze (raw, append-only), Silver (cleaned, deduplicated, typed), Gold (business aggregates). Why this pattern exists. When it's overkill. How each layer serves different consumers with different freshness requirements. |
| 37 | Incremental Processing Patterns | Medium | Processing only new/changed data. Techniques: Auto Loader (file-level), Change Data Feed (row-level), watermark-based (time-based), merge keys (hash-based). Cost vs. latency tradeoffs for each. |
| 38 | Lakehouse Federation | Medium | Querying external databases (PostgreSQL, MySQL, SQL Server, Snowflake) without copying data. Query pushdown optimization. Connection objects and security. When federation beats ETL and when it doesn't. |
| 39 | Idempotent Pipeline Design | Hard | Why every production pipeline must produce the same result when re-run. Databricks-specific techniques: MERGE with natural keys, INSERT OVERWRITE with replaceWhere, Auto Loader checkpoint recovery, Workflows retry policies. How streaming checkpoints provide built-in idempotency. |
| 40 | SCD Type 1 & Type 2 Patterns | Hard | SCD1 (overwrite current values) vs. SCD2 (maintain full history with effective dates). Implementing SCD2 with MERGE. Why Lakeflow Declarative Pipelines' AUTO CDC INTO (#77) is simpler for SCD2 at scale. |
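Concept #31 in practice — a sketch of an idempotent batch load; the target table and Volume path below are illustrative:

```sql
-- COPY INTO (#31): re-running this skips files it has already loaded
COPY INTO bronze.orders
FROM '/Volumes/main/landing/orders/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
```

The idempotency is the point: file-level tracking means a retried job never double-loads, which is exactly the property #39 asks you to design for everywhere else.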
Processing data as it arrives. Databricks Structured Streaming and its integration with the lakehouse.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 41 | Structured Streaming Fundamentals | Medium | How Structured Streaming on Databricks uses Delta Lake as both source and sink. The readStream/writeStream API. Why Delta-to-Delta streaming is the most common production pattern. Integration with Auto Loader as a file source. |
| 42 | Triggers | Medium | processingTime (recurring micro-batches at fixed intervals). availableNow (process all available data then stop; replaces deprecated once). Choosing the right trigger for latency vs. cost. |
| 43 | Output Modes | Medium | Append (new rows only, default). Complete (full result table, aggregations only). Update (changed rows only). Which modes work with which operations and sinks. |
| 44 | Stream-Static Joins | Medium | Joining streaming DataFrames with batch DataFrames. Static side re-read each micro-batch. Use cases: enriching events with dimension data. Outer join limitations on the streaming side. |
| 45 | Windowed Aggregations | Medium | Tumbling windows (fixed, non-overlapping), sliding windows (fixed, overlapping), session windows (gap-based). Combining windows with watermarks for bounded state. |
| 46 | foreachBatch Pattern | Medium | Using foreachBatch() to apply arbitrary batch logic to each micro-batch. Use cases: writing to multiple sinks, MERGE operations in streaming, calling external APIs. Exactly-once considerations. |
| 47 | Streaming Tables in Lakeflow Pipelines | Medium | How streaming tables in Lakeflow Declarative Pipelines abstract away checkpoint and state management. The difference between streaming tables and raw Structured Streaming. When to use each approach. |
| 48 | Checkpointing & Exactly-Once | Hard | How checkpoint directories track processing state. Stream restart behavior (replay from checkpoint). Why checkpoint location must be stable and unique per query. Limitations of exactly-once guarantees. |
| 49 | Watermarks & Late Data | Hard | How withWatermark() bounds how late data can arrive. The tradeoff: longer watermark = more complete results but higher memory/state. How Spark drops state for expired windows. |
| 50 | Stream-Stream Joins | Hard | Joining two streaming DataFrames. Watermark requirements for state cleanup. Inner, left outer, and right outer join semantics. State management implications. |
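Concept #47 as a sketch — table names are illustrative, and this assumes you are running in Databricks SQL or a Lakeflow Declarative Pipeline, where checkpoints and state are managed for you:

```sql
-- Streaming table (#47): incremental, append-only processing of new rows only
CREATE OR REFRESH STREAMING TABLE silver_events
AS SELECT * FROM STREAM(bronze_events)
WHERE event_type IS NOT NULL;
```

Contrast this with raw readStream/writeStream (#41, #48), where you own the checkpoint location, trigger, and restart semantics yourself.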
The difference between a pipeline that costs $50/month and one that costs $5,000/month.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 51 | Serverless Compute Economics | Easy | How serverless pricing works (DBU-based, pay per query/task vs. per cluster-hour). When serverless is cheaper vs. classic (bursty workloads vs. sustained compute). Why serverless eliminates idle cluster waste. |
| 52 | Predicate Pushdown & Data Skipping | Medium | How Spark skips reading files/partitions that can't match filter predicates. File-level min/max column statistics. What breaks pushdown (casting, functions on filter columns). How Liquid Clustering keys maximize skipping. |
| 53 | Predictive Optimization | Medium | Databricks automatically runs OPTIMIZE, VACUUM, and ANALYZE on managed tables based on query patterns. Enabled by default on new accounts. Why you rarely need manual maintenance jobs on managed tables. Monitoring via system tables. |
| 54 | Column Statistics & File Pruning | Medium | Per-file min/max statistics collected for the first 32 columns by default (managed tables with Predictive Optimization choose columns intelligently). Why column order matters. How ANALYZE TABLE refreshes stats for the cost-based optimizer. |
| 55 | Write Optimization | Medium | optimizeWrite for automatic file sizing on write. maxRecordsPerFile for controlling output file sizes. When to coalesce before writing. Target file sizes for different workloads. How write optimization prevents the small file problem before it starts. |
| 56 | Cluster Sizing & Autoscaling | Medium | Right-sizing for ETL vs. ML vs. SQL analytics workloads. Autoscaling min/max workers. Spot instances for cost savings. The relationship between partition count, core count, and parallelism. |
| 57 | DBU Cost Model & Billing | Medium | How Databricks bills (DBUs by SKU: Jobs, SQL, All-Purpose, Serverless). Understanding the billing system tables (system.billing.usage). Cluster policies and budgets to prevent runaway costs. Tagging jobs for cost allocation and chargeback. |
| 58 | Query Profile & Execution Plans | Medium | Reading the DBSQL Query Profile (visual plan, statistics, bottleneck identification). The difference between logical and physical plans. Identifying slow operators: scans, joins, aggregations, sorts. |
| 59 | Join Performance Diagnosis | Hard | When a join is the bottleneck: reading shuffle sizes, spill metrics, and task skew in Spark UI for join stages. Tuning spark.sql.autoBroadcastJoinThreshold. Choosing between forcing broadcast hints vs. redesigning the query. When to pre-aggregate or filter before joining to reduce shuffle volume. |
| 60 | MERGE Write Amplification | Hard | Why MERGE rewrites entire files containing matched rows. Optimization: match on Liquid Clustering keys, leverage deletion vectors, right-size target files. Measuring amplification with DESCRIBE HISTORY. |
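Two quick diagnostics from #57 and #60 — the silver.orders table is illustrative; the column names assume the documented system.billing.usage schema:

```sql
-- Where are DBUs going? (#57) Daily usage by SKU from the billing system table
SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
FROM system.billing.usage
GROUP BY usage_date, sku_name
ORDER BY usage_date DESC;

-- Measure MERGE write amplification (#60): operationMetrics shows
-- rows matched vs. files added/removed per commit
DESCRIBE HISTORY silver.orders LIMIT 5;
```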
Security, access control, and data governance. Required for every production Databricks environment.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 61 | Three-Level Namespace | Easy | catalog.schema.table hierarchy. How it maps to organizational structure. Default catalog and schema settings. Migrating from the legacy Hive Metastore. |
| 62 | Permission Model & Inheritance | Medium | How privileges cascade down (catalog -> schema -> table). Key privileges: USE CATALOG, USE SCHEMA, SELECT, MODIFY, CREATE TABLE. Owner vs. granted access. GRANT and REVOKE syntax. |
| 63 | Dynamic Views for Security | Medium | Using CURRENT_USER() and IS_MEMBER() in view definitions. Column masking patterns (full SSN for HR, masked for everyone else). When to use dynamic views vs. row filters/column masks. |
| 64 | Data Lineage | Medium | How Unity Catalog automatically tracks column-level lineage across notebooks, jobs, and SQL. Viewing lineage in Catalog Explorer. Using lineage for impact analysis before schema changes. |
| 65 | Information Schema & Metadata Queries | Medium | Querying system.information_schema.tables, .columns, .table_privileges to discover and audit catalog contents. Building compliance reports from metadata. Automating governance checks. |
| 66 | External Locations & Storage Credentials | Medium | Mapping cloud storage paths to Unity Catalog governance. Storage credentials (IAM roles, service principals). When to use managed vs. external tables based on data ownership requirements. |
| 67 | Service Principals & Managed Identity | Medium | Using service principals for automated jobs instead of personal credentials. Managed identity on Azure, instance profiles on AWS. Why production workloads must never run under personal accounts. |
| 68 | Audit Logging | Medium | system.access.audit for tracking who did what and when. Querying audit logs for compliance, access reviews, and security investigations. Building audit dashboards. |
| 69 | Row Filters & Column Masks | Hard | Applying row-level and column-level security directly to tables via SQL UDFs. ALTER TABLE SET ROW FILTER. ALTER TABLE ALTER COLUMN SET MASK. When to use these vs. dynamic views. |
| 70 | Delta Sharing | Hard | Secure, cross-organization data sharing without copying data. Open protocol (works with non-Databricks recipients). Share objects, recipients, activation links. Sharing live tables vs. snapshots. |
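Concept #63 as a sketch — the schema, group, and column names are illustrative:

```sql
-- Dynamic view (#63): HR admins see the real SSN, everyone else sees a mask
CREATE OR REPLACE VIEW hr.employees_safe AS
SELECT
  employee_id,
  CASE WHEN is_member('hr_admins') THEN ssn
       ELSE '***-**-****' END AS ssn
FROM hr.employees;

-- Grant access to the view, not the underlying table (#62)
GRANT SELECT ON VIEW hr.employees_safe TO `analysts`;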
The managed pipeline engine for reliable, maintainable data pipelines. Formerly known as Delta Live Tables (DLT).
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 71 | Python vs. SQL Pipeline Syntax | Easy | SQL: CREATE STREAMING TABLE, CREATE MATERIALIZED VIEW. Python: @table, @materialized_view, @temporary_view decorators (from pyspark.pipelines). The older import dlt syntax still works. Choosing based on team skills. |
| 72 | Streaming Tables vs. Materialized Views | Medium | Streaming tables for append-only ingestion with exactly-once guarantees. Materialized views for aggregations that need full recomputation. How the engine manages refresh, schema, and state automatically. |
| 73 | Expectations (Quality Rules) | Medium | Defining data quality constraints that warn, drop, or fail on violation. Monitoring expectation pass/fail rates in the pipeline UI. Using expectations as production data quality gates. |
| 74 | Pipeline Development & Testing | Medium | Development vs. production modes. Incremental vs. full refresh. Testing pipeline logic locally before deployment. The pipeline settings and configuration options. |
| 75 | Pipeline Monitoring & Event Log | Medium | Reading the event log for pipeline run history, data quality metrics, and lineage. Setting up alerts on pipeline failures or expectation violations. Integration with Databricks SQL dashboards. |
| 76 | Pipeline Modes: Triggered vs. Continuous | Medium | Triggered pipelines (batch-like, process available data then stop) vs. continuous pipelines (always-on, sub-second latency). Cost and latency tradeoffs. When each mode is appropriate. |
| 77 | CDC Processing: AUTO CDC INTO | Hard | Processing Change Data Capture feeds with AUTO CDC INTO (SQL) or create_auto_cdc_flow() (Python). Formerly APPLY CHANGES INTO. Automatic SCD Type 1 and Type 2 handling. Sequencing columns and deduplication. |
| 78 | Error Handling & Dead Letter Patterns | Hard | Routing bad records to quarantine tables instead of failing the pipeline. Using expectations with ON VIOLATION DROP ROW plus a separate quality monitoring table. Building observability into pipeline design. |
| 79 | Pipeline Parameters & Configuration | Hard | Parameterizing pipelines for dev/staging/prod environments. Configuration management patterns. Passing runtime settings. Managing pipeline dependencies across environments. |
| 80 | Multi-Pipeline Architecture | Hard | Splitting large pipelines into smaller, focused pipelines with clear boundaries. Sharing tables between pipelines. Managing dependencies and execution ordering. When a single pipeline becomes too complex to maintain. |
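Concept #73 as a sketch — table and constraint names are illustrative:

```sql
-- Expectations (#73): drop bad rows, fail the update on impossible ones
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_id    EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_total EXPECT (total >= 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(bronze_orders);
```

Pass/drop/fail counts for each constraint land in the pipeline event log (#75), which is where the dead-letter and monitoring patterns in #78 start.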
Running Databricks in production. The difference between a notebook that works and a system that's reliable.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 81 | Choosing Compute for Jobs | Easy | Serverless compute as the default for new jobs (no cluster management, Photon auto-enabled). When to use job clusters instead (custom init scripts, GPU workloads, unsupported task types). Why all-purpose clusters are for development, not production. Instance pools for fast classic cluster startup. |
| 82 | Secrets Management | Easy | Databricks Secret Scopes for credentials (API keys, database passwords). dbutils.secrets.get() in notebooks. Key-vault backed scopes (Azure Key Vault, AWS Secrets Manager). Why secrets must never be hardcoded. |
| 83 | Parameterization: Widgets, Job Parameters, Task Values | Easy | dbutils.widgets for notebook parameters. Job parameters passed from Workflows. Task values for inter-task communication. Databricks SQL parameters for parameterized queries. |
| 84 | Notebook Development Patterns | Easy | %run for shared utilities. dbutils.notebook.run() for orchestration. Magic commands (%sql, %python, %md). Notebook-scoped libraries. When to use notebooks vs. Python files in production. |
| 85 | Databricks Workflows: Multi-Task Jobs | Medium | Creating jobs with multiple tasks (notebooks, Python scripts, SQL, dbt, JARs). Task dependencies (linear, fan-out, fan-in). Run conditions and error handling. |
| 86 | Task Dependencies, Retries & Conditional Logic | Medium | Per-task retry policies and timeout limits. Conditional execution based on upstream task outcomes. Task values for passing data between tasks. Email and webhook alerting on failure. |
| 87 | Git Integration & Repos | Medium | Connecting Databricks to GitHub/GitLab/Azure DevOps. Databricks Repos for version-controlled notebooks. Branch-based development. Pull request workflows for notebook code. |
| 88 | Monitoring, Alerting & SQL Alerts | Medium | Setting up SQL alerts in DBSQL (threshold-based notifications). System tables for operational monitoring. Cluster event logs. Integrating with PagerDuty, Slack, or email for pipeline failures. |
| 89 | System Tables for Operations | Medium | system.billing.usage for cost tracking and DBU consumption by SKU. system.compute.clusters for cluster lifecycle and utilization. Building operational dashboards for cost monitoring and capacity planning. |
| 90 | Databricks Asset Bundles (DABs) | Hard | The databricks.yml configuration file. Defining jobs, pipelines, and models as code. Environment targets (dev/staging/prod). databricks bundle validate/deploy/destroy. GitHub Actions integration. |
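Concept #89 as a sketch — the column names assume the documented system.compute.clusters schema; verify against your workspace before wiring this into a dashboard:

```sql
-- Operational check (#89): recently changed clusters and their sizes
SELECT cluster_id, cluster_name, worker_count, change_time
FROM system.compute.clusters
ORDER BY change_time DESC
LIMIT 20;
```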
Senior-level decisions. Understanding tradeoffs, designing systems, managing complexity.
| # | Concept | Difficulty | What You Should Know |
|---|---|---|---|
| 91 | Shallow vs. Deep Clones | Easy | SHALLOW CLONE (zero-copy, references source files) vs. DEEP CLONE (full copy, independent). Use cases: dev/test environments, table migration, disaster recovery snapshots, schema evolution testing. |
| 92 | Managed vs. External Tables: Architecture Tradeoffs | Medium | When managed tables are the right default (governance, Predictive Optimization, automatic lifecycle). When external tables are necessary (multi-engine access, data ownership boundaries, regulatory requirements). |
| 93 | Legacy: Partitioning & Z-ORDER | Medium | Why these are legacy approaches superseded by Liquid Clustering. When partitioning still makes sense (very large tables, low-cardinality columns, non-Databricks reader compatibility). Z-ORDER mechanics for maintaining legacy tables. Migration planning. |
| 94 | Lakehouse Table Design Patterns | Medium | When to denormalize vs. normalize in a lakehouse (wide tables for analytics, normalized for transactional). Star schema in Gold layer. Choosing between many small tables vs. fewer wide tables. Clustering key selection and column ordering for statistics. |
| 95 | Delta UniForm & Multi-Engine Interoperability | Medium | How UniForm enables a single Delta table to be read as Apache Iceberg or Apache Hudi. When to use UniForm vs. Delta Sharing vs. Lakehouse Federation for cross-platform access. Enabling and configuring. |
| 96 | Compute Policies & Cluster Governance | Medium | Using cluster policies to control what users can provision (instance types, max workers, spot ratios, auto-termination). Preventing cost overruns. Balancing developer freedom with budget discipline. |
| 97 | AI Functions for Data Engineers | Medium | Using ai_query() and ai_gen() SQL functions for data enrichment, classification, and extraction within pipelines. When AI functions replace complex regex/UDF logic. Cost and latency considerations. |
| 98 | Multi-Workspace Architecture | Hard | When to use one workspace vs. many (team isolation, compliance boundaries, regional deployment). Sharing Unity Catalog metastore across workspaces. Cross-workspace job orchestration. Account-level vs. workspace-level governance. |
| 99 | Performance Troubleshooting Methodology | Hard | A structured approach to diagnosing slow pipelines: check Spark UI stage breakdown, identify shuffle/spill bottlenecks, examine query plan for full scans or bad joins, verify clustering/statistics, check cluster utilization. Building runbooks for common performance patterns. |
| 100 | Testing Patterns for Data Pipelines | Hard | Unit testing transformations with sample DataFrames. Integration testing with representative data subsets. Data quality assertions beyond Lakeflow expectations. Testing schema evolution scenarios. CI/CD test pipelines with DABs. |
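Concept #91 as a sketch — catalog and table names are illustrative:

```sql
-- Zero-copy dev environment (#91): shallow clone references prod's files,
-- so it's instant and nearly free until you modify it
CREATE TABLE dev.orders_test SHALLOW CLONE prod.orders;

-- Independent full copy for DR snapshots or migration
CREATE TABLE backup.orders DEEP CLONE prod.orders;
```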
| Resource | What It Covers | Link |
|---|---|---|
| DataDojo | 633+ practice exercises across all categories | dojo.dataengineer.wiki |
| Data Engineer Wiki | Cheat sheets, learning paths, cert prep guides | dataengineer.wiki |
| Hands-On Labs | Production-grade Databricks labs on GitHub | GitHub Labs |
| Interview Cheat Sheet | 86 senior-level Databricks interview Q&A | dataengineer.wiki/cheat-sheet-senior |
| Databricks Academy | Official free courses | academy.databricks.com |
| Databricks Docs | Official reference | docs.databricks.com |
Found an error? Think a concept should be added or replaced? Open an issue or PR.
Inclusion criteria: A concept must be something 80%+ of Databricks data engineers encounter at least monthly in production. Niche edge cases don't make the list.
Why 100? Databricks has hundreds of features. These are the 100 concepts Databricks data engineers encounter most often in production. Every concept earns its spot through a universality test: if most engineers rarely encounter it, it's not on the list.
Is this for the Associate or Professional cert? Both. The Associate covers roughly 60 of the 100. The Professional covers all 100 at production depth.
I scored under 50. Should I panic? No. Most engineers have gaps they don't know about. That's the point of this list. Focus on the Easy concepts first, re-score in 3 months.
Can I use this for interview prep? Yes. These concepts appear in real Databricks data engineering interviews. The Interview Cheat Sheet has detailed answers for the most common questions.
Something is outdated or wrong. Open an issue. Databricks evolves fast. This list reflects 2026 best practices (Liquid Clustering over Z-ORDER, Lakeflow Declarative Pipelines over DLT, serverless as default, Predictive Optimization, etc.). We update when the platform changes.
MIT License. Use it, share it, fork it.
Ready to start? Fork the repo, pick concept #1, and post your first #100DaysOfDatabricks on LinkedIn.
Built and maintained by dataengineer.wiki.
