SQL Server · ETL Pipelines · Medallion Architecture · Star Schema Modeling
A hands-on data engineering project building a complete data warehouse from scratch using SQL Server and the Medallion Architecture (Bronze, Silver, Gold layers). This project demonstrates end-to-end data pipeline development, from raw data ingestion to business-ready analytics models.
As a data analyst, I usually work with data that has already been cleaned, transformed, and modeled. While this allows me to focus on insights, it also means that much of the upstream logic—how data arrives in its final form and why certain decisions are made—can remain hidden.
To close this gap, I built a complete end-to-end data warehouse by following the Data With Baraa tutorial. The goal was not to become a data engineer, but to gain a deeper understanding of the processes that make reliable analytics possible.
I learned how to trace data from source systems (ERP/CRM) through multiple transformation layers, understanding which business rules are applied and how they impact analytical results. This helps me better assess data reliability and explain metric behavior with proper context.
Working directly with raw data exposed common data quality issues such as duplicates, missing values, and inconsistencies. I now understand why these issues occur, how to detect them early, and which fields require validation before analysis.
By designing fact and dimension tables myself, I gained practical insight into surrogate keys, schema design trade-offs, and performance considerations. This allows me to write more efficient queries and suggest modeling improvements aligned with analytical use cases.
I can now discuss ETL logic, incremental vs. full loads, pipeline performance, and integration challenges using the correct technical language, enabling more effective collaboration and clearer requirements.
I learned how to structure reusable transformations in the Silver layer and build analyst-friendly views in the Gold layer, reducing repetitive fixes and improving consistency across reports and dashboards.
This project strengthened my analytical skill set by giving me full visibility into how data is prepared. As a result, I can:
- Ask better analytical questions
- Validate insights more critically
- Communicate findings with clear data context
- Collaborate more effectively with data engineers
- Build more robust and maintainable analyses
This project follows the Medallion Architecture with three distinct layers:
**Bronze Layer**
- Purpose: Store raw, unprocessed data exactly as received from source systems
- Source Systems: ERP (sales transactions) and CRM (customer information)
- Format: CSV files loaded via `BULK INSERT`
- Key Learning: Implemented full-load ETL with `TRUNCATE TABLE` + `BULK INSERT`, error handling with `TRY/CATCH`, and performance measurement using `DATEDIFF()`
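A minimal sketch of such a full-load procedure might look like this (the procedure, table, and file names here are illustrative, not the tutorial's exact ones):

```sql
-- Hypothetical Bronze full-load procedure: truncate, bulk insert, time the run.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    DECLARE @start DATETIME = GETDATE();
    BEGIN TRY
        TRUNCATE TABLE bronze.crm_cust_info;        -- full load: wipe, then reload
        BULK INSERT bronze.crm_cust_info
        FROM 'C:\datasets\cust_info.csv'            -- illustrative file path
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
        PRINT 'Load duration: '
            + CAST(DATEDIFF(SECOND, @start, GETDATE()) AS NVARCHAR(10)) + 's';
    END TRY
    BEGIN CATCH
        PRINT 'Load failed: ' + ERROR_MESSAGE();    -- surface the error, don't swallow it
    END CATCH
END;
```

Wrapping the load in `TRY/CATCH` keeps a single bad file from failing silently, and the `DATEDIFF()` timing gives a baseline for spotting performance regressions across runs.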
**Silver Layer**
- Purpose: Apply data quality transformations without changing the data model
- Transformations Applied:
  - Deduplication: Used `ROW_NUMBER() OVER (PARTITION BY ...)` to keep only the most recent records
  - Standardization: Replaced codes (F/M) with full descriptions (Female/Male) using `CASE WHEN`
  - Null Handling: Replaced missing values with business-friendly defaults ('Not Available', 0)
  - Date Enrichment: Used the `LEAD()` window function to calculate end dates for historical records
  - Type Casting: Converted integer date fields to proper `DATE` types
  - Business Rules: Recalculated incorrect metrics (Sales = Quantity × Price) using `ABS()` and `NULLIF()`
- Key Learning: Gained hands-on experience with data quality patterns I frequently encounter in analysis—now I understand how they should be fixed at the source
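A hedged sketch of how these Silver-layer patterns combine in practice (table and column names are illustrative, not the tutorial's exact ones):

```sql
-- Deduplication + standardization + null handling in one pass
INSERT INTO silver.crm_cust_info (cust_id, gender, create_date)
SELECT cust_id,
       CASE UPPER(TRIM(gender))
            WHEN 'F' THEN 'Female'           -- standardize coded values
            WHEN 'M' THEN 'Male'
            ELSE 'Not Available'             -- business-friendly default
       END,
       create_date
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cust_id
                              ORDER BY create_date DESC) AS rn
    FROM bronze.crm_cust_info
) AS ranked
WHERE rn = 1;                                -- keep only the most recent record

-- Date enrichment: close each historical record at the next one's start
SELECT prd_id,
       prd_start_dt,
       DATEADD(DAY, -1,
               LEAD(prd_start_dt) OVER (PARTITION BY prd_id
                                        ORDER BY prd_start_dt)) AS prd_end_dt
FROM bronze.crm_prd_info;

-- Business rule: recalculate Sales = Quantity × Price defensively
SELECT quantity,
       ABS(price)                  AS price_clean,    -- strip stray negative signs
       quantity * ABS(price)       AS sales_fixed,
       sales / NULLIF(quantity, 0) AS derived_price   -- NULLIF() avoids divide-by-zero
FROM bronze.crm_sales_details;
```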
**Gold Layer**
- Purpose: Provide consumption-ready data modeled for analytical queries
- Data Model: Star schema with fact and dimension tables
- Implementation: SQL views (not materialized tables) for flexibility
- Key Components:
  - Surrogate Keys: Generated using `ROW_NUMBER()` for dimension primary keys
  - Data Integration: Combined CRM + ERP data using `LEFT JOIN` and `COALESCE()` to pick the best source
  - Fact Table: `fact_sales` with foreign keys to all dimensions
  - Dimensions: `dim_customers`, `dim_products`, `dim_stores`, `dim_date`
- Key Learning: Understood why star schemas are optimal for analytics (simplified joins, clear business context, performance)
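A sketch of what such a Gold-layer dimension view can look like (all object and column names here are illustrative assumptions):

```sql
-- Hypothetical Gold dimension view: surrogate key + CRM/ERP integration
CREATE OR ALTER VIEW gold.dim_customers AS
SELECT ROW_NUMBER() OVER (ORDER BY c.cust_id)       AS customer_key,  -- surrogate key
       c.cust_id                                    AS customer_id,   -- business key
       c.first_name,
       COALESCE(c.gender, e.gender, 'Not Available') AS gender,       -- best available source
       e.birth_date
FROM silver.crm_cust_info c
LEFT JOIN silver.erp_cust_az12 e
       ON c.cust_id = e.cust_id;                    -- keep all CRM customers
```

Because this is a view, the surrogate key is regenerated on every query; that is fine for a learning project, though production warehouses typically persist dimension keys in tables so they stay stable between loads.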
**SQL Skills**
- DDL (Data Definition Language): Created schemas, tables, and views with proper naming conventions (snake_case)
- DML (Data Manipulation Language): `TRUNCATE`, `INSERT INTO ... SELECT`, `BULK INSERT`
- Window Functions: `ROW_NUMBER()`, `LEAD()`, `PARTITION BY`, `ORDER BY`
- Joins: `LEFT JOIN`, `INNER JOIN` for multi-source data integration
- Functions: `TRIM()`, `SUBSTRING()`, `REPLACE()`, `CAST()`, `COALESCE()`, `ABS()`, `NULLIF()`, `GETDATE()`, `DATEDIFF()`
- Stored Procedures: Encapsulated ETL logic in reusable procedures (`bronze.load_bronze`, `silver.load_silver`)
- Error Handling: `BEGIN TRY ... END TRY` / `BEGIN CATCH ... END CATCH` blocks
- Performance Measurement: Calculated pipeline execution time for optimization
**Data Engineering Concepts**
- ETL Pipelines: Extract, Transform, Load workflows with a full-load strategy
- Data Quality: Deduplication, null handling, standardization, validation
- Data Integration: Combining multiple source systems into unified dimensions
- Data Modeling: Star schema design with fact and dimension tables
- Surrogate Keys: Generated non-business keys for dimensional integrity
- Data Lineage: Tracked data flow from Bronze → Silver → Gold with metadata columns (`DW_create_date`)
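A lineage column like this can be added with a simple default, so every row is stamped at load time (the table name here is illustrative):

```sql
-- Hypothetical lineage column: record when each row entered the warehouse
ALTER TABLE silver.crm_cust_info
    ADD DW_create_date DATETIME2 DEFAULT GETDATE();
```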
**Engineering Practices**
- Naming Conventions: Consistent snake_case for all database objects
- Schema Organization: Logical separation using Bronze/Silver/Gold schemas
- Version Control: Git repository with structured folders (scripts, docs, tests, datasets)
- Documentation: Data architecture diagrams, data flow diagrams, data catalog
- Idempotency: Scripts check for object existence and recreate cleanly
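An idempotent DDL script typically checks for the object before recreating it, so the script can be rerun safely; a minimal sketch with an assumed table name:

```sql
-- Idempotent DDL: drop and recreate so the script is safe to rerun
IF OBJECT_ID('silver.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE silver.crm_cust_info;
GO

CREATE TABLE silver.crm_cust_info (
    cust_id     INT,
    first_name  NVARCHAR(50),
    gender      NVARCHAR(20),
    create_date DATE
);
```

SQL Server 2016 and later also support `DROP TABLE IF EXISTS`, which collapses the existence check into a single statement.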
**Key Lessons**
- Data Quality is Foundational: 80% of analytics challenges stem from data quality issues—deduplication, null handling, and standardization should happen at the source, not in every analysis query
- Context Matters More Than Code: Understanding why data is structured a certain way (e.g., why surrogate keys exist, why dates are enriched with `LEAD()`) is more valuable than just knowing how to write the SQL
- Collaboration Requires Shared Language: Speaking the same technical language as data engineers (ETL, pipeline, surrogate keys, data lineage) makes cross-functional work far more effective
- Upstream Thinking Improves Analysis: When I encounter unexpected results in analysis, I now instinctively ask: "Was this handled in Silver? Should it be? Or is this a legitimate business pattern?"
- Documentation is Not Optional: The data catalog and architecture diagrams were essential for understanding the project—and they're just as essential in production environments
**Acknowledgments**
This project was built following the excellent tutorial series by Data With Baraa. All credit for the instructional content, project structure, and teaching methodology goes to Baraa Khatib Salkini.
Original Tutorial:
- Course: SQL Data Warehouse Project
- Materials: Download
- GitHub: Original Repo
This README and repository represent my personal implementation and learning journey through the tutorial, documented for portfolio and educational purposes.
This project is for educational and portfolio purposes. The original tutorial and project structure were created by Data With Baraa and are used here with attribution.
This project is part of my learning journey to become a more well-rounded data analyst by understanding the full data lifecycle—from raw ingestion to analytics-ready models.
