SQL Server · ETL Pipelines · Medallion Architecture · Star Schema Modeling
A hands-on data engineering project building a complete data warehouse from scratch using SQL Server and the Medallion Architecture (Bronze, Silver, Gold layers). This project demonstrates end-to-end data pipeline development, from raw data ingestion to business-ready analytics models.
As a data analyst, I usually work with data that has already been cleaned, transformed, and modeled. While this allows me to focus on insights, it also means that much of the upstream logic—how data arrives in its final form and why certain decisions are made—can remain hidden.
To close this gap, I built a complete end-to-end data warehouse by following the Data With Baraa tutorial. The goal was not to become a data engineer, but to gain a deeper understanding of the processes that make reliable analytics possible.
I learned how to trace data from source systems (ERP/CRM) through multiple transformation layers, understanding which business rules are applied and how they impact analytical results. This helps me better assess data reliability and explain metric behavior with proper context.
Working directly with raw data exposed common data quality issues such as duplicates, missing values, and inconsistencies. I now understand why these issues occur, how to detect them early, and which fields require validation before analysis.
By designing fact and dimension tables myself, I gained practical insight into surrogate keys, schema design trade-offs, and performance considerations. This allows me to write more efficient queries and suggest modeling improvements aligned with analytical use cases.
I can now discuss ETL logic, incremental vs. full loads, pipeline performance, and integration challenges using the correct technical language, enabling more effective collaboration and clearer requirements.
I learned how to structure reusable transformations in the Silver layer and build analyst-friendly views in the Gold layer, reducing repetitive fixes and improving consistency across reports and dashboards.
This project strengthened my analytical skill set by giving me full visibility into how data is prepared. As a result, I can:
- Ask better analytical questions
- Validate insights more critically
- Communicate findings with clear data context
- Collaborate more effectively with data engineers
- Build more robust and maintainable analyses
This project follows the Medallion Architecture with three distinct layers:
**Bronze Layer**
- Purpose: Store raw, unprocessed data exactly as received from source systems
- Source Systems: ERP (sales transactions) and CRM (customer information)
- Format: CSV files loaded via `BULK INSERT`
- Key Learning: Implemented full-load ETL with `TRUNCATE TABLE` + `BULK INSERT`, error handling with `TRY/CATCH`, and performance measurement using `DATEDIFF()`
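A minimal sketch of such a full-load procedure might look like this (the procedure, table, and file names here are illustrative, not the tutorial's exact ones):

```sql
-- Hypothetical Bronze full-load procedure: truncate, bulk insert, time the run.
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    DECLARE @start DATETIME = GETDATE();
    BEGIN TRY
        TRUNCATE TABLE bronze.crm_cust_info;        -- full load: wipe, then reload
        BULK INSERT bronze.crm_cust_info
        FROM 'C:\datasets\cust_info.csv'            -- illustrative file path
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);
        PRINT 'Load duration: '
            + CAST(DATEDIFF(SECOND, @start, GETDATE()) AS NVARCHAR(10)) + 's';
    END TRY
    BEGIN CATCH
        PRINT 'Load failed: ' + ERROR_MESSAGE();    -- surface the error, don't swallow it
    END CATCH
END;
```

Wrapping the load in `TRY/CATCH` keeps a single bad file from failing silently, and the `DATEDIFF()` timing gives a baseline for spotting performance regressions across runs.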
**Silver Layer**
- Purpose: Apply data quality transformations without changing the data model
- Transformations Applied:
  - Deduplication: Used `ROW_NUMBER() OVER (PARTITION BY ...)` to keep only the most recent records
  - Standardization: Replaced codes (F/M) with full descriptions (Female/Male) using `CASE WHEN`
  - Null Handling: Replaced missing values with business-friendly defaults ('Not Available', 0)
  - Date Enrichment: Used the `LEAD()` window function to calculate end dates for historical records
  - Type Casting: Converted integer date fields to proper `DATE` types
  - Business Rules: Recalculated incorrect metrics (Sales = Quantity × Price) using `ABS()` and `NULLIF()`
- Key Learning: Gained hands-on experience with data quality patterns I frequently encounter in analysis—now I understand how they should be fixed at the source
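A hedged sketch of how these Silver-layer patterns combine in practice (table and column names are illustrative, not the tutorial's exact ones):

```sql
-- Deduplication + standardization + null handling in one pass
INSERT INTO silver.crm_cust_info (cust_id, gender, create_date)
SELECT cust_id,
       CASE UPPER(TRIM(gender))
            WHEN 'F' THEN 'Female'           -- standardize coded values
            WHEN 'M' THEN 'Male'
            ELSE 'Not Available'             -- business-friendly default
       END,
       create_date
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cust_id
                              ORDER BY create_date DESC) AS rn
    FROM bronze.crm_cust_info
) AS ranked
WHERE rn = 1;                                -- keep only the most recent record

-- Date enrichment: close each historical record at the next one's start
SELECT prd_id,
       prd_start_dt,
       DATEADD(DAY, -1,
               LEAD(prd_start_dt) OVER (PARTITION BY prd_id
                                        ORDER BY prd_start_dt)) AS prd_end_dt
FROM bronze.crm_prd_info;

-- Business rule: recalculate Sales = Quantity × Price defensively
SELECT quantity,
       ABS(price)                  AS price_clean,    -- strip stray negative signs
       quantity * ABS(price)       AS sales_fixed,
       sales / NULLIF(quantity, 0) AS derived_price   -- NULLIF() avoids divide-by-zero
FROM bronze.crm_sales_details;
```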
**Gold Layer**
- Purpose: Provide consumption-ready data modeled for analytical queries
- Data Model: Star schema with fact and dimension tables
- Implementation: SQL views (not materialized tables) for flexibility
- Key Components:
  - Surrogate Keys: Generated using `ROW_NUMBER()` for dimension primary keys
  - Data Integration: Combined CRM + ERP data using `LEFT JOIN` and `COALESCE()` to pick the best source
  - Fact Table: `fact_sales` with foreign keys to all dimensions
  - Dimensions: `dim_customers`, `dim_products`, `dim_stores`, `dim_date`
- Key Learning: Understood why star schemas are optimal for analytics (simplified joins, clear business context, performance)
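A sketch of what such a Gold-layer dimension view can look like (all object and column names here are illustrative assumptions):

```sql
-- Hypothetical Gold dimension view: surrogate key + CRM/ERP integration
CREATE OR ALTER VIEW gold.dim_customers AS
SELECT ROW_NUMBER() OVER (ORDER BY c.cust_id)       AS customer_key,  -- surrogate key
       c.cust_id                                    AS customer_id,   -- business key
       c.first_name,
       COALESCE(c.gender, e.gender, 'Not Available') AS gender,       -- best available source
       e.birth_date
FROM silver.crm_cust_info c
LEFT JOIN silver.erp_cust_az12 e
       ON c.cust_id = e.cust_id;                    -- keep all CRM customers
```

Because this is a view, the surrogate key is regenerated on every query; that is fine for a learning project, though production warehouses typically persist dimension keys in tables so they stay stable between loads.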
**SQL Skills**
- DDL (Data Definition Language): Created schemas, tables, and views with proper naming conventions (snake_case)
- DML (Data Manipulation Language): `TRUNCATE`, `INSERT INTO ... SELECT`, `BULK INSERT`
- Window Functions: `ROW_NUMBER()`, `LEAD()`, `PARTITION BY`, `ORDER BY`
- Joins: `LEFT JOIN`, `INNER JOIN` for multi-source data integration
- Functions: `TRIM()`, `SUBSTRING()`, `REPLACE()`, `CAST()`, `COALESCE()`, `ABS()`, `NULLIF()`, `GETDATE()`, `DATEDIFF()`
- Stored Procedures: Encapsulated ETL logic in reusable procedures (`bronze.load_bronze`, `silver.load_silver`)
- Error Handling: `BEGIN TRY ... END TRY` / `BEGIN CATCH ... END CATCH` blocks
- Performance Measurement: Calculated pipeline execution time for optimization
**Data Engineering Concepts**
- ETL Pipelines: Extract, Transform, Load workflows with a full-load strategy
- Data Quality: Deduplication, null handling, standardization, validation
- Data Integration: Combining multiple source systems into unified dimensions
- Data Modeling: Star schema design with fact and dimension tables
- Surrogate Keys: Generated non-business keys for dimensional integrity
- Data Lineage: Tracked data flow from Bronze → Silver → Gold with metadata columns (`DW_create_date`)
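A lineage column like this can be added with a simple default, so every row is stamped at load time (the table name here is illustrative):

```sql
-- Hypothetical lineage column: record when each row entered the warehouse
ALTER TABLE silver.crm_cust_info
    ADD DW_create_date DATETIME2 DEFAULT GETDATE();
```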
**Engineering Practices**
- Naming Conventions: Consistent snake_case for all database objects
- Schema Organization: Logical separation using Bronze/Silver/Gold schemas
- Version Control: Git repository with structured folders (scripts, docs, tests, datasets)
- Documentation: Data architecture diagrams, data flow diagrams, data catalog
- Idempotency: Scripts check for object existence and recreate cleanly
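An idempotent DDL script typically checks for the object before recreating it, so the script can be rerun safely; a minimal sketch with an assumed table name:

```sql
-- Idempotent DDL: drop and recreate so the script is safe to rerun
IF OBJECT_ID('silver.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE silver.crm_cust_info;
GO

CREATE TABLE silver.crm_cust_info (
    cust_id     INT,
    first_name  NVARCHAR(50),
    gender      NVARCHAR(20),
    create_date DATE
);
```

SQL Server 2016 and later also support `DROP TABLE IF EXISTS`, which collapses the existence check into a single statement.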
**Key Lessons**
- Data Quality is Foundational: 80% of analytics challenges stem from data quality issues—deduplication, null handling, and standardization should happen at the source, not in every analysis query
- Context Matters More Than Code: Understanding why data is structured a certain way (e.g., why surrogate keys exist, why dates are enriched with `LEAD()`) is more valuable than just knowing how to write the SQL
- Collaboration Requires Shared Language: Speaking the same technical language as data engineers (ETL, pipeline, surrogate keys, data lineage) makes cross-functional work far more effective
- Upstream Thinking Improves Analysis: When I encounter unexpected results in analysis, I now instinctively ask: "Was this handled in Silver? Should it be? Or is this a legitimate business pattern?"
- Documentation is Not Optional: The data catalog and architecture diagrams were essential for understanding the project—and they're just as essential in production environments
**Acknowledgments**
This project was built following the excellent tutorial series by Data With Baraa. All credit for the instructional content, project structure, and teaching methodology goes to Baraa Khatib Salkini.
Original Tutorial:
- Course: SQL Data Warehouse Project
- Materials: Download
- GitHub: Original Repo
This README and repository represent my personal implementation and learning journey through the tutorial, documented for portfolio and educational purposes.
This project is for educational and portfolio purposes. The original tutorial and project structure were created by Data With Baraa and are used here with attribution.
This project is part of my learning journey to become a more well-rounded data analyst by understanding the full data lifecycle—from raw ingestion to analytics-ready models.
