📊 Data Warehouse Project: Building a Modern Analytics Foundation

SQL Server · ETL Pipelines · Medallion Architecture · Star Schema Modeling

A hands-on data engineering project building a complete data warehouse from scratch using SQL Server and the Medallion Architecture (Bronze, Silver, Gold layers). This project demonstrates end-to-end data pipeline development, from raw data ingestion to business-ready analytics models.


Why I Built This Project

As a data analyst, I usually work with data that has already been cleaned, transformed, and modeled. While this allows me to focus on insights, it also means that much of the upstream logic—how data arrives in its final form and why certain decisions are made—can remain hidden.

To close this gap, I built a complete end-to-end data warehouse by following the Data With Baraa tutorial. The goal was not to become a data engineer, but to gain a deeper understanding of the processes that make reliable analytics possible.

What I Learned

Data Lineage & Business Logic

I learned how to trace data from source systems (ERP/CRM) through multiple transformation layers, understanding which business rules are applied and how they impact analytical results. This helps me better assess data reliability and explain metric behavior with proper context.

Data Quality Awareness

Working directly with raw data exposed common data quality issues such as duplicates, missing values, and inconsistencies. I now understand why these issues occur, how to detect them early, and which fields require validation before analysis.

Data Modeling for Analytics

By designing fact and dimension tables myself, I gained practical insight into surrogate keys, schema design trade-offs, and performance considerations. This allows me to write more efficient queries and suggest modeling improvements aligned with analytical use cases.

Collaboration with Data Engineers

I can now discuss ETL logic, incremental vs. full loads, pipeline performance, and integration challenges using the correct technical language, enabling more effective collaboration and clearer requirements.

Stronger Analytical Foundations

I learned how to structure reusable transformations in the Silver layer and build analyst-friendly views in the Gold layer, reducing repetitive fixes and improving consistency across reports and dashboards.

Key Takeaway

This project strengthened my analytical skill set by giving me full visibility into how data is prepared. As a result, I can:

  • Ask better analytical questions
  • Validate insights more critically
  • Communicate findings with clear data context
  • Collaborate more effectively with data engineers
  • Build more robust and maintainable analyses

🏗️ Project Architecture

This project follows the Medallion Architecture with three distinct layers:

(Data architecture diagram: Bronze → Silver → Gold layers)

Bronze Layer (Raw Data Ingestion)

  • Purpose: Store raw, unprocessed data exactly as received from source systems
  • Source Systems: ERP (sales transactions) and CRM (customer information)
  • Format: CSV files loaded via BULK INSERT
  • Key Learning: Implemented full-load ETL with TRUNCATE TABLE + BULK INSERT, error handling with TRY/CATCH, and performance measurement using DATEDIFF()
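The full-load pattern described above can be sketched as a stored procedure. This is a minimal illustration — the table name, file path, and CSV options are assumptions, not the repository's exact objects:

```sql
CREATE OR ALTER PROCEDURE bronze.load_bronze AS
BEGIN
    DECLARE @start DATETIME = GETDATE();
    BEGIN TRY
        -- Full load: wipe the table, then re-ingest the raw CSV as-is
        TRUNCATE TABLE bronze.crm_cust_info;       -- illustrative table name

        BULK INSERT bronze.crm_cust_info
        FROM 'C:\datasets\cust_info.csv'           -- illustrative path
        WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', TABLOCK);

        -- Measure pipeline duration for performance tracking
        PRINT 'Load duration: '
            + CAST(DATEDIFF(SECOND, @start, GETDATE()) AS NVARCHAR(10)) + 's';
    END TRY
    BEGIN CATCH
        PRINT 'Load failed: ' + ERROR_MESSAGE();
    END CATCH
END;
```

Because the procedure truncates before inserting, rerunning it always yields the same result — the simplest way to keep a full-load pipeline idempotent.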

Silver Layer (Cleaned & Standardized)

  • Purpose: Apply data quality transformations without changing the data model
  • Transformations Applied:
    • Deduplication: Used ROW_NUMBER() OVER (PARTITION BY...) to keep only the most recent records
    • Standardization: Replaced codes (F/M) with full descriptions (Female/Male) using CASE WHEN
    • Null Handling: Replaced missing values with business-friendly defaults ('Not Available', 0)
    • Date Enrichment: Used LEAD() window function to calculate end dates for historical records
    • Type Casting: Converted integer date fields to proper DATE types
    • Business Rules: Recalculated incorrect metrics (Sales = Quantity × Price) using ABS() and NULLIF()
  • Key Learning: Gained hands-on experience with data quality patterns I frequently encounter in analysis—now I understand how they should be fixed at the source
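Several of these transformations compose naturally into one statement. A hedged sketch, with assumed column names, combining deduplication (ROW_NUMBER), standardization (CASE), and null defaults:

```sql
-- Illustrative Silver-layer load: column and table names are assumptions
INSERT INTO silver.crm_cust_info (cst_id, cst_gender, cst_create_date)
SELECT
    cst_id,
    CASE UPPER(TRIM(cst_gender))               -- standardize coded values
        WHEN 'F' THEN 'Female'
        WHEN 'M' THEN 'Male'
        ELSE 'Not Available'                   -- business-friendly default
    END,
    cst_create_date
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY cst_id
                              ORDER BY cst_create_date DESC) AS rn
    FROM bronze.crm_cust_info
) t
WHERE rn = 1;                                  -- keep only the most recent record per customer
```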

Gold Layer (Business-Ready Analytics)

  • Purpose: Provide consumption-ready data modeled for analytical queries
  • Data Model: Star schema with fact and dimension tables
  • Implementation: SQL views (not materialized tables) for flexibility
  • Key Components:
    • Surrogate Keys: Generated using ROW_NUMBER() for dimension primary keys
    • Data Integration: Combined CRM + ERP data using LEFT JOIN and COALESCE() to pick the best source
    • Fact Table: fact_sales with foreign keys to all dimensions
    • Dimensions: dim_customers, dim_products, dim_stores, dim_date
  • Key Learning: Understood why star schemas are optimal for analytics (simplified joins, clear business context, performance)
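A dimension view combining these ideas might look like the following sketch — the surrogate key comes from ROW_NUMBER(), and COALESCE() prefers the CRM value with ERP as fallback (table and column names are illustrative):

```sql
-- Illustrative Gold-layer dimension view
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id)           AS customer_key,  -- surrogate key
    ci.cst_id                                        AS customer_id,
    COALESCE(ci.cst_gender, ca.gender, 'Not Available') AS gender,     -- CRM first, ERP fallback
    la.country                                       AS country
FROM silver.crm_cust_info ci
LEFT JOIN silver.erp_cust_info ca ON ci.cst_id = ca.cst_id
LEFT JOIN silver.erp_loc_info  la ON ci.cst_id = la.cst_id;
```

Defining this as a view rather than a materialized table means the Gold layer always reflects the latest Silver data without a separate refresh step.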

🛠️ Technical Skills Demonstrated

SQL Server & T-SQL

  • DDL (Data Definition Language): Created schemas, tables, and views with proper naming conventions (snake_case)
  • DML (Data Manipulation Language): TRUNCATE, INSERT INTO...SELECT, BULK INSERT
  • Window Functions: ROW_NUMBER(), LEAD(), PARTITION BY, ORDER BY
  • Joins: LEFT JOIN, INNER JOIN for multi-source data integration
  • Functions: TRIM(), SUBSTRING(), REPLACE(), CAST(), COALESCE(), ABS(), NULLIF(), GETDATE(), DATEDIFF()
  • Stored Procedures: Encapsulated ETL logic in reusable procedures (bronze.load_bronze, silver.load_silver)
  • Error Handling: BEGIN TRY...END TRY / BEGIN CATCH...END CATCH blocks
  • Performance Measurement: Calculated pipeline execution time for optimization

Data Engineering Concepts

  • ETL Pipelines: Extract, Transform, Load workflows with full-load strategy
  • Data Quality: Deduplication, null handling, standardization, validation
  • Data Integration: Combining multiple source systems into unified dimensions
  • Data Modeling: Star schema design with fact and dimension tables
  • Surrogate Keys: Generated non-business keys for dimensional integrity
  • Data Lineage: Tracked data flow from Bronze → Silver → Gold with metadata columns (DW_create_date)

Best Practices

  • Naming Conventions: Consistent snake_case for all database objects
  • Schema Organization: Logical separation using Bronze/Silver/Gold schemas
  • Version Control: Git repository with structured folders (scripts, docs, tests, datasets)
  • Documentation: Data architecture diagrams, data flow diagrams, data catalog
  • Idempotency: Scripts check for object existence and recreate cleanly
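The idempotency pattern is simple but worth showing — each DDL script checks for the object before recreating it, so the whole project can be rebuilt from scratch at any time (names below are illustrative):

```sql
-- Idempotent DDL: drop and recreate so the script can be rerun safely
IF OBJECT_ID('silver.crm_cust_info', 'U') IS NOT NULL
    DROP TABLE silver.crm_cust_info;

CREATE TABLE silver.crm_cust_info (
    cst_id          INT,
    cst_gender      NVARCHAR(20),
    cst_create_date DATE,
    dw_create_date  DATETIME2 DEFAULT GETDATE()  -- lineage metadata column
);
```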

🎓 Key Takeaways

What This Project Taught Me:

  1. Data Quality is Foundational: 80% of analytics challenges stem from data quality issues—deduplication, null handling, and standardization should happen at the source, not in every analysis query

  2. Context Matters More Than Code: Understanding why data is structured a certain way (e.g., why surrogate keys exist, why dates are enriched with LEAD()) is more valuable than just knowing how to write the SQL

  3. Collaboration Requires Shared Language: Speaking the same technical language as data engineers (ETL, pipeline, surrogate keys, data lineage) makes cross-functional work far more effective

  4. Upstream Thinking Improves Analysis: When I encounter unexpected results in analysis, I now instinctively think: "Was this handled in Silver? Should it be? Or is this a legitimate business pattern?"

  5. Documentation is Not Optional: The data catalog and architecture diagrams were essential for understanding the project—and they're just as essential in production environments


🙏 Acknowledgments

This project was built following the excellent tutorial series by Data With Baraa. All credit for the instructional content, project structure, and teaching methodology goes to Baraa Khatib Salkini.

Original Tutorial:

This README and repository represent my personal implementation and learning journey through the tutorial, documented for portfolio and educational purposes.


📫 Connect With Me

LinkedIn · Email · GitHub


📝 License

This project is for educational and portfolio purposes. The original tutorial and project structure are created by Data With Baraa and are used here with proper attribution for learning purposes.


This project is part of my learning journey to become a more well-rounded data analyst by understanding the full data lifecycle—from raw ingestion to analytics-ready models.
