hrishithub/README.md

🧑‍💻 About Me

I’m Hrishikesh Gawde, a recent Computer Science graduate with a strong interest in data engineering. I became curious about distributed systems in college after coming across a YouTube playlist on the topic by MIT OpenCourseWare. While exploring career paths related to distributed systems, I discovered data engineering and became interested in how data moves and transforms across systems. That interest led me to build multiple end-to-end data engineering projects that reflect real industry practices.

I’ve worked on projects that cover both batch and streaming data pipelines, with use cases such as CDC ingestion, SCD2 merges, event-driven loads, and lakehouse patterns like the Medallion Architecture. I’ve also focused on testing, automation, and workflow orchestration. These projects are built with modern tools and frameworks commonly used in the data industry.

🎯 Why I Built These Projects

My goal was to go beyond basic tutorials and build projects that show an understanding of data engineering workflows in different business domains. I wanted to cover multiple industry use cases, explore modern tools and create pipelines that include monitoring, testing, and automation layers. The projects cover batch and streaming processing, distributed computing, data warehousing, real-time analytics, CDC ingestion, SCD2 merges, data lakehouse architectures, and workflow orchestration. They span cloud platforms like AWS, GCP, and Databricks, and make use of open table formats like Iceberg, Hudi, and Delta Lake.

Each project reflects a specific learning goal: working with streaming data, handling change data capture, performing SCD2 merges or implementing data lakehouse patterns.

📂 Projects

Below are my data engineering projects. Each project has its own repository with source code and documentation.

✅ 1. Flight Booking Data Pipeline with Airflow and CI/CD

Tech Stack: GitHub, GitHub Actions, Google Cloud Storage, PySpark, Dataproc Serverless, Airflow, BigQuery

✅ 2. Event Driven Incremental Ingestion Pipeline for Order Tracking

Tech Stack: Google Cloud Storage, PySpark, Databricks, Delta Lake, Databricks Workflows, GitHub

✅ 3. UPI Transactions Real Time CDC Feed Processing

Tech Stack: Databricks, Spark Structured Streaming, Delta Lake

✅ 4. Travel Bookings Data Ingestion Pipeline With SCD2 Merge

Tech Stack: Databricks, PySpark, Delta Lake, Delta Live Tables

✅ 5. Healthcare Delta Live Table Pipeline with Medallion Architecture

Tech Stack: Databricks, PySpark, Delta Lake, Delta Live Tables

✅ 6. News Data Analysis with Event-Driven Incremental Load in Snowflake Table

Tech Stack: Airflow, Google Cloud Storage, Python, Snowflake

✅ 7. Movie Bookings Real Time CDC Data Pipeline with Medallion Architecture in Snowflake

Tech Stack: Python, Snowflake Dynamic Table, Snowflake Stream, Snowflake Tasks, Streamlit

✅ 8. Car Rental Data Batch Ingestion with SCD2 Merge in Snowflake Table

Tech Stack: Python, PySpark, GCP Dataproc, Airflow, Snowflake

✅ 9. IRCTC Streaming Data Ingestion into BigQuery

Tech Stack: Python, Google Cloud Storage, Google Cloud Pub/Sub, BigQuery, Dataflow

✅ 10. Walmart Data Ingestion in BigQuery

Tech Stack: Python, Airflow, Google Cloud Storage, BigQuery

✅ 11. Ad Tech Real Time Data Analysis

Tech Stack: Python, AWS Kinesis, AWS Managed Flink, AWS Glue, Spark Streaming, Apache Iceberg, AWS S3, Glue Catalog, AWS Athena

✅ 12. Credit Card Transaction Analysis for Fraud Risk

Tech Stack: Python, PySpark, Google Cloud Storage, GCP Dataproc Serverless, GCP BigQuery, GCP Composer (Airflow), PyTest, GitHub, GitHub Actions

Feel free to explore my project repositories. Each one includes source code and detailed documentation. For any opportunities or discussions related to data engineering roles, you can reach me at:

📍 Mumbai, Maharashtra

📞 +91 9309268556

📧 hrishikesh.workmail@gmail.com

🔗 LinkedIn | GitHub

Pinned

  1. Ad-Tech-Real-Time-Data-Analysis (Public)

    Real-time ad analytics pipeline using AWS Kinesis, Apache Flink, and Iceberg. Processes ad impressions and clicks, performs stream joins, enriches data, and enables querying via Athena for campaign…

    Python

  2. Car-Rental-Data-Batch-Ingestion-with-SCD2-Merge-in-Snowflake (Public)

    Batch data pipeline for car rentals using GCP and Snowflake. Ingests daily data from GCS, processes with PySpark on Dataproc, applies SCD2 merge for customer dimension, and orchestrates the workflo…

    Python

  3. Credit-Card-Transaction-Analysis-for-Fraud-Risk (Public)

    This project demonstrates a modern, automated ETL pipeline built to analyze credit card transactions for fraud detection and risk scoring. It leverages Google Cloud Platform (GCP) services, PySpark…

    Python 1

  4. Movie-Booking-Real-Time-CDC-Pipeline-with-Medallion-Architecture-in-Snowflake (Public)

    Built a real-time CDC pipeline for movie bookings using Snowflake Streams, Dynamic Tables, and Tasks with Medallion Architecture. Final insights are delivered via a Streamlit dashboard for web-base…

    Python