I’m Hrishikesh Gawde, a recent Computer Science graduate with a strong interest in data engineering. My curiosity about distributed systems started in college, when I came across an MIT OpenCourseWare playlist on the topic on YouTube. While exploring career paths around distributed systems, I discovered data engineering and became interested in how data moves and transforms across systems. That interest led me to build multiple end-to-end data engineering projects that reflect real industry practices.
I’ve worked on projects that cover both batch and streaming data pipelines, touching on use cases such as CDC ingestion, SCD2 merges, event-driven loads, and lakehouse patterns like the Medallion Architecture. I’ve also focused on testing, automation, and workflow orchestration. These projects are built with modern tools and frameworks commonly used in the data industry.
My goal was to go beyond basic tutorials and build projects that demonstrate an understanding of data engineering workflows across different business domains. I wanted to cover multiple industry use cases, explore modern tools, and create pipelines that include monitoring, testing, and automation layers. The projects cover batch and streaming processing, distributed computing, data warehousing, real-time analytics, CDC ingestion, SCD2 merges, data lakehouse architectures, and workflow orchestration. They span cloud platforms like AWS, GCP, and Databricks, and make use of open table formats such as Iceberg, Hudi, and Delta Lake.
Each project reflects a specific learning goal: working with streaming data, handling change data capture, performing SCD2 merges, or implementing data lakehouse patterns.
Below are my data engineering projects. Each project has its own repository with source code and documentation.
✅ 1. Flight Booking Data Pipeline with Airflow and CI/CD
Tech Stack: GitHub, GitHub Actions, Google Cloud Storage, PySpark, Dataproc Serverless, Airflow, BigQuery
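As a rough illustration of the orchestration layer, here is a minimal Airflow DAG sketch that submits a PySpark job as a Dataproc Serverless batch. The project ID, bucket paths, and schedule are placeholders rather than the values used in the repository, and the GitHub Actions CI/CD that deploys the DAG and job code is omitted.

```python
# Minimal Airflow DAG sketch: submit a PySpark transformation as a
# Dataproc Serverless batch. All IDs and paths are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

with DAG(
    dag_id="flight_booking_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform_bookings = DataprocCreateBatchOperator(
        task_id="transform_bookings",
        project_id="my-gcp-project",                  # placeholder project
        region="us-central1",
        batch_id="flight-bookings-{{ ds_nodash }}",
        batch={
            "pyspark_batch": {
                # The PySpark job reads raw bookings from GCS, transforms them,
                # and writes the results to BigQuery (connector config omitted).
                "main_python_file_uri": "gs://my-bucket/jobs/transform_bookings.py",
            },
        },
    )
```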
✅ 2. Event Driven Incremental Ingestion Pipeline for Order Tracking
Tech Stack: Google Cloud Storage, PySpark, Databricks, Delta Lake, Databricks Workflows, GitHub
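One common way to implement event-driven incremental ingestion on Databricks is Auto Loader; whether the repository relies on file-arrival triggers in Workflows or on Auto Loader itself, the incremental pattern looks roughly like the sketch below. Paths and table names are placeholders.

```python
# Sketch: incrementally ingest newly arrived order files with Auto Loader
# and append them to a Delta table. Paths and table names are placeholders.
orders = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "gs://my-bucket/_schemas/orders")
    .load("gs://my-bucket/landing/orders/")
)

(
    orders.writeStream
    .option("checkpointLocation", "gs://my-bucket/_checkpoints/orders")
    .trigger(availableNow=True)   # process only files added since the last run, then stop
    .toTable("bronze.orders")
)
```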
✅ 3. UPI Transactions Real Time CDC Feed Processing
Tech Stack: Databricks, Spark Structured Streaming, Delta Lake
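A minimal sketch of the CDC consumption pattern, assuming the raw UPI transactions land in a Delta table with Change Data Feed enabled. The table names, key column, and checkpoint path are placeholders, and delete handling is omitted for brevity.

```python
# Sketch: stream the Change Data Feed of a raw Delta table and upsert each
# micro-batch into a curated table. Names are placeholders; deletes omitted.
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Drop CDF pre-images; a production job would also keep only the latest
    # _commit_version per key before merging.
    changes = batch_df.filter("_change_type != 'update_preimage'") \
                      .dropDuplicates(["transaction_id"])
    target = DeltaTable.forName(spark, "silver.upi_transactions")
    (
        target.alias("t")
        .merge(changes.alias("s"), "t.transaction_id = s.transaction_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")
    .table("bronze.upi_transactions_raw")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/upi_cdc")
    .start()
)
```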
✅ 4. Travel Bookings Data Ingestion Pipeline With SCD2 Merge
Tech Stack: Databricks, PySpark, Delta Lake, Delta Live Tables
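The repository runs this as a Delta Live Tables job; the sketch below shows the equivalent SCD Type 2 logic with plain Delta Lake APIs, using placeholder table and column names. Changed records are expired first, then the new versions are appended.

```python
# Sketch of an SCD Type 2 merge on a Delta dimension table.
# Table and column names are illustrative placeholders.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

dim = DeltaTable.forName(spark, "silver.dim_bookings")
updates = spark.table("bronze.bookings_incoming") \
               .withColumn("effective_from", F.current_timestamp())

# Step 1: expire current rows whose tracked attributes changed.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.booking_id = s.booking_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.amount <> s.amount OR t.status <> s.status",
        set={"is_current": "false", "effective_to": "s.effective_from"},
    )
    .execute()
)

# Step 2: append a new current version for new keys and for keys just expired.
still_current = spark.table("silver.dim_bookings") \
                     .filter("is_current = true").select("booking_id")
new_versions = (
    updates.join(still_current, "booking_id", "left_anti")
    .withColumn("is_current", F.lit(True))
    .withColumn("effective_to", F.lit(None).cast("timestamp"))
)
new_versions.write.format("delta").mode("append").saveAsTable("silver.dim_bookings")
```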
✅ 5. Healthcare Delta Live Table Pipeline with Medallion Architecture
Tech Stack: Databricks, PySpark, Delta Lake, Delta Live Tables
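A minimal Delta Live Tables sketch of the Medallion flow (bronze → silver → gold). The source path, columns, and data-quality expectation are placeholders, not the actual healthcare schema.

```python
# Sketch of a DLT pipeline with bronze/silver/gold layers.
# Paths, columns, and expectations are illustrative placeholders.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw healthcare events ingested as-is (bronze).")
def bronze_events():
    return spark.read.json("/mnt/landing/healthcare/")

@dlt.table(comment="Cleaned, validated records (silver).")
@dlt.expect_or_drop("valid_patient", "patient_id IS NOT NULL")
def silver_events():
    return dlt.read("bronze_events").withColumn("admitted_at", F.to_timestamp("admitted_at"))

@dlt.table(comment="Daily admissions per department (gold).")
def gold_daily_admissions():
    return (
        dlt.read("silver_events")
        .groupBy(F.to_date("admitted_at").alias("day"), "department")
        .count()
    )
```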
✅ 6. News Data Analysis with Event-Driven Incremental Load in Snowflake Table
Tech Stack: Airflow, Google Cloud Storage, Python, Snowflake
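A minimal sketch of the Snowflake load step, assuming the news files land in GCS behind an external stage; in the pipeline this would run inside an Airflow task. The account, stage, and table names are placeholders, and credentials would come from a secrets backend in practice.

```python
# Sketch: load newly arrived files from a GCS-backed external stage into
# Snowflake. COPY INTO skips files it has already loaded, so repeated runs
# only pick up new data. All names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="...",   # use a secrets manager
    warehouse="LOAD_WH", database="NEWS_DB", schema="RAW",
)
try:
    cur = conn.cursor()
    cur.execute("""
        COPY INTO RAW.NEWS_ARTICLES
        FROM @GCS_NEWS_STAGE
        FILE_FORMAT = (TYPE = 'PARQUET')
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)
finally:
    conn.close()
```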
✅ 7. Movie Bookings Real Time CDC Data Pipeline with Medallion Architecture in Snowflake
Tech Stack: Python, Snowflake Dynamic Tables, Snowflake Streams, Snowflake Tasks, Streamlit
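A minimal sketch of the stream-and-task plumbing that drives CDC-style propagation between layers. Object names and the schedule are placeholders, and the Dynamic Table and Streamlit layers are omitted.

```python
# Sketch: create a stream on the bronze bookings table and a task that
# periodically pushes pending changes downstream. Names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl", password="...",
    warehouse="ETL_WH", database="MOVIES_DB", schema="PIPELINE",
)
try:
    cur = conn.cursor()
    # The stream records inserts, updates, and deletes on the bronze table.
    cur.execute("CREATE OR REPLACE STREAM BOOKINGS_STREAM ON TABLE BRONZE.BOOKINGS")
    # The task runs only when the stream actually has pending changes.
    cur.execute("""
        CREATE OR REPLACE TASK PROCESS_BOOKINGS
          WAREHOUSE = ETL_WH
          SCHEDULE = '1 MINUTE'
          WHEN SYSTEM$STREAM_HAS_DATA('BOOKINGS_STREAM')
        AS
          INSERT INTO SILVER.BOOKINGS
          SELECT BOOKING_ID, USER_ID, AMOUNT, METADATA$ACTION FROM BOOKINGS_STREAM
    """)
    cur.execute("ALTER TASK PROCESS_BOOKINGS RESUME")
finally:
    conn.close()
```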
✅ 8. Car Rental Data Batch Ingestion with SCD2 Merge in Snowflake Table
Tech Stack: Python, PySpark, GCP Dataproc, Airflow, Snowflake
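The SCD2 merge itself happens in Snowflake after PySpark stages the data on Dataproc; below is a minimal sketch of that step, with placeholder tables and tracked columns.

```python
# Sketch: two-step SCD Type 2 in Snowflake - expire changed current rows,
# then insert the new current versions. All object names are placeholders.
import snowflake.connector

EXPIRE_CHANGED = """
MERGE INTO DIM_CUSTOMERS t
USING STG_CUSTOMERS s
  ON t.CUSTOMER_ID = s.CUSTOMER_ID AND t.IS_CURRENT = TRUE
WHEN MATCHED AND (t.ADDRESS <> s.ADDRESS OR t.PHONE <> s.PHONE) THEN
  UPDATE SET IS_CURRENT = FALSE, EFFECTIVE_TO = CURRENT_TIMESTAMP()
"""

INSERT_NEW_VERSIONS = """
INSERT INTO DIM_CUSTOMERS (CUSTOMER_ID, ADDRESS, PHONE, IS_CURRENT, EFFECTIVE_FROM)
SELECT s.CUSTOMER_ID, s.ADDRESS, s.PHONE, TRUE, CURRENT_TIMESTAMP()
FROM STG_CUSTOMERS s
LEFT JOIN DIM_CUSTOMERS t
  ON t.CUSTOMER_ID = s.CUSTOMER_ID AND t.IS_CURRENT = TRUE
WHERE t.CUSTOMER_ID IS NULL
"""

conn = snowflake.connector.connect(account="my_account", user="etl", password="...",
                                    warehouse="ETL_WH", database="RENTALS", schema="DW")
try:
    cur = conn.cursor()
    cur.execute(EXPIRE_CHANGED)       # step 1: close out changed rows
    cur.execute(INSERT_NEW_VERSIONS)  # step 2: add the new current versions
finally:
    conn.close()
```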
✅ 9. IRCTC Streaming Data Ingestion into BigQuery
Tech Stack: Python, Google Cloud Storage, GCP Pub/Sub, BigQuery, Dataflow
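A minimal Apache Beam sketch of the streaming path, assuming train-status events arrive as JSON on a Pub/Sub topic; the same pipeline runs on Dataflow with the DataflowRunner. The topic, table, and schema are placeholders.

```python
# Sketch: streaming Beam pipeline that reads JSON events from Pub/Sub and
# appends them to BigQuery. Topic, table, and schema are placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/irctc-events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:transport.train_events",
            schema="train_no:STRING,station:STRING,event_ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```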
✅ 10. Walmart Data Ingestion in BigQuery
Tech Stack: Python, Airflow, Google Cloud Storage, BigQuery
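A minimal Airflow sketch of the load: a scheduled DAG that moves the day's files from GCS into a BigQuery table, with placeholder bucket, dataset, and table names.

```python
# Sketch: daily GCS-to-BigQuery load orchestrated by Airflow.
# Bucket, object path, and table names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="walmart_sales_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_daily_sales = GCSToBigQueryOperator(
        task_id="load_daily_sales",
        bucket="my-retail-bucket",
        source_objects=["walmart/sales/{{ ds }}/*.json"],
        destination_project_dataset_table="retail.walmart_sales",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )
```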
✅ 11. Ad Tech Real Time Data Analysis
Tech Stack: Python, AWS Kinesis, AWS Managed Flink, AWS Glue, Spark Streaming, Apache Iceberg, AWS S3, Glue Catalog, AWS Athena
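A minimal sketch of the Spark Structured Streaming leg, assuming ad events have already been landed in S3 by the Kinesis/Flink ingest layer and that a Glue-backed Iceberg catalog named `glue_catalog` is configured on the cluster. Paths, schema, and table names are placeholders.

```python
# Sketch: stream landed ad events from S3 and append them to an Iceberg table
# registered in the Glue Catalog. Paths, schema, and names are placeholders.
from pyspark.sql import functions as F

events = (
    spark.readStream.format("json")
    .schema("campaign_id STRING, impressions INT, clicks INT, event_ts TIMESTAMP")
    .load("s3://my-adtech-bucket/landing/events/")
)

(
    events.withColumn("event_date", F.to_date("event_ts"))
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-adtech-bucket/_checkpoints/ad_events")
    .toTable("glue_catalog.adtech.ad_events")
)
```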
✅ 12. Credit Card Transaction Analysis for Fraud Risk
Tech Stack: Python, PySpark, Google Cloud Storage, Dataproc Serverless, BigQuery, Cloud Composer (Airflow), PyTest, GitHub, GitHub Actions
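To illustrate the testing layer, here is a small PyTest sketch that exercises a hypothetical fraud-flagging transform on a local SparkSession; the function, threshold, and columns are illustrative, and in the pipeline such tests run in CI via GitHub Actions.

```python
# Sketch: PyTest for a PySpark transform on a local SparkSession.
# flag_high_risk and its threshold are hypothetical, for illustration only.
import pytest
from pyspark.sql import SparkSession, functions as F

def flag_high_risk(df, threshold=10_000):
    """Mark transactions above the threshold as high risk."""
    return df.withColumn("high_risk", F.col("amount") > threshold)

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_flag_high_risk(spark):
    df = spark.createDataFrame([("t1", 15000.0), ("t2", 500.0)], ["txn_id", "amount"])
    result = {row["txn_id"]: row["high_risk"] for row in flag_high_risk(df).collect()}
    assert result == {"t1": True, "t2": False}
```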
Feel free to explore my project repositories. Each one includes source code and detailed documentation. For any opportunities or discussions related to data engineering roles, you can reach me at:
📍 Mumbai, Maharashtra
📞 +91 9309268556