ghfranj/SmartDataFlow
SmartDataFlow: Scalable Clustering & Data Mart on Spark

Overview

SmartDataFlow is an end-to-end data processing and clustering pipeline built using Apache Spark, Scala, Python, and Kubernetes. The project integrates a scalable clustering model with a unified data mart and an Oracle database, providing seamless data retrieval, pre-processing, and analytics.

This pipeline demonstrates how modern big data tools can be combined for efficient, modular, and production-ready data solutions.


Key Features

  • Data Clustering: K-Means clustering implemented on large datasets using PySpark.
  • Unified Data Mart: Scala-based service offering a standardized protocol for querying and preprocessing data.
  • Database Integration: Oracle 21c as the main data source (extensible to PostgreSQL, MySQL, HBase, etc.).
  • Containerized Deployment: Docker and Kubernetes for isolated, scalable environments.
  • End-to-End Workflow: Model → Data Mart → Database, without direct coupling between components.
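The repository implements the clustering feature above with PySpark on large datasets; as a self-contained illustration of the underlying K-Means algorithm (not the repository's actual code — the data and parameters here are invented for the example), a minimal plain-Python sketch:

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal K-Means: returns (centroids, labels) for a list of tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        # by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, labels

# Two well-separated blobs should end up in two distinct clusters.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (9.0, 9.1), (9.2, 9.0), (9.1, 9.2)]
centroids, labels = kmeans(data, k=2)
```

In the pipeline itself this step runs distributed: PySpark's MLlib provides an equivalent estimator so the assignment and update steps are executed across Spark executors rather than in a single Python loop.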

Architecture

  • Oracle DB: Stores source data and serves as a central repository.
  • Data Mart: Receives model requests, pre-processes data, and interacts with the database.
  • Spark Model: Runs clustering tasks on the pre-processed data and returns results.

Tech Stack

  • Data Processing: PySpark, Scala
  • Database: Oracle 21c
  • Containerization: Docker
  • Orchestration: Kubernetes
  • Programming Languages: Python, Scala

Project Structure

/src        # PySpark model code
/datamart   # Scala-based data mart service
/kubernetes # Kubernetes deployment manifests
/docker-compose.yml # Orchestrates DB, model, and data mart
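The docker-compose.yml listed above wires the three services together. A minimal sketch of what such a file can look like — the service names, images, and port mappings here are illustrative assumptions, not the repository's actual configuration:

```yaml
services:
  oracle-db:            # Oracle 21c, the central data source
    image: container-registry.oracle.com/database/express:21.3.0-xe
    ports:
      - "1521:1521"
  datamart:             # Scala-based data mart service
    build: ./datamart
    ports:
      - "9000:9000"
    depends_on:
      - oracle-db
  spark-model:          # PySpark clustering model
    build: ./src
    ports:
      - "4040:4040"     # Spark UI
    depends_on:
      - datamart
```

The `depends_on` chain mirrors the Model → Data Mart → Database flow: the model only talks to the data mart, never directly to the database.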

Getting Started

  1. Clone the repository
     git clone https://github.com/ghfranj/SmartDataFlow.git
     cd SmartDataFlow
  2. Deploy all services to Kubernetes
     kubectl apply -f kubernetes/
  3. Verify deployments
     kubectl get deployments
  4. Check services
     kubectl get services
  5. Check pods
     kubectl get pods
  6. Access the Spark UI at http://localhost:4040
  7. Query the data mart API at http://localhost:9000

Usage Example

Once the services are running:

  1. Send a query to the data mart API to request data for clustering.
  2. The Spark model processes the data and returns clustering results.
  3. Results are stored back or can be retrieved via the data mart service.

This workflow ensures clean separation of responsibilities between the database, data mart, and model.
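The three steps above can be sketched from the model's side. This is a minimal client sketch, assuming the data mart exposes a JSON API on port 9000; the endpoint path, request fields, and response shape below are illustrative assumptions, not the service's documented contract:

```python
import json

DATAMART_URL = "http://localhost:9000"  # data mart service (assumed JSON API)

def build_cluster_request(table, features, k):
    """Build a JSON body asking the data mart for preprocessed data."""
    return json.dumps({"table": table, "features": features, "k": k})

def parse_cluster_response(body):
    """Extract (rows, assignments) from a data mart response body."""
    payload = json.loads(body)
    return payload["rows"], payload.get("assignments", [])

# Offline round-trip with a sample response, standing in for an HTTP call
# such as: requests.post(f"{DATAMART_URL}/query", data=request_body)
request_body = build_cluster_request("CUSTOMERS", ["age", "income"], k=3)
sample_response = json.dumps(
    {"rows": [[34, 52000], [29, 48000]], "assignments": [0, 1]}
)
rows, assignments = parse_cluster_response(sample_response)
```

Because the model only ever sees this request/response contract, the database behind the data mart can be swapped (e.g. Oracle for PostgreSQL) without touching the clustering code.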


Screenshots of Running Services

1. Deploy services to Kubernetes (command: kubectl apply -f kubernetes/*.yml)

2. Deployments running (command: kubectl get deployments)

3. Services running (command: kubectl get services)

4. Pods running (command: kubectl get pods)
