# SmartDataFlow

## Overview

SmartDataFlow is an end-to-end data processing and clustering pipeline built with Apache Spark, Scala, Python, and Kubernetes. The project integrates a scalable clustering model with a unified data mart and an Oracle database, providing seamless data retrieval, pre-processing, and analytics.
This pipeline demonstrates how modern big data tools can be combined for efficient, modular, and production-ready data solutions.
## Features

- Data Clustering: K-Means clustering implemented on large datasets using PySpark.
- Unified Data Mart: Scala-based service offering a standardized protocol for querying and preprocessing data.
- Database Integration: Oracle 21c as the main data source (extensible to PostgreSQL, MySQL, HBase, etc.).
- Containerized Deployment: Docker and Kubernetes for isolated, scalable environments.
- End-to-End Workflow: Model → Data Mart → Database, without direct coupling between components.
## Architecture

- Oracle DB: Stores source data and serves as a central repository.
- Data Mart: Receives model requests, pre-processes data, and interacts with the database.
- Spark Model: Runs clustering tasks on the pre-processed data and returns results.
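The clustering task at the heart of the Spark model is K-Means. In the actual pipeline it runs distributed via PySpark's ML library; the algorithm itself can be illustrated with a minimal plain-Python sketch (illustration only, not the project's code):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal Lloyd's-algorithm K-Means on 2-D points.

    Illustration only: the real model runs this distributed with
    pyspark.ml.clustering.KMeans on pre-processed data from the data mart.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute each centroid as the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)]
centers = sorted(kmeans(pts, k=2))
# Two clearly separated clusters converge to centroids (1.25, 1.5) and (8.25, 8.5).
```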
## Tech Stack

- Data Processing: PySpark, Scala
- Database: Oracle 21c
- Containerization: Docker
- Orchestration: Kubernetes
- Programming Languages: Python, Scala
## Project Structure

```text
/src                  # PySpark model code
/datamart             # Scala-based data mart service
/kubernetes           # Kubernetes deployment manifests
/docker-compose.yml   # Orchestrates DB, model, and data mart
```
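A `docker-compose.yml` orchestrating the three components could look roughly like the sketch below. The service names, images, build contexts, and ports here are assumptions for illustration; the actual file in the repository may differ.

```yaml
version: "3.8"
services:
  oracle-db:                 # central data repository (image tag is illustrative)
    image: container-registry.oracle.com/database/express:21.3.0-xe
    ports:
      - "1521:1521"          # default Oracle listener port
  datamart:                  # Scala data mart service
    build: ./datamart
    ports:
      - "9000:9000"
    depends_on:
      - oracle-db
  model:                     # PySpark clustering model
    build: ./src
    ports:
      - "4040:4040"          # Spark UI
    depends_on:
      - datamart
```

Note that the model depends only on the data mart, and the data mart on the database, mirroring the decoupled Model → Data Mart → Database workflow.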
## Getting Started

- Clone the repository

```bash
git clone https://github.com/ghfranj/SmartDataFlow.git
cd SmartDataFlow
```

- Deploy all services to Kubernetes

```bash
kubectl apply -f kubernetes/
```

- Verify deployments

```bash
kubectl get deployments
```

- Check services

```bash
kubectl get services
```

- Check pods

```bash
kubectl get pods
```

- Access the Spark UI at http://localhost:4040
- Query the Data Mart API at http://localhost:9000
## Usage

Once the services are running:

1. Send a query to the data mart API to request data for clustering.
2. The Spark model processes the data and returns clustering results.
3. Results are stored back in the database or can be retrieved via the data mart service.

This workflow ensures a clean separation of responsibilities between the database, data mart, and model.
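The first step above, querying the data mart, can be sketched in Python with only the standard library. The `/query` endpoint path and the JSON payload schema are assumptions for illustration; the real request format is defined by the Scala data mart service.

```python
import json
from urllib import request

DATAMART_URL = "http://localhost:9000"  # data mart service from the deployment above

def build_clustering_request(table, feature_columns):
    """Build (but do not send) a JSON request asking the data mart for
    pre-processed data. Endpoint path and payload fields are hypothetical;
    consult the data mart service for its actual protocol."""
    payload = json.dumps({"table": table, "features": feature_columns}).encode()
    return request.Request(
        f"{DATAMART_URL}/query",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_clustering_request("customers", ["age", "income"])
# Sending it requires the services to be running:
#     response = request.urlopen(req)
```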
## Screenshots

1. Deploy Services to Kubernetes
   (Command: `kubectl apply -f kubernetes/*.yml`)

2. Deployments Running
   (Command: `kubectl get deployments`)