# SmartDataFlow

## Overview

SmartDataFlow is an end-to-end data processing and clustering pipeline built with Apache Spark, Scala, Python, and Kubernetes. The project integrates a scalable clustering model with a unified data mart and an Oracle database, providing seamless data retrieval, pre-processing, and analytics.
This pipeline demonstrates how modern big data tools can be combined for efficient, modular, and production-ready data solutions.
## Features

- Data Clustering: K-Means clustering implemented on large datasets using PySpark.
- Unified Data Mart: Scala-based service offering a standardized protocol for querying and preprocessing data.
- Database Integration: Oracle 21c as the main data source (extensible to PostgreSQL, MySQL, HBase, etc.).
- Containerized Deployment: Docker and Kubernetes for isolated, scalable environments.
- End-to-End Workflow: Model → Data Mart → Database, without direct coupling between components.
## Architecture

- Oracle DB: Stores source data and serves as a central repository.
- Data Mart: Receives model requests, pre-processes data, and interacts with the database.
- Spark Model: Runs clustering tasks on the pre-processed data and returns results.
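The clustering task at the heart of the Spark model is K-Means. In the actual pipeline it runs distributed via PySpark's ML library; the algorithm itself can be illustrated with a minimal plain-Python sketch (illustration only, not the project's code):

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Minimal Lloyd's-algorithm K-Means on 2-D points.

    Illustration only: the real model runs this distributed with
    pyspark.ml.clustering.KMeans on pre-processed data from the data mart.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute each centroid as the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)]
centers = sorted(kmeans(pts, k=2))
# Two clearly separated clusters converge to centroids (1.25, 1.5) and (8.25, 8.5).
```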
## Tech Stack

- Data Processing: PySpark, Scala
- Database: Oracle 21c
- Containerization: Docker
- Orchestration: Kubernetes
- Programming Languages: Python, Scala
## Project Structure

```text
/src                  # PySpark model code
/datamart             # Scala-based data mart service
/kubernetes           # Kubernetes deployment manifests
/docker-compose.yml   # Orchestrates DB, model, and data mart
```
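A `docker-compose.yml` orchestrating the three components could look roughly like the sketch below. The service names, images, build contexts, and ports here are assumptions for illustration; the actual file in the repository may differ.

```yaml
version: "3.8"
services:
  oracle-db:                 # central data repository (image tag is illustrative)
    image: container-registry.oracle.com/database/express:21.3.0-xe
    ports:
      - "1521:1521"          # default Oracle listener port
  datamart:                  # Scala data mart service
    build: ./datamart
    ports:
      - "9000:9000"
    depends_on:
      - oracle-db
  model:                     # PySpark clustering model
    build: ./src
    ports:
      - "4040:4040"          # Spark UI
    depends_on:
      - datamart
```

Note that the model depends only on the data mart, and the data mart on the database, mirroring the decoupled Model → Data Mart → Database workflow.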
## Getting Started

- Clone the repository

```bash
git clone https://github.com/ghfranj/SmartDataFlow.git
cd SmartDataFlow
```

- Deploy all services to Kubernetes

```bash
kubectl apply -f kubernetes/
```

- Verify deployments

```bash
kubectl get deployments
```

- Check services

```bash
kubectl get services
```

- Check pods

```bash
kubectl get pods
```

- Access the Spark UI at http://localhost:4040
- Query the Data Mart API at http://localhost:9000
## Usage

Once the services are running:

1. Send a query to the data mart API to request data for clustering.
2. The Spark model processes the data and returns clustering results.
3. Results are stored back in the database or can be retrieved via the data mart service.

This workflow ensures a clean separation of responsibilities between the database, data mart, and model.
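The first step above, querying the data mart, can be sketched in Python with only the standard library. The `/query` endpoint path and the JSON payload schema are assumptions for illustration; the real request format is defined by the Scala data mart service.

```python
import json
from urllib import request

DATAMART_URL = "http://localhost:9000"  # data mart service from the deployment above

def build_clustering_request(table, feature_columns):
    """Build (but do not send) a JSON request asking the data mart for
    pre-processed data. Endpoint path and payload fields are hypothetical;
    consult the data mart service for its actual protocol."""
    payload = json.dumps({"table": table, "features": feature_columns}).encode()
    return request.Request(
        f"{DATAMART_URL}/query",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_clustering_request("customers", ["age", "income"])
# Sending it requires the services to be running:
#     response = request.urlopen(req)
```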
## Screenshots

1. Deploy Services to Kubernetes
   (Command: `kubectl apply -f kubernetes/*.yml`)

2. Deployments Running
   (Command: `kubectl get deployments`)