# PythonWorkflow

A simple, modular framework for building data processing pipelines in Python. It demonstrates how to structure and orchestrate data processing workflows from reusable components.
This framework is designed to showcase:
- **Pipeline Orchestration**: How to organize and execute data processing steps
- **Modular Components**: Reusable data processing components
- **Configuration-Driven Workflows**: YAML-based pipeline definitions
- **Simple Architecture**: An easy-to-understand 3-stage processing pattern
## Architecture

```
┌─────────────────┬─────────────────┬─────────────────┐
│   Pre-Process   │     Process     │  Post-Process   │
├─────────────────┼─────────────────┼─────────────────┤
│ • Data Loading  │ • Transformation│ • Report Gen    │
│ • Data Cleaning │ • Analysis      │ • Data Export   │
│ • Validation    │ • Aggregation   │ • Archiving     │
└─────────────────┴─────────────────┴─────────────────┘
```
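To make the pattern concrete, here is a minimal sketch (not the framework's actual runner) that chains the three stages by hand, using the component scripts and flags shown later in this README:

```python
# Minimal sketch of the 3-stage pattern: each stage is a standalone
# script that reads the previous stage's output file. Paths and flags
# mirror the examples in this README; error handling is simplified.
import subprocess
import sys

STAGES = [
    ["python", "src/1_stage_pre_process/data_loader.py",
     "--generate-sample", "--output-path", "data/sample.csv"],
    ["python", "src/2_stage_process/feature_engineering.py",
     "--input-path", "data/sample.csv",
     "--output-path", "data/transformed.csv",
     "--add-calculations", "--format-data"],
    ["python", "src/3_stage_post_process/simple_reporter.py",
     "--input-path", "data/transformed.csv",
     "--output-path", "data/report.html",
     "--format", "html"],
]

for cmd in STAGES:
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"Stage failed: {cmd[1]}")
```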
## Components

- **Data Loader**: Load data from various sources (CSV, JSON, Excel)
- **Data Validator**: Validate data quality and schema
- **Data Transformer**: Apply calculations, aggregations, and formatting
- **Data Analyzer**: Generate statistics and insights
- **Report Generator**: Create HTML/Markdown reports
- **Data Exporter**: Export results in multiple formats
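Each component is a standalone script with its own command-line interface. A hypothetical skeleton of what a component such as `data_loader.py` might look like (the real components may differ; the columns below are illustrative only):

```python
# Hypothetical component skeleton illustrating the "standalone script
# with CLI flags" convention used throughout this framework.
import argparse
import csv
import random

def generate_sample(path: str, rows: int = 100) -> None:
    """Write a small synthetic dataset so downstream stages have input."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "category", "value"])
        for i in range(rows):
            writer.writerow([i, random.choice("ABC"), random.randint(1, 100)])

def main() -> None:
    parser = argparse.ArgumentParser(description="Load or generate data")
    parser.add_argument("--generate-sample", action="store_true")
    parser.add_argument("--input-path")
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    if args.generate_sample:
        generate_sample(args.output_path)
    # A real loader would also read args.input_path, clean it, etc.

if __name__ == "__main__":
    main()
```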
## Quick Start

Install the dependencies and run the sample pipeline:

```bash
pip install -r requirements.txt
python run_pipeline.py pipelines/simple_pipeline.yaml
```

The pipeline will create:

- `data/processed/` - Intermediate processing files
- `data/results/` - Analysis reports
- `data/final/` - Final exported data and archive
## Project Structure

```
PythonWorkflow/
├── run_pipeline.py                 # Pipeline runner
├── requirements.txt                # Dependencies
├── README.md                       # This file
├── pipelines/                      # Pipeline configurations
│   ├── simple_pipeline.yaml        # Basic processing pipeline
│   └── pipeline_sample_e2e.yaml    # Legacy complex pipeline
└── src/                            # Processing components
    ├── 1_stage_pre_process/        # Data loading & validation
    │   ├── data_loader.py          # Load and clean data
    │   └── data_validator.py       # Validate data quality
    ├── 2_stage_process/            # Data transformation
    │   ├── feature_engineering.py  # Data transformations
    │   └── data_analyzer.py        # Statistical analysis
    └── 3_stage_post_process/       # Results & export
        ├── simple_reporter.py      # Generate reports
        └── data_exporter.py        # Export to formats
```
## Running Components Individually

Each stage can also be run on its own:

```bash
# Load sample data
python src/1_stage_pre_process/data_loader.py \
    --generate-sample \
    --output-path data/sample.csv

# Transform the data
python src/2_stage_process/feature_engineering.py \
    --input-path data/sample.csv \
    --output-path data/transformed.csv \
    --add-calculations \
    --format-data

# Generate a report
python src/3_stage_post_process/simple_reporter.py \
    --input-path data/transformed.csv \
    --output-path data/report.html \
    --format html
```

## Creating a Custom Pipeline

Create a YAML file with your workflow:

```yaml
name: "My Custom Pipeline"
description: "Custom data processing workflow"
steps:
- id: "load"
component: "src/1_stage_pre_process/data_loader.py"
parameters:
--input-path: "my_data.csv"
--output-path: "data/loaded.csv"
--clean-data: true
- id: "process"
component: "src/2_stage_process/feature_engineering.py"
depends_on: ["load"]
parameters:
--input-path: "data/loaded.csv"
--output-path: "data/processed.csv"
--add-calculations: true- Load CSV, JSON, Excel files
- Generate sample datasets
- Basic data cleaning
- Remove duplicates and empty rows
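As a minimal illustration of the cleaning steps above (assuming pandas is available; this is not necessarily the loader's actual code):

```python
# Illustrative cleaning step, assuming pandas: drop duplicate rows and
# rows where every column is empty, then write the cleaned file.
import pandas as pd

df = pd.read_csv("data/sample.csv")
df = df.drop_duplicates()
df = df.dropna(how="all")  # remove rows where every column is empty
df.to_csv("data/cleaned.csv", index=False)
```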
### Data Transformer

- Add calculated fields (totals, dates)
- Create aggregations by category
- Apply data formatting
- Filter data by criteria
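A sketch of the transformation bullets above, assuming pandas and hypothetical `value`, `quantity`, and `category` columns:

```python
# Illustrative transformation: add a calculated field and aggregate by
# category. Column names are hypothetical, not the framework's schema.
import pandas as pd

df = pd.read_csv("data/cleaned.csv")
df["total"] = df["value"] * df.get("quantity", 1)  # defaults to 1 if absent
summary = df.groupby("category")["value"].sum().reset_index()
summary.to_csv("data/aggregated.csv", index=False)
```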
### Data Analyzer

- Generate descriptive statistics
- Find data quality issues
- Identify patterns and outliers
- Create insights and recommendations
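A sketch of what the analysis could look like, assuming pandas and a hypothetical `value` column:

```python
# Illustrative analysis: descriptive statistics plus a simple IQR-based
# outlier check (a stand-in for data_analyzer.py, not its actual code).
import pandas as pd

df = pd.read_csv("data/transformed.csv")
print(df.describe())

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]
print(f"Found {len(outliers)} potential outliers")
```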
### Report Generator

- Create HTML reports with styling
- Generate Markdown summaries
- Export analysis as JSON
- Include data statistics and insights
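A minimal stand-in for the HTML report step, assuming pandas (the real `simple_reporter.py` may differ):

```python
# Illustrative report step: render summary statistics as a styled HTML page.
import pandas as pd

df = pd.read_csv("data/transformed.csv")
html = f"""<html><head><style>
table {{ border-collapse: collapse; }}
td, th {{ border: 1px solid #ccc; padding: 4px; }}
</style></head><body>
<h1>Pipeline Report</h1>
{df.describe().to_html()}
</body></html>"""
with open("data/report.html", "w") as f:
    f.write(html)
```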
### Data Exporter

- Export to CSV, JSON, Excel formats
- Create compressed archives
- Generate export summaries
- Organize output files
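A hypothetical sketch of the export step using only the standard library (the real component may use pandas, which would also cover the Excel format listed above):

```python
# Illustrative exporter: write the same records to CSV and JSON.
import csv
import json
import os

def export(records, basename):
    """Write records (a list of dicts) as <basename>.csv and <basename>.json."""
    os.makedirs(os.path.dirname(basename), exist_ok=True)
    with open(f"{basename}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
    with open(f"{basename}.json", "w") as f:
        json.dump(records, f, indent=2)

export([{"id": 1, "category": "A", "value": 42}], "data/final/results")
```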
## Key Features

- **Modular Design**: Each component is independent and reusable
- **Pipeline Configuration**: Define workflows in YAML files
- **Multiple Formats**: Support for CSV, JSON, Excel, HTML, and Markdown
- **Error Handling**: Comprehensive logging and error reporting
- **Dependency Management**: Automatic step ordering from declared dependencies (see the sketch below)
- **Flexible Parameters**: Command-line configuration for all components
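The step-ordering behavior implies a topological sort over `depends_on`. One plausible way `run_pipeline.py` could resolve and execute the YAML schema shown earlier (illustrative only, not the framework's actual implementation):

```python
# Illustrative runner: order steps by depends_on, then invoke each
# component script with its parameters as CLI flags.
import subprocess
import yaml  # PyYAML, assumed to be listed in requirements.txt

def ordered_steps(steps):
    """Topologically sort steps by their depends_on lists."""
    done, result = set(), []
    pending = {s["id"]: s for s in steps}
    while pending:
        ready = [s for s in pending.values()
                 if set(s.get("depends_on", [])) <= done]
        if not ready:
            raise ValueError("Circular or missing dependency")
        for step in ready:
            result.append(step)
            done.add(step["id"])
            del pending[step["id"]]
    return result

def run_pipeline(config_path):
    with open(config_path) as f:
        config = yaml.safe_load(f)
    for step in ordered_steps(config["steps"]):
        cmd = ["python", step["component"]]
        for flag, value in step.get("parameters", {}).items():
            cmd.append(flag)
            if value is not True:  # booleans map to bare flags like --clean-data
                cmd.append(str(value))
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_pipeline("pipelines/simple_pipeline.yaml")
```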
## Example Output

After running the simple pipeline:

```
✅ Pipeline completed successfully!

📊 Results Summary:
- Generated sample data: 100 rows, 6 columns
- Created 8 derived fields
- Found 3 data insights
- Exported 2 formats: CSV, JSON
- Generated HTML report: data/results/report.html
```
## 🤝 Contributing
This framework is designed as a demonstration of pipeline architecture patterns. Feel free to:
- Add new processing components
- Create custom pipeline configurations
- Extend the component interfaces
- Improve error handling and logging
## 📄 License
MIT License - see LICENSE file for details.
---
**PythonWorkflow Framework** - Simple, modular data processing pipelines
## 🎯 Use Cases

- **Data Analysis Pipelines**: ETL processes for research data
- **Machine Learning Workflows**: From data prep to model deployment
- **Report Generation**: Automated analysis and reporting
- **Batch Processing**: Large-scale data processing jobs
- **Experimental Workflows**: Reproducible research experiments
## 🔗 Related Tools
This framework is designed to be lightweight and can be integrated with:
- **Apache Airflow** for production scheduling
- **MLflow** for experiment tracking
- **Docker** for containerized execution
- **Jupyter Notebooks** for interactive development