DataCleanAI is a local, AI-assisted data quality toolkit. It analyzes tabular datasets for common issues (missing values, outliers, duplicates, type inconsistencies) and applies automated cleaning so the data is ready for analysis.

DataCleanAI - AI-Powered Data Quality System


An advanced AI-powered system that automatically detects and rectifies common data quality issues, making datasets ready for analysis or modeling.

Demo Video

See DataCleanAI in action: the demo video is available in the demo/ directory.

Project Architecture

This project follows clean software architecture principles with clear separation of concerns:

DataCleanAI/
├── 📁 backend/                   # Python FastAPI backend
│   ├── 📁 app/                   # Application core
│   │   ├── 📁 api/               # API routes and endpoints
│   │   ├── 📁 core/              # Configuration and utilities
│   │   ├── 📁 models/            # Database models
│   │   ├── 📁 services/          # Business logic layer
│   │   └── 📁 ml/                # Machine learning components
│   ├── 📁 database/              # Database files
│   ├── 📁 static/                # Static files (if any)
│   └── 📁 storage/               # File storage (uploads, models, logs)
├── 📁 frontend/                  # React TypeScript frontend
│   ├── 📁 public/                # Static assets
│   └── 📁 src/                   # Source code
│       ├── 📁 components/        # Reusable UI components
│       ├── 📁 pages/             # Page components
│       └── 📁 services/          # API communication
├── 📁 tests/                     # Test files for backend
├── 📁 docs/                      # Documentation
├── 📁 config/                    # Configuration files
├── 📁 scripts/                   # Utility and start scripts (e.g., start_backend.sh, start_frontend.sh)
├── 📁 examples/                  # Sample datasets for testing
├── 📄 LICENSE                    # License file
└── 📄 README.md                  # Project overview

Quick Start

Prerequisites

  • Python 3.9+ (Python 3.12 recommended)
  • Node.js 16+
  • npm or yarn

Installation

Option 1: Automated Setup (Recommended)

Use the professional setup script to install all dependencies and initialize the project:

git clone <repository-url>
cd DataCleanAI
chmod +x scripts/setup.sh
./scripts/setup.sh

This script will:

  • Check prerequisites (Python, Node.js, npm)
  • Set up the Python virtual environment
  • Install backend and frontend dependencies
  • Initialize the database and directories
  • Run initial tests and set up development tools

After setup, start the backend and frontend:

./scripts/start_backend.sh
./scripts/start_frontend.sh

To install dependencies only (without the full project initialization), run:

./scripts/install_dependencies.sh

Option 2: Manual Setup

# Backend setup
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install fastapi uvicorn sqlalchemy pandas numpy scipy scikit-learn missingno plotly pydantic-settings python-multipart

# Frontend setup
cd frontend
npm install
Environment variables

# Generate a SECRET_KEY for backend/.env (run once)
python - <<'PY'
import secrets, pathlib
pathlib.Path('backend/.env').write_text('SECRET_KEY='+secrets.token_urlsafe(32)+'\n')
print('Wrote backend/.env (SECRET_KEY set)')
PY

Running the Application

Start Backend

source venv/bin/activate
cd backend
uvicorn app.main:app --reload --host 127.0.0.1 --port 8000

Start Frontend

cd frontend
npm start

Access Points

  • Frontend: http://localhost:3000 (default React dev server port)
  • Backend API: http://127.0.0.1:8000
  • Interactive API docs (FastAPI default): http://127.0.0.1:8000/docs

📊 Features

Data Analysis & Diagnosis

  • Missing Values Detection: Advanced algorithms to identify missing data patterns
  • Outlier Detection: Multiple statistical and ML-based methods
  • Duplicate Detection: Intelligent duplicate identification
  • Data Type Analysis: Automatic detection of format inconsistencies
  • Distribution Analysis: Statistical analysis and skewness detection
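
As an illustration of these checks, here is a minimal pandas sketch (hypothetical sample data; this is not DataCleanAI's internal API):

```python
import numpy as np
import pandas as pd

# Hypothetical table of the kind DataCleanAI analyzes
df = pd.DataFrame({
    "age":  [25, 30, np.nan, 30, 120],  # one missing value, one extreme value
    "city": ["NY", "LA", "LA", "LA", "NY"],
})

# Missing-value report: fraction of missing cells per column
missing_report = df.isna().mean()

# Duplicate detection: rows identical across all columns
n_duplicates = int(df.duplicated().sum())

# IQR-based outlier detection on a numeric column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

print(missing_report["age"])    # 0.2 (1 of 5 values missing)
print(n_duplicates)             # 1 (row (30, "LA") repeats)
print(int(outlier_mask.sum()))  # 1 (age 120 flagged)
```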

Data Cleaning & Transformation

  • Smart Imputation: Automatic selection of best imputation strategy
  • Outlier Treatment: Remove, cap, or transform based on context
  • Feature Scaling: Min-Max, Standard, Robust scaling
  • Categorical Encoding: One-Hot, Label, Target encoding
  • Date/Time Processing: Automatic standardization
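
For example, imputation followed by Min-Max scaling can be sketched with scikit-learn (the strategy choices here are illustrative, not the system's auto-selection logic):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric column with one missing value
X = np.array([[1.0], [np.nan], [3.0], [5.0]])

# Median imputation (one of several strategies an auto-selector might pick)
imputed = SimpleImputer(strategy="median").fit_transform(X)

# Min-Max scaling to the [0, 1] range
scaled = MinMaxScaler().fit_transform(imputed)

print(imputed.ravel())  # [1. 3. 3. 5.]
print(scaled.ravel())   # [0.  0.5 0.5 1. ]
```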

Machine Learning Pipeline

  • AutoML Integration: Automated feature engineering
  • Model Selection: Automatic algorithm selection
  • Hyperparameter Tuning: Bayesian optimization
  • Cross-Validation: Robust model evaluation
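
A minimal cross-validation sketch on synthetic stand-in data (not the project's actual pipeline) looks like:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a cleaned dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation: per-fold accuracy gives a robust estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores.shape)  # (5,) — one accuracy score per fold
```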

Testing

Run Tests

# Run all tests
python -m pytest tests/

# Run specific test modules (includes sample-data checks)
python -m pytest tests/unit/test_data_quality.py
python -m pytest tests/unit/test_core_functionality.py

Development

Project Structure Guidelines

  • backend/app/: Follow FastAPI best practices
  • frontend/src/: React components with TypeScript
  • tests/: Comprehensive test coverage
  • docs/: Keep documentation updated
  • config/: Environment-specific configurations

Code Quality

  • Linting: ESLint for frontend, Black for backend
  • Type Checking: TypeScript for frontend, mypy for backend
  • Testing: Jest for frontend, pytest for backend
  • Documentation: Comprehensive docstrings and README files

Deployment

Simple Deployment

# Run backend
cd backend && uvicorn app.main:app --host 0.0.0.0 --port 8000

# Run frontend (in new terminal)
cd frontend && npm start

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Follow the coding standards
  4. Add tests for new functionality
  5. Update documentation
  6. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ❤️ for data scientists and analysts
