DataCleanAI is a local, AI-assisted data quality toolkit. It analyzes tabular datasets for common issues (missing values, outliers, duplicates, type inconsistencies) and applies automated cleaning so the data is ready for analysis.

DataCleanAI - AI-Powered Data Quality System


An advanced AI-powered system that automatically detects and rectifies common data quality issues, making datasets ready for analysis or modeling.

Demo Video

See DataCleanAI in action: the demo video is available in the demo/ directory.

Project Architecture

This project follows clean software architecture principles with clear separation of concerns:

DataCleanAI/
├── 📁 backend/                   # Python FastAPI backend
│   ├── 📁 app/                   # Application core
│   │   ├── 📁 api/               # API routes and endpoints
│   │   ├── 📁 core/              # Configuration and utilities
│   │   ├── 📁 models/            # Database models
│   │   ├── 📁 services/          # Business logic layer
│   │   └── 📁 ml/                # Machine learning components
│   ├── 📁 database/              # Database files
│   ├── 📁 static/                # Static files (if any)
│   └── 📁 storage/               # File storage (uploads, models, logs)
├── 📁 frontend/                  # React TypeScript frontend
│   ├── 📁 public/                # Static assets
│   └── 📁 src/                   # Source code
│       ├── 📁 components/        # Reusable UI components
│       ├── 📁 pages/             # Page components
│       └── 📁 services/          # API communication
├── 📁 tests/                     # Test files for backend
├── 📁 docs/                      # Documentation
├── 📁 config/                    # Configuration files
├── 📁 scripts/                   # Utility and start scripts (e.g., start_backend.sh, start_frontend.sh)
├── 📁 examples/                  # Sample datasets for testing
├── 📄 LICENSE                    # License file
└── 📄 README.md                  # Project overview

Quick Start

Prerequisites

  • Python 3.9+ (Python 3.12 recommended)
  • Node.js 16+
  • npm or yarn

Installation

Option 1: Automated Setup (Recommended)

Use the professional setup script to install all dependencies and initialize the project:

git clone <repository-url>
cd DataCleanAI
chmod +x scripts/setup.sh
./scripts/setup.sh

This script will:

  • Check prerequisites (Python, Node.js, npm)
  • Set up the Python virtual environment
  • Install backend and frontend dependencies
  • Initialize the database and directories
  • Run initial tests and set up development tools

After setup, start the backend and frontend:

./scripts/start_backend.sh
./scripts/start_frontend.sh

To install dependencies only (without the full project initialization), run:

./scripts/install_dependencies.sh

Option 2: Manual Setup

# Backend setup
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install fastapi uvicorn sqlalchemy pandas numpy scipy scikit-learn missingno plotly pydantic-settings python-multipart

# Frontend setup
cd frontend
npm install
Environment variables

# Generate a SECRET_KEY for backend/.env (run once)
python - <<'PY'
import secrets, pathlib
pathlib.Path('backend/.env').write_text('SECRET_KEY='+secrets.token_urlsafe(32)+'\n')
print('Wrote backend/.env (SECRET_KEY set)')
PY

Running the Application

Start Backend

source venv/bin/activate
cd backend
uvicorn app.main:app --reload --host 127.0.0.1 --port 8000

Start Frontend

cd frontend
npm start

Access Points

  • Frontend: http://localhost:3000 (default React dev server port)
  • Backend API: http://127.0.0.1:8000
  • Interactive API docs (FastAPI default): http://127.0.0.1:8000/docs

📊 Features

Data Analysis & Diagnosis

  • Missing Values Detection: Advanced algorithms to identify missing data patterns
  • Outlier Detection: Multiple statistical and ML-based methods
  • Duplicate Detection: Intelligent duplicate identification
  • Data Type Analysis: Automatic detection of format inconsistencies
  • Distribution Analysis: Statistical analysis and skewness detection
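
As an illustration of these checks, here is a minimal pandas sketch (hypothetical sample data; this is not DataCleanAI's internal API):

```python
import numpy as np
import pandas as pd

# Hypothetical table of the kind DataCleanAI analyzes
df = pd.DataFrame({
    "age":  [25, 30, np.nan, 30, 120],  # one missing value, one extreme value
    "city": ["NY", "LA", "LA", "LA", "NY"],
})

# Missing-value report: fraction of missing cells per column
missing_report = df.isna().mean()

# Duplicate detection: rows identical across all columns
n_duplicates = int(df.duplicated().sum())

# IQR-based outlier detection on a numeric column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

print(missing_report["age"])    # 0.2 (1 of 5 values missing)
print(n_duplicates)             # 1 (row (30, "LA") repeats)
print(int(outlier_mask.sum()))  # 1 (age 120 flagged)
```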

Data Cleaning & Transformation

  • Smart Imputation: Automatic selection of best imputation strategy
  • Outlier Treatment: Remove, cap, or transform based on context
  • Feature Scaling: Min-Max, Standard, Robust scaling
  • Categorical Encoding: One-Hot, Label, Target encoding
  • Date/Time Processing: Automatic standardization
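
For example, imputation followed by Min-Max scaling can be sketched with scikit-learn (the strategy choices here are illustrative, not the system's auto-selection logic):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric column with one missing value
X = np.array([[1.0], [np.nan], [3.0], [5.0]])

# Median imputation (one of several strategies an auto-selector might pick)
imputed = SimpleImputer(strategy="median").fit_transform(X)

# Min-Max scaling to the [0, 1] range
scaled = MinMaxScaler().fit_transform(imputed)

print(imputed.ravel())  # [1. 3. 3. 5.]
print(scaled.ravel())   # [0.  0.5 0.5 1. ]
```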

Machine Learning Pipeline

  • AutoML Integration: Automated feature engineering
  • Model Selection: Automatic algorithm selection
  • Hyperparameter Tuning: Bayesian optimization
  • Cross-Validation: Robust model evaluation
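
A minimal cross-validation sketch on synthetic stand-in data (not the project's actual pipeline) looks like:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a cleaned dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation: per-fold accuracy gives a robust estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores.shape)  # (5,) — one accuracy score per fold
```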

Testing

Run Tests

# Run all tests
python -m pytest tests/

# Run specific test modules (includes sample-data checks)
python -m pytest tests/unit/test_data_quality.py
python -m pytest tests/unit/test_core_functionality.py

Development

Project Structure Guidelines

  • backend/app/: Follow FastAPI best practices
  • frontend/src/: React components with TypeScript
  • tests/: Comprehensive test coverage
  • docs/: Keep documentation updated
  • config/: Environment-specific configurations

Code Quality

  • Linting: ESLint for frontend, Black for backend
  • Type Checking: TypeScript for frontend, mypy for backend
  • Testing: Jest for frontend, pytest for backend
  • Documentation: Comprehensive docstrings and README files

Deployment

Simple Deployment

# Run backend
cd backend && uvicorn app.main:app --host 0.0.0.0 --port 8000

# Run frontend (in new terminal)
cd frontend && npm start

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Follow the coding standards
  4. Add tests for new functionality
  5. Update documentation
  6. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ❤️ for data scientists and analysts
