An advanced AI-powered system that automatically detects and rectifies common data quality issues, making datasets ready for analysis or modeling.
See DataCleanAI in action:
Demo video is in demo/
This project follows clean software architecture principles with clear separation of concerns:
DataCleanAI/
├── 📁 backend/                 # Python FastAPI backend
│   ├── 📁 app/                 # Application core
│   │   ├── 📁 api/             # API routes and endpoints
│   │   ├── 📁 core/            # Configuration and utilities
│   │   ├── 📁 models/          # Database models
│   │   ├── 📁 services/        # Business logic layer
│   │   └── 📁 ml/              # Machine learning components
│   ├── 📁 database/            # Database files
│   ├── 📁 static/              # Static files (if any)
│   └── 📁 storage/             # File storage (uploads, models, logs)
├── 📁 frontend/                # React TypeScript frontend
│   ├── 📁 public/              # Static assets
│   └── 📁 src/                 # Source code
│       ├── 📁 components/      # Reusable UI components
│       ├── 📁 pages/           # Page components
│       └── 📁 services/        # API communication
├── 📁 tests/                   # Test files for backend
├── 📁 docs/                    # Documentation
├── 📁 config/                  # Configuration files
├── 📁 scripts/                 # Utility and start scripts (e.g., start_backend.sh, start_frontend.sh)
├── 📁 examples/                # Sample datasets for testing
├── 📄 LICENSE                  # License file
└── 📄 README.md                # Project overview
- Python 3.9+ (Python 3.12 recommended)
- Node.js 16+
- npm or yarn
#### Option 1: Automated Setup
Use the setup script to install all dependencies and initialize the project:
git clone <repository-url>
cd DataCleanAI
chmod +x scripts/setup.sh
./scripts/setup.sh

This script will:
- Check prerequisites (Python, Node.js, npm)
- Set up the Python virtual environment
- Install backend and frontend dependencies
- Initialize the database and directories
- Run initial tests and set up development tools
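
The prerequisite check in the first step can be approximated in Python. This is only a sketch of what such a check might look like, not the contents of `scripts/setup.sh`: the function name `check_prerequisites` is hypothetical, and it only verifies that `node` and `npm` are on the `PATH`, not that Node.js is version 16 or newer.

```python
import shutil
import sys

# Minimum Python version taken from the prerequisites list above.
MIN_PYTHON = (3, 9)


def check_prerequisites(required_tools=("node", "npm")):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = sys.version.split()[0]
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}"
        )
    for tool in required_tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

Returning a list of problems rather than raising on the first failure lets the caller report everything that is missing in one pass.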
After setup, start the backend and frontend:
./scripts/start_backend.sh
./scripts/start_frontend.sh

./scripts/install_dependencies.sh
#### Option 2: Manual Setup
```bash
# Backend setup (run from the project root)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install fastapi uvicorn sqlalchemy pandas numpy scipy scikit-learn missingno plotly pydantic-settings python-multipart

# Frontend setup
cd frontend
npm install
cd ..

# Generate a SECRET_KEY for backend/.env (run once, from the project root)
python - <<'PY'
import secrets, pathlib
pathlib.Path('backend/.env').write_text('SECRET_KEY='+secrets.token_urlsafe(32)+'\n')
print('Wrote backend/.env (SECRET_KEY set)')
PY

# Start the backend (from the project root)
source venv/bin/activate
cd backend
uvicorn app.main:app --reload --host 127.0.0.1 --port 8000

# Start the frontend (in a new terminal, from the project root)
cd frontend
npm start
```

Once both servers are running:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000
- API Documentation: http://localhost:8000/docs
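
To confirm that the URLs above respond after startup, a small standard-library check can help. This is a sketch only: the helper name `is_up` is hypothetical, and it treats any 2xx or 3xx response as "up" without inspecting the body.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if an HTTP GET to `url` answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    targets = {
        "Frontend": "http://localhost:3000",
        "API docs": "http://localhost:8000/docs",
    }
    for name, url in targets.items():
        print(f"{name}: {'up' if is_up(url) else 'down'}")
```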
- Missing Values Detection: Advanced algorithms to identify missing data patterns
- Outlier Detection: Multiple statistical and ML-based methods
- Duplicate Detection: Intelligent duplicate identification
- Data Type Analysis: Automatic detection of format inconsistencies
- Distribution Analysis: Statistical analysis and skewness detection
- Smart Imputation: Automatic selection of best imputation strategy
- Outlier Treatment: Remove, cap, or transform based on context
- Feature Scaling: Min-Max, Standard, Robust scaling
- Categorical Encoding: One-Hot, Label, Target encoding
- Date/Time Processing: Automatic standardization
- AutoML Integration: Automated feature engineering
- Model Selection: Automatic algorithm selection
- Hyperparameter Tuning: Bayesian optimization
- Cross-Validation: Robust model evaluation
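
As a rough illustration of the detection and cleaning features listed above, the following pandas sketch profiles a DataFrame and then applies simple fixes. It is not DataCleanAI's implementation: the function names `profile` and `clean` are hypothetical, and the specific choices (the 1.5 × IQR outlier rule, median imputation, capping, and min-max scaling) are assumptions picked for brevity.

```python
import pandas as pd


def profile(df):
    """Summarize missing values, duplicate rows, and IQR outliers per numeric column."""
    report = {
        "missing": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outliers": {},
    }
    for col in df.select_dtypes("number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        report["outliers"][col] = int(mask.sum())
    return report


def clean(df):
    """Drop duplicates, impute numeric medians, cap to the IQR fences, then min-max scale."""
    out = df.drop_duplicates().copy()
    for col in out.select_dtypes("number").columns:
        out[col] = out[col].fillna(out[col].median())
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        out[col] = out[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
        span = out[col].max() - out[col].min()
        if span > 0:  # Skip scaling for constant columns to avoid dividing by zero.
            out[col] = (out[col] - out[col].min()) / span
    return out


if __name__ == "__main__":
    sample = pd.DataFrame(
        {"age": [25, 30, None, 30, 200], "city": ["A", "B", "C", "B", None]}
    )
    print(profile(sample))
    print(clean(sample))
```

Capping is used here instead of removing outliers so that the row count changes only through duplicate removal; which treatment is appropriate depends on the dataset.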
# Run all tests
python -m pytest tests/
# Run specific test
python tests/test_data_quality.py
# Test with sample data
python tests/test_core_functionality.py

# Analyze sample datasets
python tests/unit/test_data_quality.py
# Test core functionality
python tests/unit/test_core_functionality.py

- backend/app/: Follow FastAPI best practices
- frontend/src/: React components with TypeScript
- tests/: Comprehensive test coverage
- docs/: Keep documentation updated
- config/: Environment-specific configurations
- Linting and formatting: ESLint for frontend, Black for backend
- Type Checking: TypeScript for frontend, mypy for backend
- Testing: Jest for frontend, pytest for backend
- Documentation: Comprehensive docstrings and README files
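
For the backend testing guideline, a pytest-style test might look like the sketch below. The helper `count_missing` and the file name are hypothetical and exist only to show the shape of a small, focused test; they are not part of the project's code.

```python
# Hypothetical file: tests/unit/test_missing_values.py


def count_missing(rows, column):
    """Count rows where `column` is absent, None, or an empty string."""
    return sum(1 for row in rows if row.get(column) in (None, ""))


def test_count_missing_counts_none_empty_and_absent():
    rows = [{"age": 31}, {"age": None}, {"age": ""}, {}]
    assert count_missing(rows, "age") == 3


def test_count_missing_returns_zero_for_complete_column():
    rows = [{"age": 31}, {"age": 45}]
    assert count_missing(rows, "age") == 0
```

pytest collects functions whose names start with `test_`, so running `python -m pytest tests/` would pick up a file like this automatically.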
# Run backend
cd backend && uvicorn app.main:app --host 0.0.0.0 --port 8000
# Run frontend (in new terminal)
cd frontend && npm start

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Follow the coding standards
- Add tests for new functionality
- Update documentation
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ for data scientists and analysts