feat: add QualityScorer module for automatic data quality assessment (#34) by AliiiBenn · Pull Request #46 · wareflowx/excel-to-sql

AliiiBenn · 2026-01-23T14:18:01Z

Summary

Implements the QualityScorer module for automatic data quality assessment of pandas DataFrames
Provides comprehensive quality reports with scores, grades, and actionable recommendations
Detects common data quality issues: null values, duplicates, empty columns, outliers
Includes 29 comprehensive tests with 99% code coverage

Implementation Details

QualityScorer Class (`excel_to_sql/auto_pilot/quality.py`)

Core Features:

Quality Score (0-100): Starts at 100, deducts points for:
- Null values above threshold (default 10%)
- Duplicates in potential primary key columns (2 points each)
- Empty columns (5 points each)
- Statistical outliers (0.1 points each, capped at 10 total)
Letter Grade Assignment:
- A: 90-100 (excellent quality)
- B: 75-89 (good quality)
- C: 60-74 (acceptable quality)
- D: 1-59 (poor quality)
- F: 0 (unusable)
Issue Detection:
- Null value percentage per column
- Duplicate values in potential PK columns
- Empty columns (100% null)
- Type mismatches (numeric data stored as object)
- Statistical outliers using 3-sigma rule
Per-Column Statistics:
- Data type, null count/percentage
- Unique count/percentage
- Sample values (top 5)
- Primary key potential (≥95% unique)
- Empty column flag
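
The deduction and grading rules listed above can be sketched in plain Python. This is an illustrative sketch, not the module's actual code: the names `sketch_score` and `letter_grade` are hypothetical, the size of the null-value deduction is not stated in this description and is therefore passed in as a parameter, and "2 points each" is assumed to mean per duplicate finding.

```python
def sketch_score(duplicate_findings: int, empty_columns: int,
                 outliers: int, null_penalty: float = 0.0) -> float:
    """Start at 100 and subtract the deductions described in the PR."""
    # Outliers cost 0.1 points each, capped at 10 points in total
    outlier_penalty = min(outliers * 0.1, 10.0)
    score = (100.0
             - null_penalty                # amount unspecified in the PR (assumption)
             - 2.0 * duplicate_findings    # 2 points per duplicate finding (assumed unit)
             - 5.0 * empty_columns         # 5 points per empty column
             - outlier_penalty)
    # The test list mentions a score floor at 0
    return max(score, 0.0)


def letter_grade(score: float, a_min: float = 90,
                 b_min: float = 75, c_min: float = 60) -> str:
    """Map a score to a letter grade using the default thresholds."""
    if score >= a_min:
        return "A"
    if score >= b_min:
        return "B"
    if score >= c_min:
        return "C"
    # D covers any remaining non-zero score; F is reserved for 0
    return "D" if score > 0 else "F"


score = sketch_score(duplicate_findings=1, empty_columns=1, outliers=20)
grade = letter_grade(score)
# The same score graded with custom thresholds (A at 95+ in this example)
strict_grade = letter_grade(score, a_min=95, b_min=80, c_min=65)
```

Keeping scoring and grading as two separate steps means that custom thresholds change only the grade, never the numeric score.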

API Usage:

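import pandas as pd

from excel_to_sql.auto_pilot.quality import QualityScorer

# Illustrative input; any pandas DataFrame works here
df = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", None]})

scorer = QualityScorer()
report = scorer.generate_quality_report(df, "products")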

print(f"Score: {report['score']}/100")  # e.g., 85
print(f"Grade: {report['grade']}")      # e.g., 'B'
print(f"Issues: {report['issues']}")    # List of detected issues

Configuration:

scorer = QualityScorer(
    null_threshold=15,    # Deduct for nulls above 15%
    grade_a_min=85,       # A grade at 85+
    grade_b_min=70,       # B grade at 70+
    grade_c_min=55        # C grade at 55+
)

Testing (`tests/test_quality_scorer.py`)

29 comprehensive tests covering:

Quality report generation (basic, high quality, with issues)
Edge cases (empty DataFrame, insufficient data)
Duplicate detection in potential PKs
Empty column detection
Outlier detection using 3-sigma rule
Letter grade assignment
Column statistics accuracy
Score calculation logic
Configuration (default/custom thresholds)
Type hints and docstrings
Integration tests with realistic datasets

Test Results: ✅ All 29 tests passing
Code Coverage: 99% for quality module

Technical Notes

Outlier Detection

Uses 3-sigma rule: values outside mean ± 3×std are outliers
Requires at least 10 data points for meaningful detection
Capped at 10 points deduction total to avoid excessive penalties
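
As a rough illustration of the rule above, the following standard-library sketch counts 3-sigma outliers in a list of numbers. The helper name `count_outliers_3sigma` is hypothetical, `None` is assumed to mark missing values, and the sample standard deviation is used; the module itself works on pandas columns and may differ on each of these points.

```python
from statistics import mean, stdev


def count_outliers_3sigma(values, min_points=10):
    """Count values lying outside mean ± 3×std."""
    # Drop missing values (None is the assumed null marker in this sketch)
    clean = [float(v) for v in values if v is not None]
    # Too few points for meaningful detection: report no outliers
    if len(clean) < min_points:
        return 0
    mu = mean(clean)
    sigma = stdev(clean)  # sample standard deviation (assumption)
    # A constant column has no spread and therefore no outliers
    if sigma == 0:
        return 0
    return sum(1 for v in clean if abs(v - mu) > 3 * sigma)


readings = [10] * 20 + [1000]
n_outliers = count_outliers_3sigma(readings)
# Score deduction for this column, before the overall 10-point cap
deduction = min(n_outliers * 0.1, 10.0)
```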

Primary Key Detection

Columns with ≥95% unique values are flagged as potential PKs
Duplicate checking only applies to potential PK columns
Helps identify unique identifiers and reference data
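
The primary key heuristic can be sketched as follows. This is a plain-Python illustration with hypothetical helper names; it assumes the uniqueness ratio is computed over non-null values, which this description does not specify.

```python
def is_potential_pk(values, min_unique_ratio=0.95):
    """Flag a column whose share of unique values is at least 95%."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False  # an empty column cannot be a key
    return len(set(non_null)) / len(non_null) >= min_unique_ratio


def count_duplicates(values):
    """Number of non-null values that repeat an earlier value."""
    non_null = [v for v in values if v is not None]
    return len(non_null) - len(set(non_null))


ids = list(range(99)) + [0]      # 100 values, one repeated
categories = ["a", "b"] * 50     # low-cardinality column

# Duplicates are only checked for columns flagged as potential PKs
pk_duplicates = count_duplicates(ids) if is_potential_pk(ids) else 0
```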

Type Hints

Full type annotation support using `from __future__ import annotations`
All public methods have comprehensive docstrings with examples

Resolves

#34 - Implement missing QualityScorer module for Auto-Pilot mode

Checklist

🤖 Generated with Claude Code

…dling

Implement structured exception hierarchy for excel-to-sql:
- ExcelToSqlError (base) - All custom exceptions inherit from this
- ExcelFileError - Excel file operation failures
- ConfigurationError - Configuration issues
- ValidationError - Data validation failures
- DatabaseError - Database operation failures

Features:
- Context dictionary for additional error information
- to_dict() method for serialization
- Rich string representation with context details

This enables better error handling, debugging, and user-friendly error messages throughout the application.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
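
A minimal sketch of what such a hierarchy could look like is given here. The class names, the context dictionary and `to_dict()` come from the commit message; the constructor signatures, the keys returned by `to_dict()` and the string format are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


class ExcelToSqlError(Exception):
    """Base class; all custom exceptions inherit from it."""

    def __init__(self, message: str, context: dict[str, Any] | None = None) -> None:
        super().__init__(message)
        self.message = message
        self.context = dict(context or {})

    def to_dict(self) -> dict[str, Any]:
        # Serializable form (key names are assumed, not taken from the PR)
        return {
            "error_type": type(self).__name__,
            "message": self.message,
            "context": self.context,
        }

    def __str__(self) -> str:
        # String representation that appends context details when present
        if not self.context:
            return self.message
        details = ", ".join(f"{k}={v!r}" for k, v in self.context.items())
        return f"{self.message} ({details})"


class ExcelFileError(ExcelToSqlError):
    """Excel file operation failures."""

    def __init__(self, message: str, file_path: str | None = None,
                 operation: str | None = None) -> None:
        context = {"file_path": file_path, "operation": operation}
        # Keep only the context entries that were actually provided
        super().__init__(message, {k: v for k, v in context.items() if v is not None})


err = ExcelFileError("Cannot read workbook", file_path="data.xlsx", operation="read")
```

Catching `ExcelToSqlError` handles every custom type at once, while a specific subclass can still be caught on its own.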

Update ExcelFile class to raise custom ExcelFileError instead of generic ValueError for better error handling:
- read() - Raises ExcelFileError with file_path and operation context
- read_all_sheets() - Specific error handling for empty/invalid files
- read_sheets() - Wraps errors with ExcelFileError

Improvements:
- Distinguish between EmptyDataError (empty file) and ParserError (invalid format)
- Include context (file_path, operation) for debugging
- Preserve FileNotFoundError and PermissionError as-is
- Chain original exceptions for full traceback

This allows the CLI to provide specific error messages and tips for common Excel file errors.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
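
The wrapping pattern described in this commit can be sketched as below. The `ExcelFileError` class is a minimal stand-in rather than the project's class, and `read_with_context` and its `reader` parameter are hypothetical names used only for illustration.

```python
class ExcelFileError(Exception):
    """Minimal stand-in for the project's ExcelFileError (assumed shape)."""

    def __init__(self, message, file_path=None, operation=None):
        super().__init__(message)
        self.context = {"file_path": file_path, "operation": operation}


def read_with_context(file_path, reader):
    """Call reader(file_path), wrapping unexpected failures in ExcelFileError."""
    try:
        return reader(file_path)
    except (FileNotFoundError, PermissionError):
        # Preserved as-is so callers can show specific messages
        raise
    except Exception as exc:
        # "from exc" chains the original exception for a full traceback
        raise ExcelFileError(
            f"Failed to read {file_path}",
            file_path=file_path,
            operation="read",
        ) from exc


def broken_reader(path):
    raise ValueError("not a valid workbook")


try:
    read_with_context("data.xlsx", broken_reader)
except ExcelFileError as err:
    original = err.__cause__  # the ValueError raised by the reader
```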

…able tips

Replace generic exception handlers with specific exception types throughout the CLI:

Import Command:
- FileNotFoundError → "File not found" + tip to check path
- PermissionError → "Permission denied" + tip to check permissions
- EmptyDataError → "Empty Excel file" + tip to add data
- ParserError → "Invalid Excel format" + tip to check file type
- ConfigurationError → Config error + tip to check config files
- ValidationError → Validation error with details
- DatabaseError → Database error with context

Export Command:
- FileNotFoundError → "Table not found" + tip to import first
- PermissionError → "Permission denied" + tip to check write access
- DatabaseError → Database error with context

Magic Command:
- Improved error messages for file/sheet processing
- Better exception handling in interactive mode quality reports
- Replaced bare except: block with specific (AttributeError, TypeError)

Status Command:
- ConfigurationError for config-related failures

Additional:
- Added logger for unexpected errors
- All error messages follow consistent format with tips
- Debug mode shows full traceback on unexpected errors

Fixes #35

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add 25 tests covering the custom exception hierarchy:

ExcelToSqlError (base):
- Base exception creation with and without context
- to_dict() serialization

ExcelFileError:
- Creation with file_path and operation
- Context dictionary inclusion
- to_dict() serialization

ConfigurationError:
- Creation with config_file and config_key
- Full context handling

ValidationError:
- Creation with field, value, and rule
- Full context handling

DatabaseError:
- Creation with table, operation, and sql_error
- Full context handling

Exception Hierarchy:
- All exceptions inherit from ExcelToSqlError
- Base exception catches all custom types
- Specific exception types can be caught individually
- Exception chaining preserves original traceback

All tests pass (25/25).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Implement comprehensive quality scoring system for pandas DataFrames with:
- Quality score calculation (0-100) based on multiple factors:
  - Null value percentage deduction
  - Duplicate detection in potential primary keys
  - Empty column detection
  - Statistical outlier detection (3-sigma rule)
- Letter grade assignment (A-D, F)
- Detailed issue reporting with actionable recommendations
- Per-column statistics:
  - Data type, null count/percentage
  - Unique count/percentage
  - Sample values
  - Primary key potential detection
  - Empty column flag
- Configurable quality thresholds
- Comprehensive docstrings with examples

Resolves: #34

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Add 29 tests covering all QualityScorer functionality:
- Quality report generation (basic, high quality, with issues)
- Empty DataFrame handling
- Duplicate detection in potential PKs
- Empty column detection
- Outlier detection using 3-sigma rule
- Letter grade assignment (A-D, F)
- Column statistics (nulls, uniques, types, samples)
- Primary key potential detection
- Score calculation:
  - Perfect data scoring
  - Null value deductions
  - Duplicate deductions
  - Empty column deductions
  - Score floor at 0
- Configuration (default/custom thresholds)
- Outlier detection edge cases (insufficient data, all null)
- Type hints and docstrings
- Integration tests with realistic data

All tests passing with 99% code coverage for quality module.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

AliiiBenn and others added 6 commits January 23, 2026 14:57

AliiiBenn wants to merge 6 commits into main from implement-quality-scorer-34
