feat: add QualityScorer module for automatic data quality assessment (#34) #46

Open
AliiiBenn wants to merge 6 commits into main from implement-quality-scorer-34

@AliiiBenn
Member

Summary

  • Implements the QualityScorer module for automatic data quality assessment of pandas DataFrames
  • Provides comprehensive quality reports with scores, grades, and actionable recommendations
  • Detects common data quality issues: null values, duplicates, empty columns, outliers
  • Includes 29 comprehensive tests with 99% code coverage

Implementation Details

QualityScorer Class (excel_to_sql/auto_pilot/quality.py)

Core Features:

  • Quality Score (0-100): Starts at 100, deducts points for:

    • Null values above threshold (default 10%)
    • Duplicates in potential primary key columns (2 points each)
    • Empty columns (5 points each)
    • Statistical outliers (0.1 points each, capped at 10 total)
  • Letter Grade Assignment:

    • A: 90-100 (excellent quality)
    • B: 75-89 (good quality)
    • C: 60-74 (acceptable quality)
    • D: 1-59 (poor quality)
    • F: 0 (unusable)
  • Issue Detection:

    • Null value percentage per column
    • Duplicate values in potential PK columns
    • Empty columns (100% null)
    • Type mismatches (numeric data stored as object)
    • Statistical outliers using 3-sigma rule
  • Per-Column Statistics:

    • Data type, null count/percentage
    • Unique count/percentage
    • Sample values (top 5)
    • Primary key potential (≥95% unique)
    • Empty column flag
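
The deduction rules above can be sketched as a small function. This is an illustrative sketch, not the module's actual code: the function name `quality_score` and its parameters are hypothetical, and the null-value penalty is passed in as a ready-made number because the description does not state how many points a column over the null threshold costs.

```python
from __future__ import annotations


def quality_score(
    null_penalty: float,
    pk_duplicates: int,
    empty_columns: int,
    outliers: int,
) -> float:
    """Return a 0-100 score from the deduction rules listed in the PR.

    null_penalty is an assumption: the points deducted for nulls above the
    threshold are not specified, so the caller supplies them.
    """
    # 0.1 points per outlier, capped at 10 points in total.
    outlier_penalty = min(outliers * 0.1, 10.0)
    score = (
        100.0
        - null_penalty
        - 2.0 * pk_duplicates  # 2 points per duplicate in a potential PK
        - 5.0 * empty_columns  # 5 points per fully empty column
        - outlier_penalty
    )
    # Clamp to the documented 0-100 range (the tests mention a floor at 0).
    return max(0.0, min(100.0, score))
```

For example, three duplicates, one empty column, and 500 outliers give 100 - 6 - 5 - 10 = 79 under these assumptions.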

API Usage:

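import pandas as pd

from excel_to_sql.auto_pilot.quality import QualityScorer

# Hypothetical sample data; any pandas DataFrame can be scored
df = pd.DataFrame({"id": [1, 2, 3], "price": [9.99, None, 4.50]})

scorer = QualityScorer()
report = scorer.generate_quality_report(df, "products")

print(f"Score: {report['score']}/100")  # e.g., 85
print(f"Grade: {report['grade']}")      # e.g., 'B'
print(f"Issues: {report['issues']}")    # List of detected issues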

Configuration:

scorer = QualityScorer(
    null_threshold=15,    # Deduct for nulls above 15%
    grade_a_min=85,       # A grade at 85+
    grade_b_min=70,       # B grade at 70+
    grade_c_min=55        # C grade at 55+
)
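
How the configurable thresholds might map a score to a letter grade can be sketched as follows. The function `letter_grade` is hypothetical; its defaults (90/75/60) come from the grade table in this description, and treating any score above 0 but below the C threshold as a D is an assumption, since the table only lists whole-number ranges.

```python
from __future__ import annotations


def letter_grade(
    score: float,
    grade_a_min: float = 90,
    grade_b_min: float = 75,
    grade_c_min: float = 60,
) -> str:
    """Map a 0-100 score to a letter grade using configurable minimums."""
    if score >= grade_a_min:
        return "A"
    if score >= grade_b_min:
        return "B"
    if score >= grade_c_min:
        return "C"
    # Assumption: anything above 0 but below the C threshold is a D.
    if score > 0:
        return "D"
    return "F"
```

With the defaults a score of 85 is a B; with `grade_a_min=85` the same score becomes an A.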

Testing (tests/test_quality_scorer.py)

29 comprehensive tests covering:

  • Quality report generation (basic, high quality, with issues)
  • Edge cases (empty DataFrame, insufficient data)
  • Duplicate detection in potential PKs
  • Empty column detection
  • Outlier detection using 3-sigma rule
  • Letter grade assignment
  • Column statistics accuracy
  • Score calculation logic
  • Configuration (default/custom thresholds)
  • Type hints and docstrings
  • Integration tests with realistic datasets

Test Results: ✅ All 29 tests passing
Code Coverage: 99% for quality module

Technical Notes

Outlier Detection

  • Uses 3-sigma rule: values outside mean ± 3×std are outliers
  • Requires at least 10 data points for meaningful detection
  • Total outlier deduction is capped at 10 points to avoid excessive penalties
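
The 3-sigma rule described above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the module's implementation: `find_outliers` is a hypothetical name, it works on a plain list rather than a pandas Series, and it uses the sample standard deviation (the description does not say which estimator the module uses).

```python
from __future__ import annotations

import statistics


def find_outliers(values: list[float | None], min_points: int = 10) -> list[float]:
    """Return values lying outside mean ± 3 * std, ignoring None entries."""
    data = [v for v in values if v is not None]
    # Too few points for a meaningful estimate: report nothing.
    if len(data) < min_points:
        return []
    mean = statistics.fmean(data)
    std = statistics.stdev(data)  # sample standard deviation (assumption)
    # A constant column has no spread, so nothing can be an outlier.
    if std == 0:
        return []
    lower, upper = mean - 3 * std, mean + 3 * std
    return [v for v in data if v < lower or v > upper]
```

One consequence worth knowing: with the sample standard deviation, a single extreme value among n points can be at most (n - 1) / sqrt(n) standard deviations from the mean, which is about 2.85 for n = 10. A lone outlier therefore only becomes detectable from 11 points upward.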

Primary Key Detection

  • Columns with ≥95% unique values are flagged as potential PKs
  • Duplicate checking only applies to potential PK columns
  • Helps identify unique identifiers and reference data
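
A minimal sketch of the potential-PK check, assuming the unique ratio is computed as distinct non-null values divided by the total row count (the description gives only the 95% threshold, not the exact formula). The name `pk_profile` and the returned keys are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from typing import Any, Hashable, Sequence


def pk_profile(
    values: Sequence[Hashable | None],
    min_unique_ratio: float = 0.95,
) -> dict[str, Any]:
    """Flag a column as a potential primary key and count its duplicates."""
    non_null = [v for v in values if v is not None]
    # Assumption: ratio of distinct non-null values to total rows.
    unique_ratio = len(set(non_null)) / len(values) if values else 0.0
    is_potential_pk = unique_ratio >= min_unique_ratio
    # Duplicates are only counted for potential PK columns.
    duplicates = 0
    if is_potential_pk:
        counts = Counter(non_null)
        duplicates = sum(c - 1 for c in counts.values() if c > 1)
    return {
        "unique_ratio": unique_ratio,
        "is_potential_pk": is_potential_pk,
        "duplicates": duplicates,
    }
```

A column of 100 IDs with one repeated value has a ratio of 0.99, so it is flagged as a potential PK with one duplicate; a two-value category column is not flagged and reports no duplicates.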

Type Hints

  • Full type annotation support using from __future__ import annotations
  • All public methods have comprehensive docstrings with examples

Resolves

#34 - Implement missing QualityScorer module for Auto-Pilot mode

Checklist

  • Implementation complete
  • All tests passing (29/29)
  • Code coverage: 99%
  • Docstrings and examples added
  • Atomic commits created
  • Ready for review

🤖 Generated with Claude Code

AliiiBenn and others added 6 commits January 23, 2026 14:57
…dling

Implement structured exception hierarchy for excel-to-sql:

- ExcelToSqlError (base) - All custom exceptions inherit from this
- ExcelFileError - Excel file operation failures
- ConfigurationError - Configuration issues
- ValidationError - Data validation failures
- DatabaseError - Database operation failures

Features:
- Context dictionary for additional error information
- to_dict() method for serialization
- Rich string representation with context details

This enables better error handling, debugging, and user-friendly
error messages throughout the application.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
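
The hierarchy described in this commit could look roughly like the sketch below. The class names come from the commit message; the constructor signature, the keys returned by `to_dict()`, and the string format are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any


class ExcelToSqlError(Exception):
    """Base class; all custom exceptions inherit from it."""

    def __init__(self, message: str, context: dict[str, Any] | None = None) -> None:
        super().__init__(message)
        self.message = message
        self.context = dict(context or {})  # extra details for debugging

    def to_dict(self) -> dict[str, Any]:
        # Key names here are hypothetical, chosen for the example.
        return {
            "type": type(self).__name__,
            "message": self.message,
            "context": self.context,
        }

    def __str__(self) -> str:
        if not self.context:
            return self.message
        details = ", ".join(f"{k}={v!r}" for k, v in self.context.items())
        return f"{self.message} ({details})"


class ExcelFileError(ExcelToSqlError):
    """Excel file operation failures."""


class ConfigurationError(ExcelToSqlError):
    """Configuration issues."""


class ValidationError(ExcelToSqlError):
    """Data validation failures."""


class DatabaseError(ExcelToSqlError):
    """Database operation failures."""
```

Catching `ExcelToSqlError` then handles every custom type, while callers that care can still catch `ExcelFileError` or `DatabaseError` individually.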
Update ExcelFile class to raise custom ExcelFileError instead of
generic ValueError for better error handling:

- read() - Raises ExcelFileError with file_path and operation context
- read_all_sheets() - Specific error handling for empty/invalid files
- read_sheets() - Wraps errors with ExcelFileError

Improvements:
- Distinguish between EmptyDataError (empty file) and ParserError (invalid format)
- Include context (file_path, operation) for debugging
- Preserve FileNotFoundError and PermissionError as-is
- Chain original exceptions for full traceback

This allows CLI to provide specific error messages and tips for
common Excel file errors.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
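
The wrapping pattern this commit describes (preserve `FileNotFoundError` and `PermissionError`, wrap everything else with context, and chain the original exception) can be sketched as follows. `read_with_context` and its `loader` parameter are hypothetical, and the `ExcelFileError` here is a simplified stand-in with an assumed constructor.

```python
from __future__ import annotations

from typing import Any, Callable


class ExcelFileError(Exception):
    """Simplified stand-in for the PR's ExcelFileError (constructor assumed)."""

    def __init__(self, message: str, context: dict[str, Any] | None = None) -> None:
        super().__init__(message)
        self.context = dict(context or {})


def read_with_context(path: str, loader: Callable[[str], Any]) -> Any:
    """Call loader(path), wrapping unexpected failures in ExcelFileError."""
    try:
        return loader(path)
    except (FileNotFoundError, PermissionError):
        # Preserved as-is so callers can show a specific message.
        raise
    except Exception as exc:
        # "from exc" chains the original error, keeping the full traceback.
        raise ExcelFileError(
            f"Failed to read {path}",
            {"file_path": path, "operation": "read"},
        ) from exc
```

The original error stays reachable through `__cause__`, which is what lets a debug mode print the full chain.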
…able tips

Replace generic exception handlers with specific exception types
throughout the CLI:

Import Command:
- FileNotFoundError → "File not found" + tip to check path
- PermissionError → "Permission denied" + tip to check permissions
- EmptyDataError → "Empty Excel file" + tip to add data
- ParserError → "Invalid Excel format" + tip to check file type
- ConfigurationError → Config error + tip to check config files
- ValidationError → Validation error with details
- DatabaseError → Database error with context

Export Command:
- FileNotFoundError → "Table not found" + tip to import first
- PermissionError → "Permission denied" + tip to check write access
- DatabaseError → Database error with context

Magic Command:
- Improved error messages for file/sheet processing
- Better exception handling in interactive mode quality reports
- Replaced bare except: block with specific (AttributeError, TypeError)

Status Command:
- ConfigurationError for config-related failures

Additional:
- Added logger for unexpected errors
- All error messages follow consistent format with tips
- Debug mode shows full traceback on unexpected errors

Fixes #35

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
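
One way to keep the "message plus tip" format consistent is a lookup table from exception type to text, sketched below for two of the Import Command cases. The table `ERROR_HINTS`, the function `describe_error`, and the exact tip wording are hypothetical; only the short messages are taken from the commit message.

```python
from __future__ import annotations

# (exception type, short message, tip); tip wording is assumed.
ERROR_HINTS: list[tuple[type[BaseException], str, str]] = [
    (FileNotFoundError, "File not found", "Check that the path is correct."),
    (PermissionError, "Permission denied", "Check the file permissions."),
]


def describe_error(exc: BaseException) -> tuple[str, str]:
    """Return (message, tip) for a known exception, or a generic fallback."""
    for exc_type, message, tip in ERROR_HINTS:
        if isinstance(exc, exc_type):
            return message, tip
    # Unknown errors get a generic message; details belong in the log.
    return "Unexpected error", "Re-run in debug mode to see the full traceback."
```

Entries are checked in order, so more specific exception types should appear before their base classes.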
Add 25 tests covering the custom exception hierarchy:

ExcelToSqlError (base):
- Base exception creation with and without context
- to_dict() serialization

ExcelFileError:
- Creation with file_path and operation
- Context dictionary inclusion
- to_dict() serialization

ConfigurationError:
- Creation with config_file and config_key
- Full context handling

ValidationError:
- Creation with field, value, and rule
- Full context handling

DatabaseError:
- Creation with table, operation, and sql_error
- Full context handling

Exception Hierarchy:
- All exceptions inherit from ExcelToSqlError
- Base exception catches all custom types
- Specific exception types can be caught individually
- Exception chaining preserves original traceback

All tests pass (25/25).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement comprehensive quality scoring system for pandas DataFrames with:

- Quality score calculation (0-100) based on multiple factors:
  - Null value percentage deduction
  - Duplicate detection in potential primary keys
  - Empty column detection
  - Statistical outlier detection (3-sigma rule)

- Letter grade assignment (A-D, F)
- Detailed issue reporting with actionable recommendations
- Per-column statistics:
  - Data type, null count/percentage
  - Unique count/percentage
  - Sample values
  - Primary key potential detection
  - Empty column flag

- Configurable quality thresholds
- Comprehensive docstrings with examples

Resolves: #34

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add 29 tests covering all QualityScorer functionality:

- Quality report generation (basic, high quality, with issues)
- Empty DataFrame handling
- Duplicate detection in potential PKs
- Empty column detection
- Outlier detection using 3-sigma rule
- Letter grade assignment (A-D, F)
- Column statistics (nulls, uniques, types, samples)
- Primary key potential detection
- Score calculation:
  - Perfect data scoring
  - Null value deductions
  - Duplicate deductions
  - Empty column deductions
  - Score floor at 0
- Configuration (default/custom thresholds)
- Outlier detection edge cases (insufficient data, all null)
- Type hints and docstrings
- Integration tests with realistic data

All tests passing with 99% code coverage for quality module.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>