feat: add QualityScorer module for automatic data quality assessment (#34) #46
Open
…dling

Implement structured exception hierarchy for excel-to-sql:
- ExcelToSqlError (base) - All custom exceptions inherit from this
- ExcelFileError - Excel file operation failures
- ConfigurationError - Configuration issues
- ValidationError - Data validation failures
- DatabaseError - Database operation failures

Features:
- Context dictionary for additional error information
- to_dict() method for serialization
- Rich string representation with context details

This enables better error handling, debugging, and user-friendly error messages throughout the application.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
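A minimal sketch of how such a hierarchy could look, based only on the class names and features listed in the commit message above. The constructor signature, the keys returned by `to_dict()`, and the string format are assumptions for illustration, not the code in this PR.

```python
from __future__ import annotations

from typing import Any


class ExcelToSqlError(Exception):
    """Base class; every custom exception inherits from it."""

    def __init__(self, message: str, context: dict[str, Any] | None = None) -> None:
        super().__init__(message)
        self.message = message
        # Copy the context so later mutation by the caller does not leak in.
        self.context = dict(context or {})

    def to_dict(self) -> dict[str, Any]:
        # Assumed serialization shape; the real keys may differ.
        return {
            "error_type": type(self).__name__,
            "message": self.message,
            "context": self.context,
        }

    def __str__(self) -> str:
        if not self.context:
            return self.message
        details = ", ".join(f"{key}={value!r}" for key, value in self.context.items())
        return f"{self.message} ({details})"


class ExcelFileError(ExcelToSqlError):
    """Excel file operation failures."""


class ConfigurationError(ExcelToSqlError):
    """Configuration issues."""


class ValidationError(ExcelToSqlError):
    """Data validation failures."""


class DatabaseError(ExcelToSqlError):
    """Database operation failures."""


err = ExcelFileError(
    "Cannot read workbook",
    {"file_path": "data.xlsx", "operation": "read"},
)
```

Because every class derives from `ExcelToSqlError`, a caller can catch the base class to handle all custom errors, or a subclass to handle one category.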
Update ExcelFile class to throw custom ExcelFileError instead of generic ValueError for better error handling:
- read() - Throws ExcelFileError with file_path and operation context
- read_all_sheets() - Specific error handling for empty/invalid files
- read_sheets() - Wraps errors with ExcelFileError

Improvements:
- Distinguish between EmptyDataError (empty file) and ParserError (invalid format)
- Include context (file_path, operation) for debugging
- Preserve FileNotFoundError and PermissionError as-is
- Chain original exceptions for full traceback

This allows CLI to provide specific error messages and tips for common Excel file errors.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
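The wrapping and chaining behaviour described in this commit can be sketched as follows. This is an illustration under assumptions: `read_with_context` and its `reader` parameter are hypothetical names, and `ExcelFileError` is reduced to a stand-in so the block runs on its own. Only `pandas.errors.EmptyDataError` and `pandas.errors.ParserError` are real pandas names.

```python
from pandas.errors import EmptyDataError, ParserError


class ExcelFileError(Exception):
    """Stand-in for the PR's ExcelFileError (hypothetical minimal version)."""

    def __init__(self, message, context=None):
        super().__init__(message)
        self.context = dict(context or {})


def read_with_context(file_path, reader, operation="read"):
    """Call reader(file_path) and translate parse failures into ExcelFileError."""
    context = {"file_path": file_path, "operation": operation}
    try:
        return reader(file_path)
    except (FileNotFoundError, PermissionError):
        # Preserved as-is so the caller can give a specific message.
        raise
    except EmptyDataError as exc:
        # "from exc" chains the original exception for the full traceback.
        raise ExcelFileError("Excel file is empty", context) from exc
    except ParserError as exc:
        raise ExcelFileError("Invalid Excel format", context) from exc


def _empty_reader(path):
    # Simulates a reader that hits an empty file.
    raise EmptyDataError("No columns to parse from file")


caught = None
try:
    read_with_context("empty.xlsx", _empty_reader)
except ExcelFileError as err:
    caught = err
```

After the call, `caught.__cause__` holds the original `EmptyDataError`, which is what keeps the underlying traceback available for debugging.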
…able tips

Replace generic exception handlers with specific exception types throughout the CLI:

Import Command:
- FileNotFoundError → "File not found" + tip to check path
- PermissionError → "Permission denied" + tip to check permissions
- EmptyDataError → "Empty Excel file" + tip to add data
- ParserError → "Invalid Excel format" + tip to check file type
- ConfigurationError → Config error + tip to check config files
- ValidationError → Validation error with details
- DatabaseError → Database error with context

Export Command:
- FileNotFoundError → "Table not found" + tip to import first
- PermissionError → "Permission denied" + tip to check write access
- DatabaseError → Database error with context

Magic Command:
- Improved error messages for file/sheet processing
- Better exception handling in interactive mode quality reports
- Replaced bare except: block with specific (AttributeError, TypeError)

Status Command:
- ConfigurationError for config-related failures

Additional:
- Added logger for unexpected errors
- All error messages follow consistent format with tips
- Debug mode shows full traceback on unexpected errors

Fixes #35

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add 25 tests covering the custom exception hierarchy:

ExcelToSqlError (base):
- Base exception creation with and without context
- to_dict() serialization

ExcelFileError:
- Creation with file_path and operation
- Context dictionary inclusion
- to_dict() serialization

ConfigurationError:
- Creation with config_file and config_key
- Full context handling

ValidationError:
- Creation with field, value, and rule
- Full context handling

DatabaseError:
- Creation with table, operation, and sql_error
- Full context handling

Exception Hierarchy:
- All exceptions inherit from ExcelToSqlError
- Base exception catches all custom types
- Specific exception types can be caught individually
- Exception chaining preserves original traceback

All tests pass (25/25).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement comprehensive quality scoring system for pandas DataFrames with:
- Quality score calculation (0-100) based on multiple factors:
  - Null value percentage deduction
  - Duplicate detection in potential primary keys
  - Empty column detection
  - Statistical outlier detection (3-sigma rule)
- Letter grade assignment (A-D, F)
- Detailed issue reporting with actionable recommendations
- Per-column statistics:
  - Data type, null count/percentage
  - Unique count/percentage
  - Sample values
  - Primary key potential detection
  - Empty column flag
- Configurable quality thresholds
- Comprehensive docstrings with examples

Resolves: #34

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add 29 tests covering all QualityScorer functionality:
- Quality report generation (basic, high quality, with issues)
- Empty DataFrame handling
- Duplicate detection in potential PKs
- Empty column detection
- Outlier detection using 3-sigma rule
- Letter grade assignment (A-D, F)
- Column statistics (nulls, uniques, types, samples)
- Primary key potential detection
- Score calculation:
  - Perfect data scoring
  - Null value deductions
  - Duplicate deductions
  - Empty column deductions
  - Score floor at 0
- Configuration (default/custom thresholds)
- Outlier detection edge cases (insufficient data, all null)
- Type hints and docstrings
- Integration tests with realistic data

All tests passing with 99% code coverage for quality module.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary
QualityScorer module for automatic data quality assessment of pandas DataFrames
Implementation Details
QualityScorer Class (excel_to_sql/auto_pilot/quality.py)
Core Features:
Quality Score (0-100): Starts at 100, deducts points for:
Letter Grade Assignment:
Issue Detection:
Per-Column Statistics:
API Usage:
Configuration:
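The scoring scheme outlined above (start at 100, deduct points, floor at 0, map to a letter grade) can be sketched roughly as below. All weights and grade cut-offs are hypothetical, since the PR text does not state the real values. `quality_score` and `letter_grade` are illustrative names, not the module's API, and duplicate detection is simplified to whole-row duplicates, not potential primary key columns.

```python
import pandas as pd

# Hypothetical weights and cut-offs, chosen only for illustration.
NULL_WEIGHT = 0.5           # points per percentage point of null cells
EMPTY_COLUMN_PENALTY = 10   # points per completely empty column
DUPLICATE_PENALTY = 15      # flat deduction when duplicate rows exist
GRADE_CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def quality_score(df: pd.DataFrame) -> float:
    """Start at 100 and deduct points; never return less than 0."""
    score = 100.0
    if df.size:
        null_pct = 100.0 * df.isna().sum().sum() / df.size
        score -= NULL_WEIGHT * null_pct
    empty_columns = [col for col in df.columns if df[col].isna().all()]
    score -= EMPTY_COLUMN_PENALTY * len(empty_columns)
    if df.duplicated().any():
        score -= DUPLICATE_PENALTY
    return max(0.0, float(score))


def letter_grade(score: float) -> str:
    """Map a 0-100 score to A-D, or F below the lowest cut-off."""
    for cutoff, grade in GRADE_CUTOFFS:
        if score >= cutoff:
            return grade
    return "F"


clean = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
messy = pd.DataFrame({"id": [1, 1, None, None], "empty": [None] * 4})

clean_score = quality_score(clean)
messy_score = quality_score(messy)
```

With these assumed weights, the clean frame keeps the full score, while the frame with nulls, an empty column, and repeated rows loses points on all three counts.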
Testing (tests/test_quality_scorer.py)
29 comprehensive tests covering:
Test Results: ✅ All 29 tests passing
Code Coverage: 99% for quality module
Technical Notes
Outlier Detection
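A minimal sketch of the 3-sigma rule named in the commit message, assuming sample standard deviation (the pandas default) and assuming that columns with too little data or no variation report no outliers. `three_sigma_outliers` is a hypothetical helper name, not necessarily how the module does it.

```python
import pandas as pd


def three_sigma_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying more than 3 standard deviations from the mean."""
    # Non-numeric entries become NaN and are dropped along with real nulls.
    values = pd.to_numeric(series, errors="coerce").dropna()
    if len(values) < 2:
        return values.iloc[0:0]  # insufficient data: no outliers
    mean = values.mean()
    std = values.std()
    if pd.isna(std) or std == 0:
        return values.iloc[0:0]  # constant column: no outliers
    return values[(values - mean).abs() > 3 * std]


readings = pd.Series([10] * 20 + [1000])
outliers = three_sigma_outliers(readings)
all_null = three_sigma_outliers(pd.Series([None, None]))
```

One property worth keeping in mind: with the sample standard deviation, no value in a column of 10 or fewer points can lie more than 3 sigma from the mean, so very small columns never produce outliers under this rule.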
Primary Key Detection
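One plausible criterion for a potential primary key is a non-empty column with no nulls and no repeated values. The sketch below assumes exactly that; the module's real rule may differ (for example, it could use a uniqueness threshold), and `is_potential_primary_key` is a hypothetical name.

```python
import pandas as pd


def is_potential_primary_key(series: pd.Series) -> bool:
    """True when the column is non-empty, has no nulls, and every value is unique."""
    return bool(len(series) > 0 and series.notna().all() and series.is_unique)


df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "city": ["a", "a", "b"],   # repeated value
        "code": ["x", None, "z"],  # contains a null
    }
)
candidates = [col for col in df.columns if is_potential_primary_key(df[col])]
```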
Type Hints
from __future__ import annotations
Resolves
#34 - Implement missing QualityScorer module for Auto-Pilot mode
Checklist
🤖 Generated with Claude Code