Skip to content

#2: Validate Command #2

@AliiiBenn

Description

@AliiiBenn

Issue #2: Validate Command

Problem

Users can unknowingly attempt to import Excel files that contain:

  • Missing or misnamed columns
  • Incorrect data types (text in numeric columns, etc.)
  • Duplicate primary keys
  • Null values in required fields
  • Invalid date formats
  • Empty sheets or missing required sheets

This leads to:

  • Failed imports with cryptic SQL errors
  • Partial data imports (some tables succeed, others fail)
  • Data corruption (wrong types silently converted)
  • Time wasted debugging import failures

Currently, users only discover these issues during the import process, which can fail halfway through after processing thousands of rows.

Proposed Solution

Implement a wareflow validate command that performs comprehensive data validation BEFORE import, providing clear, actionable error messages.

Command Signature

wareflow validate [--file PATH] [--all] [--strict] [--fix]

Options

Option Description Default
--file PATH Validate specific Excel file All files in data/
--all Validate all Excel files in data/ True
--strict Fail on warnings (not just errors) False
--fix Attempt auto-fix for simple issues False

Validation Checks

1. File-Level Checks

✓ Checking file: data/produits.xlsx
  ✓ File exists
  ✓ File is readable
  ✓ File format: Excel (.xlsx)
  ✓ File size: 145 KB

2. Sheet-Level Checks

✓ Checking sheets...
  ✓ Sheet 'produits' exists
  ✓ Sheet 'produits' has data (1,234 rows)
  ⚠️  Sheet 'metadata' found (will be ignored)

3. Column-Level Checks

✓ Checking columns for sheet 'produits'...
  Required columns (schema.sql):
    ✓ no_produit (INTEGER)
    ✓ nom_produit (TEXT)
    ✓ categorie (TEXT)
    ✓ prix_unitaire (REAL)
    ✓ stock_actuel (INTEGER)
    ✓ stock_minimal (INTEGER)
    ✓ emplacement (TEXT)
    ✓ fournisseur (TEXT)

  All required columns present ✅

  Extra columns found:
    - barcode (TEXT) - will be ignored
    - poids (REAL) - will be ignored

4. Data Type Checks

✓ Checking data types...
  ✓ no_produit: all INTEGER values
  ✓ prix_unitaire: all REAL values (3 null values found)
  ⚠️  stock_actuel: 12 rows have non-INTEGER values
  ✓ stock_minimal: all INTEGER values

5. Primary Key Checks

✓ Checking primary keys...
  ✓ no_produit: 1,234 unique values
  ⚠️  no_produit: 23 duplicate values found:
    - 1001 (appears 3 times at rows 45, 467, 891)
    - 1045 (appears 2 times at rows 123, 456)
    - ...

6. Null Value Checks

✓ Checking null values in required fields...
  ⚠️  nom_produit: 5 null values at rows 234, 567, 789, 901, 1234
  ✓ categorie: no null values
  ✓ stock_actuel: no null values

7. Data Range Checks

✓ Checking data ranges...
  ✓ prix_unitaire: all values >= 0
  ⚠️  prix_unitaire: 8 values > 10,000 (possible errors)
  ✓ stock_actuel: all values >= 0

8. Date Format Checks

✓ Checking date formats (if applicable)...
  ✓ date_heure: all valid ISO 8601 dates
  ⚠️  date_creation: 45 rows have inconsistent formats

Output Examples

Success Case:

✅ Validation Passed

All files are valid and ready for import.

Files validated: 3
Total rows: 47,701
Errors found: 0
Warnings: 0

💡 Next step:
  Run 'wareflow import' to import data

Warnings Only:

⚠️  Validation Passed with Warnings

Files validated: 3
Total rows: 47,701
Errors: 0
Warnings: 12

Summary:
  produits.xlsx: 3 warnings
  mouvements.xlsx: 8 warnings
  commandes.xlsx: 1 warning

Run with --strict to fail on warnings
Review warnings above before importing

💡 Next step:
  Run 'wareflow import' to import (warnings will be ignored)
  Or run 'wareflow validate --fix' to attempt auto-fix

Errors Found:

❌ Validation Failed

Files validated: 3
Errors found: 8
Warnings: 5

ERRORS (must fix before import):

1. produits.xlsx - Missing required columns
   Missing: fournisseur
   Found: barcode, poids, commentaire

2. produits.xlsx - Primary key duplicates
   23 duplicate no_produit values found

3. mouvements.xlsx - Data type mismatch
   Column: quantite
   Expected: INTEGER
   Found: 1,234 TEXT values

4. commandes.xlsx - Empty sheet
   Sheet: commandes_lines has 0 rows

💡 Solutions:
  1. Fix column names in Excel files
  2. Remove duplicate primary keys
  3. Correct data types in Excel
  4. Ensure all sheets have data
  5. Run 'wareflow validate --fix' for auto-fixable issues
  6. Re-run validation after fixes

❌ Cannot proceed with import until errors are fixed

Auto-Fix Mode:

✓ Running auto-fix...

Fixed issues:
  ✓ Trimmed whitespace in 1,234 cells
  ✓ Standardized date formats (45 cells)
  ✓ Removed duplicate rows (23 rows)
  ✓ Converted text numbers to integers (156 cells)

Remaining issues: 3 (require manual fixes)

⚠️  Auto-fix completed with warnings
  Review changes before importing
  Original files backed up to: data/.backup/

Implementation Requirements

File Structure

src/wareflow_analysis/
├── cli.py                      # Add validate command
└── commands/
    └── validate/               # New module
        ├── __init__.py
        ├── validator.py        # Main validation orchestration
        ├── checks.py           # Individual validation checks
        ├── schema_loader.py    # Load schema.sql for requirements
        ├── excel_reader.py     # Read Excel files
        ├── auto_fix.py         # Auto-fix capabilities
        └── reporters.py        # Formatted output

Key Components

1. Schema Loader

class SchemaLoader:
    def load_schema(schema_path: Path) -> Dict
    def extract_table_requirements(schema_sql: str) -> Dict
    # Returns:
    # {
    #     "produits": {
    #         "columns": ["no_produit", "nom_produit", ...],
    #         "primary_key": "no_produit",
    #         "required": ["no_produit", "nom_produit"],
    #         "types": {"no_produit": "INTEGER", ...}
    #     }
    # }

2. Excel Validator

class ExcelValidator:
    def validate_file(file_path: Path, schema: Dict) -> ValidationResult
    def check_sheets_exist(df_dict: pd.DataFrame, required: List[str])
    def check_columns(df: pd.DataFrame, table_schema: Dict)
    def check_data_types(df: pd.DataFrame, column_types: Dict)
    def check_primary_keys(df: pd.DataFrame, pk_column: str)
    def check_null_values(df: pd.DataFrame, required_columns: List[str])
    def check_data_ranges(df: pd.DataFrame, rules: Dict)
    def check_dates_format(df: pd.DataFrame, date_columns: List[str])

3. Validation Result

@dataclass
class ValidationResult:
    is_valid: bool
    errors: List[ValidationError]
    warnings: List[ValidationWarning]
    file_info: FileInfo
    summary: ValidationSummary

@dataclass
class ValidationError:
    sheet: str
    row: Optional[int]  # None if sheet-level error
    column: Optional[str]
    severity: str  # "error" or "warning"
    code: str  # "MISSING_COLUMN", "DUPLICATE_PK", etc.
    message: str
    suggestion: str

4. Auto-Fixer

class AutoFixer:
    def fix_whitespace(df: pd.DataFrame) -> pd.DataFrame
    def fix_date_formats(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame
    def fix_numeric_strings(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame
    def remove_duplicate_rows(df: pd.DataFrame, pk_column: str) -> pd.DataFrame
    def backup_file(file_path: Path) -> Path
    def apply_fixes(file_path: Path, fixes: List[Fix]) -> bool

5. Error Codes

ERROR_CODES = {
    "FILE_NOT_FOUND": "Excel file not found",
    "NOT_EXCEL_FILE": "File is not a valid Excel file",
    "EMPTY_FILE": "Excel file is empty",
    "SHEET_MISSING": "Required sheet not found",
    "SHEET_EMPTY": "Sheet has no data",
    "COLUMN_MISSING": "Required column not found",
    "COLUMN_EXTRA": "Extra column will be ignored",
    "TYPE_MISMATCH": "Data type doesn't match schema",
    "DUPLICATE_PK": "Duplicate primary key values",
    "NULL_REQUIRED": "Null value in required field",
    "VALUE_OUT_OF_RANGE": "Value outside valid range",
    "INVALID_DATE": "Invalid date format",
    "INCONSISTENT_DATE": "Inconsistent date formats"
}

Acceptance Criteria

Functionality

  • Validates all Excel files in data/ by default
  • Validates single file with --file option
  • Checks all sheets exist and are non-empty
  • Validates all required columns present
  • Validates primary key uniqueness
  • Checks data types match schema
  • Detects null values in required fields
  • Validates data ranges (numeric >= 0, etc.)
  • Checks date format consistency
  • Reports errors and warnings clearly
  • Shows row-level error locations
  • Provides actionable suggestions for each error

Auto-Fix

  • --fix flag attempts auto-fix for simple issues
  • Trims whitespace in cells
  • Standardizes date formats
  • Converts text numbers to numeric types
  • Removes exact duplicate rows
  • Creates backup before modifying files
  • Reports which fixes were applied

Output

  • Color-coded output (error ❌, warning ⚠️, success ✅)
  • Summary with error/warning counts
  • Detailed per-file breakdown
  • Clear next steps/suggestions
  • Exit code 0 on success, 1 on errors

Performance

  • Validates large files (>10k rows) in < 5 seconds
  • Minimal memory usage (streams large files)
  • Progress indicator for long validations

Cross-Platform

  • Works on Windows, macOS, Linux
  • Handles different Excel date encoding
  • Handles different line endings in text exports

Technical Notes

Schema Parsing

Parse schema.sql to extract requirements:

CREATE TABLE produits (
    no_produit INTEGER PRIMARY KEY,
    nom_produit TEXT NOT NULL,
    ...
);

Extract:

  • Table name: produits
  • Primary key: no_produit
  • Required columns: no_produit, nom_produit (NOT NULL)
  • Column types: INTEGER, TEXT, etc.

Validation Flow

1. Load schema.sql
2. Parse table requirements
3. For each Excel file:
   a. Check file exists and readable
   b. Load all sheets with pandas
   c. For each sheet:
      - Check if it maps to a table
      - Validate columns
      - Validate data types
      - Validate primary keys
      - Validate null values
      - Validate data ranges
   d. Collect errors and warnings
4. Generate summary report
5. If --fix: apply auto-fixes and re-validate

Excel Sheet Mapping

Map Excel filenames to table names:

  • produits.xlsxproduits table
  • mouvements.xlsxmouvements table
  • commandes.xlsxcommandes table

Or load mapping from config.yaml:

tables:
  produits:
    source: data/produits.xlsx
    sheet: produits
  mouvements:
    source: data/mouvements.xlsx
    sheet: mouvements

Dependencies

  • pandas - Data manipulation
  • openpyxl - Excel file reading
  • Existing schema.sql - Defines validation rules

Out of Scope

  • Modifying Excel files (except auto-fix mode with backup)
  • Validating business logic (e.g., stock levels, pricing)
  • Schema validation (syntax of schema.sql itself)
  • Data quality recommendations beyond basic checks

Testing Requirements

# tests/test_validate.py
def test_validate_all_files_success()
def test_validate_single_file()
def test_validate_missing_file()
def test_validate_missing_sheet()
def test_validate_missing_columns()
def test_validate_duplicate_primary_keys()
def test_validate_type_mismatch()
def test_validate_null_values()
def test_validate_strict_mode_warnings_as_errors()
def test_auto_fix_whitespace()
def test_auto_fix_dates()
def test_auto_fix_creates_backup()
def test_validate_empty_sheet()
def test_validate_extra_columns_warning()
def test_validate_date_format_check()
def test_validate_numeric_range_check()

Priority

Priority: HIGH - Prevents import failures and data corruption

Estimated Effort: 6-8 hours

Complexity: MEDIUM - Requires schema parsing, data type validation, and error reporting

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions