-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Issue #2: Validate Command
Problem
Users can unknowingly attempt to import Excel files that contain:
- Missing or misnamed columns
- Incorrect data types (text in numeric columns, etc.)
- Duplicate primary keys
- Null values in required fields
- Invalid date formats
- Empty sheets or missing required sheets
This leads to:
- Failed imports with cryptic SQL errors
- Partial data imports (some tables succeed, others fail)
- Data corruption (wrong types silently converted)
- Time wasted debugging import failures
Currently, users only discover these issues during the import process, which can fail halfway through after processing thousands of rows.
Proposed Solution
Implement a wareflow validate command that performs comprehensive data validation BEFORE import, providing clear, actionable error messages.
Command Signature
wareflow validate [--file PATH] [--all] [--strict] [--fix]Options
| Option | Description | Default |
|---|---|---|
--file PATH |
Validate specific Excel file | All files in data/ |
--all |
Validate all Excel files in data/ | True |
--strict |
Fail on warnings (not just errors) | False |
--fix |
Attempt auto-fix for simple issues | False |
Validation Checks
1. File-Level Checks
✓ Checking file: data/produits.xlsx
✓ File exists
✓ File is readable
✓ File format: Excel (.xlsx)
✓ File size: 145 KB
2. Sheet-Level Checks
✓ Checking sheets...
✓ Sheet 'produits' exists
✓ Sheet 'produits' has data (1,234 rows)
⚠️ Sheet 'metadata' found (will be ignored)
3. Column-Level Checks
✓ Checking columns for sheet 'produits'...
Required columns (schema.sql):
✓ no_produit (INTEGER)
✓ nom_produit (TEXT)
✓ categorie (TEXT)
✓ prix_unitaire (REAL)
✓ stock_actuel (INTEGER)
✓ stock_minimal (INTEGER)
✓ emplacement (TEXT)
✓ fournisseur (TEXT)
All required columns present ✅
Extra columns found:
- barcode (TEXT) - will be ignored
- poids (REAL) - will be ignored
4. Data Type Checks
✓ Checking data types...
✓ no_produit: all INTEGER values
✓ prix_unitaire: all REAL values (3 null values found)
⚠️ stock_actuel: 12 rows have non-INTEGER values
✓ stock_minimal: all INTEGER values
5. Primary Key Checks
✓ Checking primary keys...
✓ no_produit: 1,234 unique values
⚠️ no_produit: 23 duplicate values found:
- 1001 (appears 3 times at rows 45, 467, 891)
- 1045 (appears 2 times at rows 123, 456)
- ...
6. Null Value Checks
✓ Checking null values in required fields...
⚠️ nom_produit: 5 null values at rows 234, 567, 789, 901, 1234
✓ categorie: no null values
✓ stock_actuel: no null values
7. Data Range Checks
✓ Checking data ranges...
✓ prix_unitaire: all values >= 0
⚠️ prix_unitaire: 8 values > 10,000 (possible errors)
✓ stock_actuel: all values >= 0
8. Date Format Checks
✓ Checking date formats (if applicable)...
✓ date_heure: all valid ISO 8601 dates
⚠️ date_creation: 45 rows have inconsistent formats
Output Examples
Success Case:
✅ Validation Passed
All files are valid and ready for import.
Files validated: 3
Total rows: 47,701
Errors found: 0
Warnings: 0
💡 Next step:
Run 'wareflow import' to import data
Warnings Only:
⚠️ Validation Passed with Warnings
Files validated: 3
Total rows: 47,701
Errors: 0
Warnings: 12
Summary:
produits.xlsx: 3 warnings
mouvements.xlsx: 8 warnings
commandes.xlsx: 1 warning
Run with --strict to fail on warnings
Review warnings above before importing
💡 Next step:
Run 'wareflow import' to import (warnings will be ignored)
Or run 'wareflow validate --fix' to attempt auto-fix
Errors Found:
❌ Validation Failed
Files validated: 3
Errors found: 8
Warnings: 5
ERRORS (must fix before import):
1. produits.xlsx - Missing required columns
Missing: fournisseur
Found: barcode, poids, commentaire
2. produits.xlsx - Primary key duplicates
23 duplicate no_produit values found
3. mouvements.xlsx - Data type mismatch
Column: quantite
Expected: INTEGER
Found: 1,234 TEXT values
4. commandes.xlsx - Empty sheet
Sheet: commandes_lines has 0 rows
💡 Solutions:
1. Fix column names in Excel files
2. Remove duplicate primary keys
3. Correct data types in Excel
4. Ensure all sheets have data
5. Run 'wareflow validate --fix' for auto-fixable issues
6. Re-run validation after fixes
❌ Cannot proceed with import until errors are fixed
Auto-Fix Mode:
✓ Running auto-fix...
Fixed issues:
✓ Trimmed whitespace in 1,234 cells
✓ Standardized date formats (45 cells)
✓ Removed duplicate rows (23 rows)
✓ Converted text numbers to integers (156 cells)
Remaining issues: 3 (require manual fixes)
⚠️ Auto-fix completed with warnings
Review changes before importing
Original files backed up to: data/.backup/
Implementation Requirements
File Structure
src/wareflow_analysis/
├── cli.py # Add validate command
└── commands/
└── validate/ # New module
├── __init__.py
├── validator.py # Main validation orchestration
├── checks.py # Individual validation checks
├── schema_loader.py # Load schema.sql for requirements
├── excel_reader.py # Read Excel files
├── auto_fix.py # Auto-fix capabilities
└── reporters.py # Formatted output
Key Components
1. Schema Loader
class SchemaLoader:
def load_schema(schema_path: Path) -> Dict
def extract_table_requirements(schema_sql: str) -> Dict
# Returns:
# {
# "produits": {
# "columns": ["no_produit", "nom_produit", ...],
# "primary_key": "no_produit",
# "required": ["no_produit", "nom_produit"],
# "types": {"no_produit": "INTEGER", ...}
# }
# }2. Excel Validator
class ExcelValidator:
def validate_file(file_path: Path, schema: Dict) -> ValidationResult
def check_sheets_exist(df_dict: pd.DataFrame, required: List[str])
def check_columns(df: pd.DataFrame, table_schema: Dict)
def check_data_types(df: pd.DataFrame, column_types: Dict)
def check_primary_keys(df: pd.DataFrame, pk_column: str)
def check_null_values(df: pd.DataFrame, required_columns: List[str])
def check_data_ranges(df: pd.DataFrame, rules: Dict)
def check_dates_format(df: pd.DataFrame, date_columns: List[str])3. Validation Result
@dataclass
class ValidationResult:
is_valid: bool
errors: List[ValidationError]
warnings: List[ValidationWarning]
file_info: FileInfo
summary: ValidationSummary
@dataclass
class ValidationError:
sheet: str
row: Optional[int] # None if sheet-level error
column: Optional[str]
severity: str # "error" or "warning"
code: str # "MISSING_COLUMN", "DUPLICATE_PK", etc.
message: str
suggestion: str4. Auto-Fixer
class AutoFixer:
def fix_whitespace(df: pd.DataFrame) -> pd.DataFrame
def fix_date_formats(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame
def fix_numeric_strings(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame
def remove_duplicate_rows(df: pd.DataFrame, pk_column: str) -> pd.DataFrame
def backup_file(file_path: Path) -> Path
def apply_fixes(file_path: Path, fixes: List[Fix]) -> bool5. Error Codes
ERROR_CODES = {
"FILE_NOT_FOUND": "Excel file not found",
"NOT_EXCEL_FILE": "File is not a valid Excel file",
"EMPTY_FILE": "Excel file is empty",
"SHEET_MISSING": "Required sheet not found",
"SHEET_EMPTY": "Sheet has no data",
"COLUMN_MISSING": "Required column not found",
"COLUMN_EXTRA": "Extra column will be ignored",
"TYPE_MISMATCH": "Data type doesn't match schema",
"DUPLICATE_PK": "Duplicate primary key values",
"NULL_REQUIRED": "Null value in required field",
"VALUE_OUT_OF_RANGE": "Value outside valid range",
"INVALID_DATE": "Invalid date format",
"INCONSISTENT_DATE": "Inconsistent date formats"
}Acceptance Criteria
Functionality
- Validates all Excel files in
data/by default - Validates single file with
--fileoption - Checks all sheets exist and are non-empty
- Validates all required columns present
- Validates primary key uniqueness
- Checks data types match schema
- Detects null values in required fields
- Validates data ranges (numeric >= 0, etc.)
- Checks date format consistency
- Reports errors and warnings clearly
- Shows row-level error locations
- Provides actionable suggestions for each error
Auto-Fix
-
--fixflag attempts auto-fix for simple issues - Trims whitespace in cells
- Standardizes date formats
- Converts text numbers to numeric types
- Removes exact duplicate rows
- Creates backup before modifying files
- Reports which fixes were applied
Output
- Color-coded output (error ❌, warning
⚠️ , success ✅) - Summary with error/warning counts
- Detailed per-file breakdown
- Clear next steps/suggestions
- Exit code 0 on success, 1 on errors
Performance
- Validates large files (>10k rows) in < 5 seconds
- Minimal memory usage (streams large files)
- Progress indicator for long validations
Cross-Platform
- Works on Windows, macOS, Linux
- Handles different Excel date encoding
- Handles different line endings in text exports
Technical Notes
Schema Parsing
Parse schema.sql to extract requirements:
CREATE TABLE produits (
no_produit INTEGER PRIMARY KEY,
nom_produit TEXT NOT NULL,
...
);Extract:
- Table name:
produits - Primary key:
no_produit - Required columns:
no_produit,nom_produit(NOT NULL) - Column types:
INTEGER,TEXT, etc.
Validation Flow
1. Load schema.sql
2. Parse table requirements
3. For each Excel file:
a. Check file exists and readable
b. Load all sheets with pandas
c. For each sheet:
- Check if it maps to a table
- Validate columns
- Validate data types
- Validate primary keys
- Validate null values
- Validate data ranges
d. Collect errors and warnings
4. Generate summary report
5. If --fix: apply auto-fixes and re-validate
Excel Sheet Mapping
Map Excel filenames to table names:
produits.xlsx→produitstablemouvements.xlsx→mouvementstablecommandes.xlsx→commandestable
Or load mapping from config.yaml:
tables:
produits:
source: data/produits.xlsx
sheet: produits
mouvements:
source: data/mouvements.xlsx
sheet: mouvementsDependencies
pandas- Data manipulationopenpyxl- Excel file reading- Existing
schema.sql- Defines validation rules
Out of Scope
- Modifying Excel files (except auto-fix mode with backup)
- Validating business logic (e.g., stock levels, pricing)
- Schema validation (syntax of schema.sql itself)
- Data quality recommendations beyond basic checks
Testing Requirements
# tests/test_validate.py
def test_validate_all_files_success()
def test_validate_single_file()
def test_validate_missing_file()
def test_validate_missing_sheet()
def test_validate_missing_columns()
def test_validate_duplicate_primary_keys()
def test_validate_type_mismatch()
def test_validate_null_values()
def test_validate_strict_mode_warnings_as_errors()
def test_auto_fix_whitespace()
def test_auto_fix_dates()
def test_auto_fix_creates_backup()
def test_validate_empty_sheet()
def test_validate_extra_columns_warning()
def test_validate_date_format_check()
def test_validate_numeric_range_check()Priority
Priority: HIGH - Prevents import failures and data corruption
Estimated Effort: 6-8 hours
Complexity: MEDIUM - Requires schema parsing, data type validation, and error reporting