#2: Validate Command

# Issue #2: Validate Command

## Problem

Users can unknowingly attempt to import Excel files that contain:
- Missing or misnamed columns
- Incorrect data types (text in numeric columns, etc.)
- Duplicate primary keys
- Null values in required fields
- Invalid date formats
- Empty sheets or missing required sheets

This leads to:
- Failed imports with cryptic SQL errors
- Partial data imports (some tables succeed, others fail)
- Data corruption (wrong types silently converted)
- Time wasted debugging import failures

Currently, users only discover these issues during the import process, which can fail halfway through after processing thousands of rows.

## Proposed Solution

Implement a `wareflow validate` command that performs comprehensive data validation BEFORE import, providing clear, actionable error messages.

### Command Signature

```bash
wareflow validate [--file PATH] [--all] [--strict] [--fix]
```

### Options

| Option | Description | Default |
|--------|-------------|---------|
| `--file PATH` | Validate specific Excel file | All files in data/ |
| `--all` | Validate all Excel files in data/ | True |
| `--strict` | Fail on warnings (not just errors) | False |
| `--fix` | Attempt auto-fix for simple issues | False |

### Validation Checks

#### 1. File-Level Checks
```
✓ Checking file: data/produits.xlsx
  ✓ File exists
  ✓ File is readable
  ✓ File format: Excel (.xlsx)
  ✓ File size: 145 KB
```

#### 2. Sheet-Level Checks
```
✓ Checking sheets...
  ✓ Sheet 'produits' exists
  ✓ Sheet 'produits' has data (1,234 rows)
  ⚠️  Sheet 'metadata' found (will be ignored)
```

#### 3. Column-Level Checks
```
✓ Checking columns for sheet 'produits'...
  Required columns (schema.sql):
    ✓ no_produit (INTEGER)
    ✓ nom_produit (TEXT)
    ✓ categorie (TEXT)
    ✓ prix_unitaire (REAL)
    ✓ stock_actuel (INTEGER)
    ✓ stock_minimal (INTEGER)
    ✓ emplacement (TEXT)
    ✓ fournisseur (TEXT)

  All required columns present ✅

  Extra columns found:
    - barcode (TEXT) - will be ignored
    - poids (REAL) - will be ignored
```

#### 4. Data Type Checks
```
✓ Checking data types...
  ✓ no_produit: all INTEGER values
  ✓ prix_unitaire: all REAL values (3 null values found)
  ⚠️  stock_actuel: 12 rows have non-INTEGER values
  ✓ stock_minimal: all INTEGER values
```

#### 5. Primary Key Checks
```
✓ Checking primary keys...
  ✓ no_produit: 1,234 unique values
  ⚠️  no_produit: 23 duplicate values found:
    - 1001 (appears 3 times at rows 45, 467, 891)
    - 1045 (appears 2 times at rows 123, 456)
    - ...
```

#### 6. Null Value Checks
```
✓ Checking null values in required fields...
  ⚠️  nom_produit: 5 null values at rows 234, 567, 789, 901, 1234
  ✓ categorie: no null values
  ✓ stock_actuel: no null values
```

#### 7. Data Range Checks
```
✓ Checking data ranges...
  ✓ prix_unitaire: all values >= 0
  ⚠️  prix_unitaire: 8 values > 10,000 (possible errors)
  ✓ stock_actuel: all values >= 0
```

#### 8. Date Format Checks
```
✓ Checking date formats (if applicable)...
  ✓ date_heure: all valid ISO 8601 dates
  ⚠️  date_creation: 45 rows have inconsistent formats
```

### Output Examples

**Success Case:**
```
✅ Validation Passed

All files are valid and ready for import.

Files validated: 3
Total rows: 47,701
Errors found: 0
Warnings: 0

💡 Next step:
  Run 'wareflow import' to import data
```

**Warnings Only:**
```
⚠️  Validation Passed with Warnings

Files validated: 3
Total rows: 47,701
Errors: 0
Warnings: 12

Summary:
  produits.xlsx: 3 warnings
  mouvements.xlsx: 8 warnings
  commandes.xlsx: 1 warning

Run with --strict to fail on warnings
Review warnings above before importing

💡 Next step:
  Run 'wareflow import' to import (warnings will be ignored)
  Or run 'wareflow validate --fix' to attempt auto-fix
```

**Errors Found:**
```
❌ Validation Failed

Files validated: 3
Errors found: 8
Warnings: 5

ERRORS (must fix before import):

1. produits.xlsx - Missing required columns
   Missing: fournisseur
   Found: barcode, poids, commentaire

2. produits.xlsx - Primary key duplicates
   23 duplicate no_produit values found

3. mouvements.xlsx - Data type mismatch
   Column: quantite
   Expected: INTEGER
   Found: 1,234 TEXT values

4. commandes.xlsx - Empty sheet
   Sheet: commandes_lines has 0 rows

💡 Solutions:
  1. Fix column names in Excel files
  2. Remove duplicate primary keys
  3. Correct data types in Excel
  4. Ensure all sheets have data
  5. Run 'wareflow validate --fix' for auto-fixable issues
  6. Re-run validation after fixes

❌ Cannot proceed with import until errors are fixed
```

**Auto-Fix Mode:**
```
✓ Running auto-fix...

Fixed issues:
  ✓ Trimmed whitespace in 1,234 cells
  ✓ Standardized date formats (45 cells)
  ✓ Removed duplicate rows (23 rows)
  ✓ Converted text numbers to integers (156 cells)

Remaining issues: 3 (require manual fixes)

⚠️  Auto-fix completed with warnings
  Review changes before importing
  Original files backed up to: data/.backup/
```

## Implementation Requirements

### File Structure
```
src/wareflow_analysis/
├── cli.py                      # Add validate command
└── commands/
    └── validate/               # New module
        ├── __init__.py
        ├── validator.py        # Main validation orchestration
        ├── checks.py           # Individual validation checks
        ├── schema_loader.py    # Load schema.sql for requirements
        ├── excel_reader.py     # Read Excel files
        ├── auto_fix.py         # Auto-fix capabilities
        └── reporters.py        # Formatted output
```

### Key Components

**1. Schema Loader**
```python
class SchemaLoader:
    def load_schema(schema_path: Path) -> Dict
    def extract_table_requirements(schema_sql: str) -> Dict
    # Returns:
    # {
    #     "produits": {
    #         "columns": ["no_produit", "nom_produit", ...],
    #         "primary_key": "no_produit",
    #         "required": ["no_produit", "nom_produit"],
    #         "types": {"no_produit": "INTEGER", ...}
    #     }
    # }
```

**2. Excel Validator**
```python
class ExcelValidator:
    def validate_file(file_path: Path, schema: Dict) -> ValidationResult
    def check_sheets_exist(df_dict: pd.DataFrame, required: List[str])
    def check_columns(df: pd.DataFrame, table_schema: Dict)
    def check_data_types(df: pd.DataFrame, column_types: Dict)
    def check_primary_keys(df: pd.DataFrame, pk_column: str)
    def check_null_values(df: pd.DataFrame, required_columns: List[str])
    def check_data_ranges(df: pd.DataFrame, rules: Dict)
    def check_dates_format(df: pd.DataFrame, date_columns: List[str])
```

**3. Validation Result**
```python
@dataclass
class ValidationResult:
    is_valid: bool
    errors: List[ValidationError]
    warnings: List[ValidationWarning]
    file_info: FileInfo
    summary: ValidationSummary

@dataclass
class ValidationError:
    sheet: str
    row: Optional[int]  # None if sheet-level error
    column: Optional[str]
    severity: str  # "error" or "warning"
    code: str  # "MISSING_COLUMN", "DUPLICATE_PK", etc.
    message: str
    suggestion: str
```

**4. Auto-Fixer**
```python
class AutoFixer:
    def fix_whitespace(df: pd.DataFrame) -> pd.DataFrame
    def fix_date_formats(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame
    def fix_numeric_strings(df: pd.DataFrame, columns: List[str]) -> pd.DataFrame
    def remove_duplicate_rows(df: pd.DataFrame, pk_column: str) -> pd.DataFrame
    def backup_file(file_path: Path) -> Path
    def apply_fixes(file_path: Path, fixes: List[Fix]) -> bool
```

**5. Error Codes**
```python
ERROR_CODES = {
    "FILE_NOT_FOUND": "Excel file not found",
    "NOT_EXCEL_FILE": "File is not a valid Excel file",
    "EMPTY_FILE": "Excel file is empty",
    "SHEET_MISSING": "Required sheet not found",
    "SHEET_EMPTY": "Sheet has no data",
    "COLUMN_MISSING": "Required column not found",
    "COLUMN_EXTRA": "Extra column will be ignored",
    "TYPE_MISMATCH": "Data type doesn't match schema",
    "DUPLICATE_PK": "Duplicate primary key values",
    "NULL_REQUIRED": "Null value in required field",
    "VALUE_OUT_OF_RANGE": "Value outside valid range",
    "INVALID_DATE": "Invalid date format",
    "INCONSISTENT_DATE": "Inconsistent date formats"
}
```

## Acceptance Criteria

### Functionality
- [ ] Validates all Excel files in `data/` by default
- [ ] Validates single file with `--file` option
- [ ] Checks all sheets exist and are non-empty
- [ ] Validates all required columns present
- [ ] Validates primary key uniqueness
- [ ] Checks data types match schema
- [ ] Detects null values in required fields
- [ ] Validates data ranges (numeric >= 0, etc.)
- [ ] Checks date format consistency
- [ ] Reports errors and warnings clearly
- [ ] Shows row-level error locations
- [ ] Provides actionable suggestions for each error

### Auto-Fix
- [ ] `--fix` flag attempts auto-fix for simple issues
- [ ] Trims whitespace in cells
- [ ] Standardizes date formats
- [ ] Converts text numbers to numeric types
- [ ] Removes exact duplicate rows
- [ ] Creates backup before modifying files
- [ ] Reports which fixes were applied

### Output
- [ ] Color-coded output (error ❌, warning ⚠️, success ✅)
- [ ] Summary with error/warning counts
- [ ] Detailed per-file breakdown
- [ ] Clear next steps/suggestions
- [ ] Exit code 0 on success, 1 on errors

### Performance
- [ ] Validates large files (>10k rows) in < 5 seconds
- [ ] Minimal memory usage (streams large files)
- [ ] Progress indicator for long validations

### Cross-Platform
- [ ] Works on Windows, macOS, Linux
- [ ] Handles different Excel date encoding
- [ ] Handles different line endings in text exports

## Technical Notes

### Schema Parsing
Parse `schema.sql` to extract requirements:
```sql
CREATE TABLE produits (
    no_produit INTEGER PRIMARY KEY,
    nom_produit TEXT NOT NULL,
    ...
);
```

Extract:
- Table name: `produits`
- Primary key: `no_produit`
- Required columns: `no_produit`, `nom_produit` (NOT NULL)
- Column types: `INTEGER`, `TEXT`, etc.

### Validation Flow
```
1. Load schema.sql
2. Parse table requirements
3. For each Excel file:
   a. Check file exists and readable
   b. Load all sheets with pandas
   c. For each sheet:
      - Check if it maps to a table
      - Validate columns
      - Validate data types
      - Validate primary keys
      - Validate null values
      - Validate data ranges
   d. Collect errors and warnings
4. Generate summary report
5. If --fix: apply auto-fixes and re-validate
```

### Excel Sheet Mapping
Map Excel filenames to table names:
- `produits.xlsx` → `produits` table
- `mouvements.xlsx` → `mouvements` table
- `commandes.xlsx` → `commandes` table

Or load mapping from `config.yaml`:
```yaml
tables:
  produits:
    source: data/produits.xlsx
    sheet: produits
  mouvements:
    source: data/mouvements.xlsx
    sheet: mouvements
```

## Dependencies

- `pandas` - Data manipulation
- `openpyxl` - Excel file reading
- Existing `schema.sql` - Defines validation rules

## Out of Scope

- Modifying Excel files (except auto-fix mode with backup)
- Validating business logic (e.g., stock levels, pricing)
- Schema validation (syntax of schema.sql itself)
- Data quality recommendations beyond basic checks

## Testing Requirements

```python
# tests/test_validate.py
def test_validate_all_files_success()
def test_validate_single_file()
def test_validate_missing_file()
def test_validate_missing_sheet()
def test_validate_missing_columns()
def test_validate_duplicate_primary_keys()
def test_validate_type_mismatch()
def test_validate_null_values()
def test_validate_strict_mode_warnings_as_errors()
def test_auto_fix_whitespace()
def test_auto_fix_dates()
def test_auto_fix_creates_backup()
def test_validate_empty_sheet()
def test_validate_extra_columns_warning()
def test_validate_date_format_check()
def test_validate_numeric_range_check()
```

## Priority

**Priority: HIGH** - Prevents import failures and data corruption

**Estimated Effort**: 6-8 hours

**Complexity**: MEDIUM - Requires schema parsing, data type validation, and error reporting


Option	Description	Default
`--file PATH`	Validate specific Excel file	All files in data/
`--all`	Validate all Excel files in data/	True
`--strict`	Fail on warnings (not just errors)	False
`--fix`	Attempt auto-fix for simple issues	False

#2: Validate Command #2

Description

Issue #2: Validate Command

Problem

Proposed Solution

Command Signature

Options

Validation Checks

1. File-Level Checks

2. Sheet-Level Checks

3. Column-Level Checks

4. Data Type Checks

5. Primary Key Checks

6. Null Value Checks

7. Data Range Checks

8. Date Format Checks

Output Examples

Implementation Requirements

File Structure

Key Components

Acceptance Criteria

Functionality

Auto-Fix

Output

Performance

Cross-Platform

Technical Notes

Schema Parsing

Validation Flow

Excel Sheet Mapping

Dependencies

Out of Scope

Testing Requirements

Priority

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions