New Architecture #2
Conversation
- Updated the `.gitignore` to include the `usage/` directory.
- Refactored `README.md` to focus on mypy usage and streamlined its content.
- Introduced a new Product Requirements Document (`prd.md`) detailing the architecture and features of nextract V2.
- Added multiple new chunking modules: fixed-size, semantic, and hybrid chunkers.
- Implemented a CLI for document extraction with commands for batch processing, schema suggestion, and configuration validation.
- Established a new telemetry system for logging and metrics tracking.
- Enhanced the core extraction logic and validation mechanisms.
- Added comprehensive tests for the new features and improved existing test coverage.
- Introduced a new theme option in the CLI for HTML output, allowing users to select between light, dark, or system themes.
- Updated the HTML formatter to apply the selected theme, including dynamic CSS for theme-specific styling.
- Enhanced the HTML template structure to support theme switching and improved the overall layout for a better user experience.
Code review

Found 1 issue.

(The review body did not extract cleanly; it referenced three code snippets at commit 9a585af: lines 29 to 43 of an unnamed file, `nextract/nextract/cli/commands/suggest_schema.py` lines 16 to 26, and `nextract/nextract/cli/commands/batch.py` lines 20 to 38.)
- Move tests to unit/ and integration/ subdirectories for better organization
- Add comprehensive integration test suites for chunkers, CLI, extractors, pipelines, providers, and schemas
- Add new unit tests for core artifacts, config, merge, and schema utilities
- Add CLAUDE.md with detailed architecture documentation for AI assistants
- Add a get_default_model_for_provider() function with provider-specific default models
- Update CLI commands and extract_simple to use provider-specific model defaults
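The provider-default lookup described above can be pictured as a small mapping plus an accessor. This is a hypothetical sketch: the model names and the exact function shape are assumptions, not nextract's actual defaults.

```python
# Hypothetical provider→default-model table; nextract's real values may differ.
_DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-sonnet-4",
    "gemini": "gemini-2.0-flash",
}

def get_default_model_for_provider(provider: str) -> str:
    """Return a sensible default model for a provider, raising on unknown names."""
    try:
        return _DEFAULT_MODELS[provider.lower()]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
```

Centralizing defaults this way lets CLI commands and extract_simple omit an explicit model argument while still choosing a model appropriate to the selected provider.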
- Removed `schema_splitter.py`, as it is no longer needed.
- Updated type hints across various modules to use `list` and `dict` instead of `List` and `Dict`.
- Improved logging setup in `logging.py` to encourage user configuration.
- Enhanced the metrics event container in `metrics.py` with updated type hints.
- Refined business-rule validation logic in `business_rules.py` for clarity.
- Adjusted consistency validation to use updated type hints.
- Updated plan validation logic in `plan_validator.py` for better error handling and clarity.
- Improved schema validation logic in `schema_validator.py` for better error reporting.
- Updated `requirements.txt` to reflect new dependency versions.
- Revised the test suite documentation with clearer instructions and updated test counts.
- Added new unit tests for the hybrid chunker and the Textract extractor.
- Enhanced parallel-processing tests with better error handling and result validation.
- Improved schema utility tests to cover array schemas and structured output.
```python
    # Spreadsheets are binary formats, not textual
    ".xls", ".xlsx",
}
```
🔴 Excel (.xls/.xlsx) reclassification makes text extraction dead code, routes files to PDF conversion instead
Moving .xls and .xlsx from _TEXTUAL_EXTS to _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY in nextract/mimetypes_map.py causes is_textual() to return False for these extensions. However, _prepare_single_file in nextract/files.py:308-330 still has specialized Excel→TSV/CSV text extraction logic gated by is_textual(path). Since that guard now evaluates to False, the entire Excel text-extraction branch (.xlsx→TSV via _xlsx_to_text, .xls→CSV via _xls_to_text_via_cli) is unreachable dead code. Instead, Excel files now fall through to is_office_binary() at nextract/files.py:383, which attempts Office→PDF conversion — a much worse outcome for spreadsheets that were previously extracted as structured text.
Prompt for agents
The root issue is in nextract/mimetypes_map.py lines 36-41: .xls and .xlsx were moved from _TEXTUAL_EXTS to _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY. But nextract/files.py _prepare_single_file (lines 308-330) has specialized Excel text-extraction code gated by is_textual(path), which now always returns False for these extensions. Either: (1) Move .xls/.xlsx back to _TEXTUAL_EXTS and remove them from _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY, OR (2) Update _prepare_single_file in files.py to check for Excel extensions before the is_textual() guard (e.g. add an explicit check like `if path.suffix.lower() in {".xlsx", ".xls"}:` before line 308).
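Option (2) can be sketched as a routing function that checks Excel extensions before the textual guard. Everything here is illustrative: the extension sets and branch labels stand in for nextract's real `_prepare_single_file` logic and its `_xlsx_to_text` / `_xls_to_text_via_cli` helpers.

```python
from pathlib import Path

# Illustrative stand-ins for nextract's real extension tables.
EXCEL_EXTS = {".xlsx", ".xls"}
TEXTUAL_EXTS = {".txt", ".md", ".csv"}

def is_textual(path: Path) -> bool:
    return path.suffix.lower() in TEXTUAL_EXTS

def route_file(path: Path) -> str:
    """Return which preparation branch a file would take."""
    suffix = path.suffix.lower()
    # Option (2) from the review: test Excel extensions *before* the
    # is_textual() guard so the spreadsheet→text branch stays reachable
    # even though the extensions are now classified as binary.
    if suffix == ".xlsx":
        return "xlsx→TSV text extraction"
    if suffix == ".xls":
        return "xls→CSV text extraction"
    if is_textual(path):
        return "plain text read"
    return "office→PDF conversion fallback"
```

With this ordering, `.xlsx` and `.xls` never reach the Office→PDF fallback, restoring the structured-text extraction path the reclassification had made unreachable.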
```python
if strategy == "union":
    if values:
        # For list values, merge all lists
        merged_list = []
        for v in values:
            if isinstance(v, list):
                merged_list.extend(v)
            elif v:
                merged_list.append(v)
        merged[field_name] = merged_list if merged_list else values[0]
```
🔴 Multipass 'union' merge strategy converts scalar fields to lists, breaking schema validation
The new union merge strategy in _merge_results wraps all field values into a list via merged_list.append(v). For scalar fields (strings, numbers), this converts e.g. {"name": "John"} across two passes into {"name": ["John", "John"]} instead of keeping it as {"name": "John"}. This breaks JSON Schema validation for any non-array field. The old behavior treated union as equivalent to first_non_empty (take first non-empty value), which preserved scalar types.
Example of the breakage
Given two passes both returning {"name": "John", "total": 500} with a schema {"properties": {"name": {"type": "string"}, "total": {"type": "number"}}}, the union merge now produces {"name": ["John", "John"], "total": [500, 500]} — both fields are now lists, violating the schema.
Suggested change:

```diff
-if strategy == "union":
-    if values:
-        # For list values, merge all lists
-        merged_list = []
-        for v in values:
-            if isinstance(v, list):
-                merged_list.extend(v)
-            elif v:
-                merged_list.append(v)
-        merged[field_name] = merged_list if merged_list else values[0]
+if strategy == "union":
+    if values:
+        # For list values, merge all lists; for scalars, take first
+        if any(isinstance(v, list) for v in values):
+            merged_list = []
+            for v in values:
+                if isinstance(v, list):
+                    merged_list.extend(v)
+                elif v is not None:
+                    merged_list.append(v)
+            merged[field_name] = merged_list if merged_list else values[0]
+        else:
+            merged[field_name] = values[0]
```
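Extracted into a standalone function, the corrected strategy behaves as the review intends: list fields are concatenated across passes while scalar fields stay scalar. This is a sketch of the per-field merge, not nextract's actual `_merge_results`.

```python
def union_merge(values):
    """Union-merge one field's values across passes: concatenate lists,
    keep scalar fields scalar by taking the first value."""
    if not values:
        return None
    if any(isinstance(v, list) for v in values):
        merged = []
        for v in values:
            if isinstance(v, list):
                merged.extend(v)
            elif v is not None:  # None check, so 0 and False survive
                merged.append(v)
        return merged if merged else values[0]
    # All values are scalars: preserve the type instead of wrapping in a list.
    return values[0]

union_merge(["John", "John"])  # → "John", not ["John", "John"]
union_merge([[1, 2], [3]])     # → [1, 2, 3]
```

With this shape, the two-pass example from the review (`{"name": "John", "total": 500}` in both passes) merges back to the original scalars and continues to satisfy a schema declaring `name` as a string and `total` as a number.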
```python
elif v:
    merged_list.append(v)
```
🔴 Multipass 'union' merge drops falsy values like 0 and False
In the union merge strategy at nextract/multipass.py:288, the condition elif v: uses a truthiness check. This silently drops legitimate falsy values such as 0, 0.0, and False from the merged output. For example, a field with "total": 0 would pass the earlier filter at line 277 (value is not None and value != "" and value != []) but then be skipped by the elif v: guard, causing data loss.
Suggested change:

```diff
-elif v:
-    merged_list.append(v)
+elif v is not None:
+    merged_list.append(v)
```
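The difference between the two guards is easy to demonstrate in isolation:

```python
values = [0, False, "", None, 5]

# Truthiness filter: silently drops 0 and False along with "" and None.
kept_truthy = [v for v in values if v]

# Explicit None check: keeps legitimate falsy payloads like 0 and False.
kept_not_none = [v for v in values if v is not None]

print(kept_truthy)    # [5]
print(kept_not_none)  # [0, False, '', 5]
```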
Audio and video files previously failed in chunkers — SemanticChunker returned empty results and PageBasedChunker raised ValueError. Now all chunkers produce a single passthrough DocumentChunk for media files, allowing them to flow through the pipeline to providers that support audio/video natively (e.g., Gemini).
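The passthrough behavior can be sketched as follows. The `DocumentChunk` shape, the media-extension set, and the metadata key are assumptions for illustration; nextract's real chunker interface likely differs.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentChunk:
    content: bytes
    metadata: dict = field(default_factory=dict)

MEDIA_EXTS = {".mp3", ".wav", ".mp4", ".mov"}

def chunk(data: bytes, suffix: str) -> list[DocumentChunk]:
    # Media files bypass text chunking entirely: emit a single passthrough
    # chunk so a provider with native audio/video support (e.g. Gemini)
    # receives the raw bytes instead of an empty result or a ValueError.
    if suffix.lower() in MEDIA_EXTS:
        return [DocumentChunk(content=data, metadata={"passthrough": True})]
    # ...normal fixed-size/semantic/page-based chunking would go here...
    return [DocumentChunk(content=data)]
```

The key design point is that every chunker returns the same single-chunk shape for media, so downstream pipeline stages need no media-specific branching.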