New Architecture #2
Conversation
- Updated the `.gitignore` to include the `usage/` directory.
- Refactored `README.md` to focus on mypy usage and streamlined its content.
- Introduced a new Product Requirements Document (`prd.md`) detailing the architecture and features of nextract V2.
- Added multiple new chunking modules: fixed-size, semantic, and hybrid chunkers.
- Implemented a CLI for document extraction with commands for batch processing, schema suggestion, and configuration validation.
- Established a new telemetry system for logging and metrics tracking.
- Enhanced the core extraction logic and validation mechanisms.
- Added comprehensive tests for the new features and improved existing test coverage.
- Introduced a new theme option in the CLI for HTML output, allowing users to select between light, dark, or system themes.
- Updated the HTML formatter to apply the selected theme, including dynamic CSS for theme-specific styling.
- Enhanced the HTML template structure to support theme switching and improved the overall layout for a better user experience.
Code review

Found 1 issue.

(The review body did not extract cleanly; it referenced three code snippets at commit 9a585af: lines 29 to 43 of an unnamed file, `nextract/nextract/cli/commands/suggest_schema.py` lines 16 to 26, and `nextract/nextract/cli/commands/batch.py` lines 20 to 38.)
- Move tests to unit/ and integration/ subdirectories for better organization
- Add comprehensive integration test suites for chunkers, CLI, extractors, pipelines, providers, and schemas
- Add new unit tests for core artifacts, config, merge, and schema utilities
- Add CLAUDE.md with detailed architecture documentation for AI assistants
- Add a get_default_model_for_provider() function with provider-specific default models
- Update CLI commands and extract_simple to use provider-specific model defaults
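The provider-default lookup described above can be pictured as a small mapping plus an accessor. This is a hypothetical sketch: the model names and the exact function shape are assumptions, not nextract's actual defaults.

```python
# Hypothetical provider→default-model table; nextract's real values may differ.
_DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "anthropic": "claude-sonnet-4",
    "gemini": "gemini-2.0-flash",
}

def get_default_model_for_provider(provider: str) -> str:
    """Return a sensible default model for a provider, raising on unknown names."""
    try:
        return _DEFAULT_MODELS[provider.lower()]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
```

Centralizing defaults this way lets CLI commands and extract_simple omit an explicit model argument while still choosing a model appropriate to the selected provider.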
- Removed `schema_splitter.py`, as it is no longer needed.
- Updated type hints across various modules to use `list` and `dict` instead of `List` and `Dict`.
- Improved logging setup in `logging.py` to encourage user configuration.
- Enhanced the metrics event container in `metrics.py` with updated type hints.
- Refined business-rule validation logic in `business_rules.py` for clarity.
- Adjusted consistency validation to use updated type hints.
- Updated plan validation logic in `plan_validator.py` for better error handling and clarity.
- Improved schema validation logic in `schema_validator.py` for better error reporting.
- Updated `requirements.txt` to reflect new dependency versions.
- Revised the test suite documentation with clearer instructions and updated test counts.
- Added new unit tests for the hybrid chunker and the Textract extractor.
- Enhanced parallel-processing tests with better error handling and result validation.
- Improved schema utility tests to cover array schemas and structured output.
```python
    # Spreadsheets are binary formats, not textual
    ".xls", ".xlsx",
}
```
🔴 Excel (.xls/.xlsx) reclassification makes text extraction dead code, routes files to PDF conversion instead
Moving .xls and .xlsx from _TEXTUAL_EXTS to _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY in nextract/mimetypes_map.py causes is_textual() to return False for these extensions. However, _prepare_single_file in nextract/files.py:308-330 still has specialized Excel→TSV/CSV text extraction logic gated by is_textual(path). Since that guard now evaluates to False, the entire Excel text-extraction branch (.xlsx→TSV via _xlsx_to_text, .xls→CSV via _xls_to_text_via_cli) is unreachable dead code. Instead, Excel files now fall through to is_office_binary() at nextract/files.py:383, which attempts Office→PDF conversion — a much worse outcome for spreadsheets that were previously extracted as structured text.
Prompt for agents
The root issue is in nextract/mimetypes_map.py lines 36-41: .xls and .xlsx were moved from _TEXTUAL_EXTS to _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY. But nextract/files.py _prepare_single_file (lines 308-330) has specialized Excel text-extraction code gated by is_textual(path), which now always returns False for these extensions. Either: (1) Move .xls/.xlsx back to _TEXTUAL_EXTS and remove them from _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY, OR (2) Update _prepare_single_file in files.py to check for Excel extensions before the is_textual() guard (e.g. add an explicit check like `if path.suffix.lower() in {".xlsx", ".xls"}:` before line 308).
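Option (2) can be sketched as a routing function that checks Excel extensions before the textual guard. Everything here is illustrative: the extension sets and branch labels stand in for nextract's real `_prepare_single_file` logic and its `_xlsx_to_text` / `_xls_to_text_via_cli` helpers.

```python
from pathlib import Path

# Illustrative stand-ins for nextract's real extension tables.
EXCEL_EXTS = {".xlsx", ".xls"}
TEXTUAL_EXTS = {".txt", ".md", ".csv"}

def is_textual(path: Path) -> bool:
    return path.suffix.lower() in TEXTUAL_EXTS

def route_file(path: Path) -> str:
    """Return which preparation branch a file would take."""
    suffix = path.suffix.lower()
    # Option (2) from the review: test Excel extensions *before* the
    # is_textual() guard so the spreadsheet→text branch stays reachable
    # even though the extensions are now classified as binary.
    if suffix == ".xlsx":
        return "xlsx→TSV text extraction"
    if suffix == ".xls":
        return "xls→CSV text extraction"
    if is_textual(path):
        return "plain text read"
    return "office→PDF conversion fallback"
```

With this ordering, `.xlsx` and `.xls` never reach the Office→PDF fallback, restoring the structured-text extraction path the reclassification had made unreachable.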
```python
if strategy == "union":
    if values:
        # For list values, merge all lists
        merged_list = []
        for v in values:
            if isinstance(v, list):
                merged_list.extend(v)
            elif v:
                merged_list.append(v)
        merged[field_name] = merged_list if merged_list else values[0]
```
🔴 Multipass 'union' merge strategy converts scalar fields to lists, breaking schema validation
The new union merge strategy in _merge_results wraps all field values into a list via merged_list.append(v). For scalar fields (strings, numbers), this converts e.g. {"name": "John"} across two passes into {"name": ["John", "John"]} instead of keeping it as {"name": "John"}. This breaks JSON Schema validation for any non-array field. The old behavior treated union as equivalent to first_non_empty (take first non-empty value), which preserved scalar types.
Example of the breakage
Given two passes both returning {"name": "John", "total": 500} with a schema {"properties": {"name": {"type": "string"}, "total": {"type": "number"}}}, the union merge now produces {"name": ["John", "John"], "total": [500, 500]} — both fields are now lists, violating the schema.
Suggested change:

```diff
-if strategy == "union":
-    if values:
-        # For list values, merge all lists
-        merged_list = []
-        for v in values:
-            if isinstance(v, list):
-                merged_list.extend(v)
-            elif v:
-                merged_list.append(v)
-        merged[field_name] = merged_list if merged_list else values[0]
+if strategy == "union":
+    if values:
+        # For list values, merge all lists; for scalars, take first
+        if any(isinstance(v, list) for v in values):
+            merged_list = []
+            for v in values:
+                if isinstance(v, list):
+                    merged_list.extend(v)
+                elif v is not None:
+                    merged_list.append(v)
+            merged[field_name] = merged_list if merged_list else values[0]
+        else:
+            merged[field_name] = values[0]
```
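Extracted into a standalone function, the corrected strategy behaves as the review intends: list fields are concatenated across passes while scalar fields stay scalar. This is a sketch of the per-field merge, not nextract's actual `_merge_results`.

```python
def union_merge(values):
    """Union-merge one field's values across passes: concatenate lists,
    keep scalar fields scalar by taking the first value."""
    if not values:
        return None
    if any(isinstance(v, list) for v in values):
        merged = []
        for v in values:
            if isinstance(v, list):
                merged.extend(v)
            elif v is not None:  # None check, so 0 and False survive
                merged.append(v)
        return merged if merged else values[0]
    # All values are scalars: preserve the type instead of wrapping in a list.
    return values[0]

union_merge(["John", "John"])  # → "John", not ["John", "John"]
union_merge([[1, 2], [3]])     # → [1, 2, 3]
```

With this shape, the two-pass example from the review (`{"name": "John", "total": 500}` in both passes) merges back to the original scalars and continues to satisfy a schema declaring `name` as a string and `total` as a number.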
```python
elif v:
    merged_list.append(v)
```
🔴 Multipass 'union' merge drops falsy values like 0 and False
In the union merge strategy at nextract/multipass.py:288, the condition elif v: uses a truthiness check. This silently drops legitimate falsy values such as 0, 0.0, and False from the merged output. For example, a field with "total": 0 would pass the earlier filter at line 277 (value is not None and value != "" and value != []) but then be skipped by the elif v: guard, causing data loss.
Suggested change:

```diff
-elif v:
-    merged_list.append(v)
+elif v is not None:
+    merged_list.append(v)
```
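The difference between the two guards is easy to demonstrate in isolation:

```python
values = [0, False, "", None, 5]

# Truthiness filter: silently drops 0 and False along with "" and None.
kept_truthy = [v for v in values if v]

# Explicit None check: keeps legitimate falsy payloads like 0 and False.
kept_not_none = [v for v in values if v is not None]

print(kept_truthy)    # [5]
print(kept_not_none)  # [0, False, '', 5]
```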
Audio and video files previously failed in chunkers — SemanticChunker returned empty results and PageBasedChunker raised ValueError. Now all chunkers produce a single passthrough DocumentChunk for media files, allowing them to flow through the pipeline to providers that support audio/video natively (e.g., Gemini).
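The passthrough behavior can be sketched as follows. The `DocumentChunk` shape, the media-extension set, and the metadata key are assumptions for illustration; nextract's real chunker interface likely differs.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentChunk:
    content: bytes
    metadata: dict = field(default_factory=dict)

MEDIA_EXTS = {".mp3", ".wav", ".mp4", ".mov"}

def chunk(data: bytes, suffix: str) -> list[DocumentChunk]:
    # Media files bypass text chunking entirely: emit a single passthrough
    # chunk so a provider with native audio/video support (e.g. Gemini)
    # receives the raw bytes instead of an empty result or a ValueError.
    if suffix.lower() in MEDIA_EXTS:
        return [DocumentChunk(content=data, metadata={"passthrough": True})]
    # ...normal fixed-size/semantic/page-based chunking would go here...
    return [DocumentChunk(content=data)]
```

The key design point is that every chunker returns the same single-chunk shape for media, so downstream pipeline stages need no media-specific branching.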