
New Architecture #2

Open

saksham-nexla wants to merge 5 commits into main from chore/new-arch

Conversation

@saksham-nexla (Member) commented Dec 20, 2025

- Updated the .gitignore to include usage/ directory.
- Refactored README.md to focus on mypy usage and streamlined content.
- Introduced a new Product Requirements Document (prd.md) detailing the architecture and features of nextract V2.
- Added multiple new modules for chunking, including fixed size, semantic, and hybrid chunkers.
- Implemented a CLI for document extraction with commands for batch processing, schema suggestion, and configuration validation.
- Established a new telemetry system for logging and metrics tracking.
- Enhanced the core extraction logic and validation mechanisms.
- Added comprehensive tests for the new features and improved existing test coverage.
- Introduced a new theme option in the CLI for HTML output, allowing users to select between light, dark, or system themes.
- Updated the HTML formatter to apply the selected theme, including dynamic CSS for theme-specific styling.
- Enhanced the HTML template structure to support theme switching and improved overall layout for better user experience.
@saksham-nexla (Member, Author) commented:

Code review

Found 1 issue:

  1. Hardcoded "gpt-4o" model name breaks non-OpenAI providers (bug in extract_simple and CLI commands)

The extract_simple() function defaults to model="gpt-4o" regardless of which provider is specified. This will cause runtime errors when users call extract_simple(document, schema, provider="anthropic") or any non-OpenAI provider without explicitly providing a model parameter. The model name "gpt-4o" is OpenAI-specific and will fail when passed to other providers like Anthropic, Google, Cohere, etc. The same bug appears in CLI commands suggest_schema.py and batch.py.

    def extract_simple(
        document: str,
        schema: dict,
        provider: str,
        model: str | None = None,
        prompt: str | None = None,
    ) -> Any:
        """Simplest extraction helper using a default text pipeline."""
        provider_config = ProviderConfig(name=provider, model=model or "gpt-4o")
        extractor_config = ExtractorConfig(name="text", provider=provider_config)
        chunker_config = ChunkerConfig(name="semantic")
        plan = ExtractionPlan(extractor=extractor_config, chunker=chunker_config)
        pipeline = ExtractionPipeline(plan)
        return pipeline.extract(document=document, schema=schema, prompt=prompt)

    @app.command("suggest-schema")
    def cli_suggest_schema(
        samples: list[Path] = typer.Argument(..., exists=True, readable=True),
        prompt: str = typer.Option(..., "--prompt", "-p"),
        provider: str = typer.Option("openai", "--provider"),
        model: Optional[str] = typer.Option(None, "--model"),
        examples: Optional[Path] = typer.Option(None, "--examples"),
        output: Optional[Path] = typer.Option(None, "--output", "-o"),
    ) -> None:
        provider_config = ProviderConfig(name=provider, model=model or "gpt-4o")
        generator = SchemaGenerator(provider=provider_config)

    @app.command("batch")
    def cli_batch(
        documents: list[Path] = typer.Argument(..., exists=True, readable=True),
        schema: Path = typer.Option(..., "--schema", "-s"),
        prompt: Optional[str] = typer.Option(None, "--prompt", "-p"),
        extractor: str = typer.Option("text", "--extractor"),
        provider: str = typer.Option("openai", "--provider"),
        model: Optional[str] = typer.Option(None, "--model"),
        chunker: str = typer.Option("semantic", "--chunker"),
        max_workers: int = typer.Option(4, "--max-workers"),
        include_extra: bool = typer.Option(False, "--include-extra"),
        num_passes: int = typer.Option(1, "--num-passes"),
        enable_suggestions: bool = typer.Option(False, "--enable-suggestions"),
    ) -> None:
        schema_obj = _load_schema(schema)
        provider_config = ProviderConfig(name=provider, model=model or "gpt-4o")
        extractor_config = ExtractorConfig(name=extractor, provider=provider_config)
        chunker_config = ChunkerConfig(name=chunker)
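A later commit in this PR adds a `get_default_model_for_provider()` helper to replace the hardcoded fallback. A minimal sketch of what such a helper could look like follows; only the `"gpt-4o"` OpenAI default is confirmed by the code above, and the other model names in the mapping are illustrative assumptions.

```python
# Hypothetical provider-to-default-model mapping. Only "gpt-4o" for
# OpenAI appears in the PR; the other entries are illustrative.
_DEFAULT_MODELS: dict[str, str] = {
    "openai": "gpt-4o",
    "anthropic": "claude-3-5-sonnet-latest",  # assumed name
    "google": "gemini-1.5-pro",               # assumed name
}


def get_default_model_for_provider(provider: str) -> str:
    """Return a provider-appropriate default model instead of an
    OpenAI-specific hardcoded name."""
    try:
        return _DEFAULT_MODELS[provider]
    except KeyError:
        raise ValueError(
            f"No default model known for provider {provider!r}; "
            "pass a model explicitly"
        ) from None
```

Call sites would then read `ProviderConfig(name=provider, model=model or get_default_model_for_provider(provider))`, failing fast with a clear error for providers without a known default rather than sending an OpenAI model name to the wrong API.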

🤖 Generated with Claude Code


- Move tests to unit/integration subdirectories for better organization
- Add comprehensive integration test suites for chunkers, CLI, extractors, pipelines, providers, and schemas
- Add new unit tests for core artifacts, config, merge, and schema utilities
- Add CLAUDE.md with detailed architecture documentation for AI assistants
- Add get_default_model_for_provider() function with provider-specific default models
- Update CLI commands and extract_simple to use provider-specific model defaults
- Removed `schema_splitter.py` as it is no longer needed.
- Updated type hints across various modules to use `list` and `dict` instead of `List` and `Dict`.
- Improved logging setup in `logging.py` to encourage user configuration.
- Enhanced metrics event container in `metrics.py` with updated type hints.
- Refined business rule validation logic in `business_rules.py` for better clarity.
- Adjusted consistency validation to use updated type hints.
- Updated plan validation logic in `plan_validator.py` for better error handling and clarity.
- Improved schema validation logic in `schema_validator.py` for better error reporting.
- Updated `requirements.txt` to reflect new dependency versions.
- Revised test suite documentation to provide clearer instructions and updated test counts.
- Added new unit tests for hybrid chunker and Textract extractor.
- Enhanced parallel processing tests to improve error handling and result validation.
- Improved schema utility tests to cover array schemas and structured output.
@devin-ai-integration Bot (Contributor) left a comment
Devin Review found 3 potential issues.

View 7 additional findings in Devin Review.


Comment thread: nextract/mimetypes_map.py
Comment on lines +39 to 41

        # Spreadsheets are binary formats, not textual
        ".xls", ".xlsx",
    }

🔴 Excel (.xls/.xlsx) reclassification makes text extraction dead code, routes files to PDF conversion instead

Moving .xls and .xlsx from _TEXTUAL_EXTS to _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY in nextract/mimetypes_map.py causes is_textual() to return False for these extensions. However, _prepare_single_file in nextract/files.py:308-330 still has specialized Excel→TSV/CSV text extraction logic gated by is_textual(path). Since that guard now evaluates to False, the entire Excel text-extraction branch (.xlsx→TSV via _xlsx_to_text, .xls→CSV via _xls_to_text_via_cli) is unreachable dead code. Instead, Excel files now fall through to is_office_binary() at nextract/files.py:383, which attempts Office→PDF conversion — a much worse outcome for spreadsheets that were previously extracted as structured text.

Prompt for agents
The root issue is in nextract/mimetypes_map.py lines 36-41: .xls and .xlsx were moved from _TEXTUAL_EXTS to _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY. But nextract/files.py _prepare_single_file (lines 308-330) has specialized Excel text-extraction code gated by is_textual(path), which now always returns False for these extensions. Either: (1) Move .xls/.xlsx back to _TEXTUAL_EXTS and remove them from _UNSUPPORTED_BUT_ACCEPTABLE_AS_BINARY, OR (2) Update _prepare_single_file in files.py to check for Excel extensions before the is_textual() guard (e.g. add an explicit check like `if path.suffix.lower() in {".xlsx", ".xls"}:` before line 308).
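A minimal sketch of option (2), assuming `_prepare_single_file` can branch on the suffix before the `is_textual()` guard; the helper name below is hypothetical, not from the PR:

```python
from pathlib import Path

# Hypothetical helper: route spreadsheets to the specialized text
# extraction explicitly, independent of what is_textual() now returns
# for these extensions.
_EXCEL_EXTS = {".xlsx", ".xls"}


def wants_excel_text_extraction(path: Path) -> bool:
    """True when the file should take the Excel->TSV/CSV branch."""
    return path.suffix.lower() in _EXCEL_EXTS
```

With a guard like this checked first, the existing `_xlsx_to_text` / `_xls_to_text_via_cli` branch stays reachable even though `is_textual()` reports the files as binary.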

Comment thread: nextract/multipass.py
Comment on lines +281 to +290

    if strategy == "union":
        if values:
            # For list values, merge all lists
            merged_list = []
            for v in values:
                if isinstance(v, list):
                    merged_list.extend(v)
                elif v:
                    merged_list.append(v)
            merged[field_name] = merged_list if merged_list else values[0]

🔴 Multipass 'union' merge strategy converts scalar fields to lists, breaking schema validation

The new union merge strategy in _merge_results wraps all field values into a list via merged_list.append(v). For scalar fields (strings, numbers), this converts e.g. {"name": "John"} across two passes into {"name": ["John", "John"]} instead of keeping it as {"name": "John"}. This breaks JSON Schema validation for any non-array field. The old behavior treated union as equivalent to first_non_empty (take first non-empty value), which preserved scalar types.

Example of the breakage

Given two passes both returning {"name": "John", "total": 500} with a schema {"properties": {"name": {"type": "string"}, "total": {"type": "number"}}}, the union merge now produces {"name": ["John", "John"], "total": [500, 500]} — both fields are now lists, violating the schema.

Suggested change

Before:

    if strategy == "union":
        if values:
            # For list values, merge all lists
            merged_list = []
            for v in values:
                if isinstance(v, list):
                    merged_list.extend(v)
                elif v:
                    merged_list.append(v)
            merged[field_name] = merged_list if merged_list else values[0]

After:

    if strategy == "union":
        if values:
            # For list values, merge all lists; for scalars, take first
            if any(isinstance(v, list) for v in values):
                merged_list = []
                for v in values:
                    if isinstance(v, list):
                        merged_list.extend(v)
                    elif v is not None:
                        merged_list.append(v)
                merged[field_name] = merged_list if merged_list else values[0]
            else:
                merged[field_name] = values[0]
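Extracted into a standalone function, the suggested list-aware union behaves as follows; this is a sketch for illustration, not the PR's exact code:

```python
def union_merge(values: list) -> object:
    """Union-merge field values across passes: concatenate list values,
    but keep the first value for scalar fields so schema types survive."""
    if any(isinstance(v, list) for v in values):
        merged: list = []
        for v in values:
            if isinstance(v, list):
                merged.extend(v)
            elif v is not None:  # keep falsy scalars like 0 and False
                merged.append(v)
        return merged if merged else values[0]
    # No list values at all: treat like first_non_empty to preserve scalars.
    return values[0]
```

For example, `union_merge(["John", "John"])` stays a scalar string, while `union_merge([[1], [2]])` concatenates to one list, which matches the schema expectations described in the finding above.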

Comment thread: nextract/multipass.py
Comment on lines +288 to +289

    elif v:
        merged_list.append(v)

🔴 Multipass 'union' merge drops falsy values like 0 and False

In the union merge strategy at nextract/multipass.py:288, the condition elif v: uses a truthiness check. This silently drops legitimate falsy values such as 0, 0.0, and False from the merged output. For example, a field with "total": 0 would pass the earlier filter at line 277 (value is not None and value != "" and value != []) but then be skipped by the elif v: guard, causing data loss.

Suggested change

    - elif v:
    + elif v is not None:
          merged_list.append(v)

Audio and video files previously failed in chunkers — SemanticChunker
returned empty results and PageBasedChunker raised ValueError. Now all
chunkers produce a single passthrough DocumentChunk for media files,
allowing them to flow through the pipeline to providers that support
audio/video natively (e.g., Gemini).
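The passthrough behavior described above can be sketched as follows; `DocumentChunk` is simplified here, and the real nextract type presumably carries more metadata:

```python
from dataclasses import dataclass


@dataclass
class DocumentChunk:
    """Simplified stand-in for nextract's chunk type."""
    content: bytes
    index: int = 0


def chunk_media(data: bytes) -> list[DocumentChunk]:
    # Media files are not split: emit a single passthrough chunk so
    # audio/video bytes reach providers that handle them natively
    # (e.g. Gemini) instead of returning empty results or raising.
    return [DocumentChunk(content=data)]
```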
