
feat(rag): embedding model support & rag improvements#192

Merged

vishnurk6247 merged 5 commits into develop from feat/rag_improvements on Dec 19, 2025

Conversation

@vizsatiz (Member) commented Dec 16, 2025

Summary by CodeRabbit

  • New Features

    • Document-type classification (PDF, Image, Text) added across ingestion and storage flows.
    • Model-type selection ("llm" vs "embedding") supported end-to-end: UI forms, validation, service/API handling, and persisted config storage; forms default to "llm".
    • Database migration adds model_type column with default "llm".
  • Chores

    • Added Docker Compose configuration for local PostgreSQL and Redis.


coderabbitai Bot commented Dec 16, 2025

Walkthrough

Adds a typed DocumentType enum and updates RAG ingestion code to use it; introduces persistent model_type across DB, service, controller, schemas, client UI and types; adds a docker-compose for Postgres (pgvector) and Redis; adds an Alembic migration and model field for LLM config model_type.

Changes

Cohort / File(s) Summary
RAG ingestion — typed document handling
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/models/doc_content.py, wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py, wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py, wavefront/server/background_jobs/rag_ingestion/rag_ingestion/service/kb_rag_storage.py
Introduces DocumentType enum (PDF, IMAGE, TEXT). FileProcessor.process_file now returns the extracted content together with its DocumentType (decoded text for TEXT, raw bytes for IMAGE).
LLM inference — model_type propagation (server)
wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2025_12_16_1406-f7572bcd9510_embedding_models.py, wavefront/server/modules/db_repo_module/db_repo_module/models/llm_inference_config.py, wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py, wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/models/schemas.py, wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/services/llm_inference_config_service.py
Adds non-nullable model_type column (String(64), server_default 'llm') via Alembic and model update. Controller and payload schemas accept and validate model_type; service.create_config gains model_type parameter and forwards it to repository.
Client UI and types — model_type field
wavefront/client/src/pages/apps/[appId]/llm-inference/CreateLLMInferenceDialog.tsx, wavefront/client/src/pages/apps/[appId]/llm-inference/[configId].tsx, wavefront/client/src/types/llm-inference-config.ts
Adds modelType/model_type to forms and schemas with default 'llm', includes the field in create/update payloads, adds a UI select for Model Type, and adds optional model_type? to TypeScript interfaces.
Local dev infra
wavefront/server/docker-compose.yml
Adds docker-compose with Postgres (ankane/pgvector) and Redis services, volume and bridge network for local development.
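For orientation, the typed document model described in the RAG ingestion cohort can be sketched roughly as follows. This is a minimal reconstruction: the names DocumentType, DocContent, and document_type come from the PR, but the enum's string values and the exact field types here are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union

class DocumentType(str, Enum):
    # Member names come from the PR; the string values are assumed.
    PDF = 'pdf'
    IMAGE = 'image'
    TEXT = 'text'

@dataclass
class DocContent:
    # TEXT documents carry a decoded str; IMAGE documents carry raw bytes.
    content: Union[str, bytes]
    document_type: DocumentType

doc = DocContent(content='hello world', document_type=DocumentType.TEXT)
```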

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas that may need extra attention:
    • wavefront/server/background_jobs/rag_ingestion/.../processors/file_processor.py — PDF extraction flow, temp-file lifecycle, error wrapping and logging.
    • All RAG ingestion dataclass and call sites — ensure no remaining parse_type strings and imports match DocumentType.
    • DB migration and model change — server_default syntax, nullable handling, and existing repository code that creates/updates rows.
    • Controller/service/schema/client wiring — consistent naming and mapping between model_type (server) and modelType (client), validation boundaries.
    • Client form behavior — default/reset/load flows to ensure modelType persists and API payloads match backend expectations.

"A rabbit hopped through typed terrain,
enums and migrations in its train.
Docker hummed, forms chose a type,
PDFs, images, texts take flight.
Tiny paws, big changes — code and grain." 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 36.36%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately reflects the main changes: adding embedding model support (via the model_type field and DocumentType enum) and RAG improvements (refactoring the RAG ingestion pipeline with DocumentType).


@coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/models/schemas.py (1)

66-75: Add model_type field to response schema.

The LlmInferenceConfigResponse schema is missing the model_type field. Since the ORM model's to_dict() method includes this field, it should be declared in the response schema for proper API documentation and type validation.

Apply this diff:

 class LlmInferenceConfigResponse(BaseModel):
     id: uuid.UUID
     llm_model: str
     display_name: str
     type: str
+    model_type: str
     base_url: Optional[str]
     parameters: Optional[Dict[str, Any]]
     is_deleted: bool
     created_at: datetime
     updated_at: datetime
wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py (1)

63-72: Add validation for model_type in create path.

The create endpoint passes payload.model_type directly to the service without validation, while the update endpoint (lines 234-249) validates that model_type must be either "llm" or "embedding". This inconsistency could allow invalid model_type values to be stored during creation.

Apply this diff to add validation before line 64:

+        # Validate model_type if provided
+        if payload.model_type is not None and payload.model_type not in ['llm', 'embedding']:
+            return JSONResponse(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                content=response_formatter.buildErrorResponse(
+                    'Invalid model_type value. Must be "llm" or "embedding"'
+                ),
+            )
+
         config_dict = await llm_inference_config_service.create_config(
🧹 Nitpick comments (4)
wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/models/schemas.py (1)

31-33: Consider using an enum for model_type values.

Currently, model_type is defined as a string with only documentation indicating allowed values ("llm" or "embedding"). Following the pattern used for InferenceEngineType, consider defining a ModelType enum to provide type safety and prevent invalid values at the schema level.

Example implementation:

class ModelType(str, Enum):
    LLM = 'llm'
    EMBEDDING = 'embedding'

Then update the schemas:

-    model_type: Optional[str] = Field(
-        'llm', description='Type of model: "llm" or "embedding" (defaults to "llm")'
+    model_type: Optional[ModelType] = Field(
+        ModelType.LLM, description='Type of model (defaults to "llm")'
     )

Also applies to: 55-57
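For context, a str-backed enum like the one suggested above keeps existing string comparisons working while rejecting invalid values at construction time; a standalone sketch:

```python
from enum import Enum

class ModelType(str, Enum):
    LLM = 'llm'
    EMBEDDING = 'embedding'

# Lookup by value round-trips to the enum member.
assert ModelType('embedding') is ModelType.EMBEDDING
# The str mixin keeps plain string comparisons working at existing call sites.
assert ModelType.LLM == 'llm'
# Invalid values fail fast instead of being silently stored.
try:
    ModelType('chat')
except ValueError as e:
    print(f'rejected: {e}')
```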

wavefront/server/docker-compose.yml (1)

21-21: Pin the Redis image to the latest stable patch version.

The latest patch version available for Redis 7.4 is 7.4.7. Using redis:7.4 without a patch version may lead to inconsistencies across environments. Redis 7.4.x versions are licensed under the dual RSALv2 or SSPLv1 license.

Apply this diff:

-    image: redis:7.4
+    image: redis:7.4.7
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)

27-46: Refactor exception handling for better error context.

The PDF text extraction logic is functional, but the exception handling could be improved:

  1. Missing exception chaining (line 43): Use raise ... from e to preserve the original error context for debugging.
  2. Return placement (line 39): Consider moving the return statement to an else block for better code structure.

Apply this diff to improve exception handling:

             try:
                 text_content = textract.process(
                     temp_file_path, method='pdfminer'
                 ).decode('utf-8')
-                return text_content, DocumentType.PDF
 
             except Exception as e:
                 logger.error(f'Text extraction failed for {mime_type}: {e}')
-                raise RuntimeError(f'Text extraction failed for {mime_type}: {e}')
+                raise RuntimeError(f'Text extraction failed for {mime_type}: {e}') from e
 
+            else:
+                return text_content, DocumentType.PDF
+
             finally:
                 os.unlink(temp_file_path)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py (1)

39-55: Update the docstring to reflect the new field name.

The method correctly uses the new FileProcessor API and handles the DocumentType properly. However, the docstring still references the old field name:

  • Line 43: "Extracts text content from a message based on its parse_type and file_type."
  • Line 50: "A DocContent object with extracted content and parse_type."

Apply this diff to update the docstring:

     async def _extract_content(
         self, message: QueueMessage, file_content: bytes
     ) -> DocContent:
         """
-        Extracts text content from a message based on its parse_type and file_type.
+        Extracts content from a message based on its file_type.
 
         Args:
             message: An object with 'parse_type' and 'file_type' attributes.
             file_content: The binary content of the file.
 
         Returns:
-            A DocContent object with extracted content and parse_type.
+            A DocContent object with extracted content and document_type.
         """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 027db96 and cba19c2.

📒 Files selected for processing (10)
  • wavefront/server/background_jobs/rag_ingestion/rag_ingestion/models/doc_content.py (1 hunks)
  • wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1 hunks)
  • wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py (6 hunks)
  • wavefront/server/background_jobs/rag_ingestion/rag_ingestion/service/kb_rag_storage.py (1 hunks)
  • wavefront/server/docker-compose.yml (1 hunks)
  • wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2025_12_16_1406-f7572bcd9510_embedding_models.py (1 hunks)
  • wavefront/server/modules/db_repo_module/db_repo_module/models/llm_inference_config.py (1 hunks)
  • wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py (4 hunks)
  • wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/models/schemas.py (2 hunks)
  • wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/services/llm_inference_config_service.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/models/doc_content.py (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)
  • DocumentType (9-12)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/service/kb_rag_storage.py (2)
wavefront/server/modules/knowledge_base_module/knowledge_base_module/embeddings/embed.py (1)
  • KnowledgeBaseEmbeddingObject (8-11)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)
  • DocumentType (9-12)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py (1)
  • process (106-138)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py (3)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/service/kb_rag_storage.py (3)
  • KBRagStorage (30-432)
  • process_document (305-339)
  • EmbeddingsToStore (23-27)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/models/doc_content.py (1)
  • DocContent (7-11)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (2)
  • DocumentType (9-12)
  • process_file (16-46)
wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py (1)
wavefront/client/server.cjs (1)
  • config (24-28)
🪛 Ruff (0.14.8)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py

39-39: Consider moving this statement to an else block

(TRY300)


41-41: Do not catch blind exception: Exception

(BLE001)


43-43: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


43-43: Avoid specifying long messages outside the exception class

(TRY003)


56-56: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (13)
wavefront/server/modules/db_repo_module/db_repo_module/alembic/versions/2025_12_16_1406-f7572bcd9510_embedding_models.py (1)

22-35: LGTM!

The migration correctly adds a non-nullable model_type column with a server_default='llm' to handle existing rows, and the downgrade properly reverses the change.
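To illustrate why server_default matters when adding a non-nullable column to a table that already has rows, here is a small stand-in using the standard library's SQLite (the real migration targets Postgres through Alembic; the table and column set are simplified):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE llm_inference_config (id INTEGER PRIMARY KEY, llm_model TEXT)'
)
conn.execute("INSERT INTO llm_inference_config (llm_model) VALUES ('existing-model')")

# A NOT NULL column needs a server-side default so pre-existing rows are
# backfilled instead of violating the constraint mid-migration.
conn.execute(
    "ALTER TABLE llm_inference_config "
    "ADD COLUMN model_type TEXT NOT NULL DEFAULT 'llm'"
)
row = conn.execute('SELECT model_type FROM llm_inference_config').fetchone()
```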

wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py (2)

115-121: LGTM!

The embedding filter correctly uses .get('model_type') which safely handles cases where the field might be missing from older configs, and properly filters for embedding-type models.
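The behavior being praised can be sketched in isolation (the config dicts and model names here are hypothetical):

```python
configs = [
    {'llm_model': 'chat-model-a', 'model_type': 'llm'},
    {'llm_model': 'embed-model-b', 'model_type': 'embedding'},
    {'llm_model': 'legacy-model'},  # row created before the migration: no key
]

# dict.get returns None for the legacy row instead of raising KeyError,
# so older configs are simply excluded rather than crashing the filter.
embedding_configs = [c for c in configs if c.get('model_type') == 'embedding']
```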


234-249: LGTM!

The validation logic for model_type in the update path is thorough, checking for null values and ensuring only "llm" or "embedding" are accepted. This provides good guardrails against invalid data.

wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/services/llm_inference_config_service.py (1)

37-97: LGTM!

The service layer correctly accepts the model_type parameter with an appropriate default of 'llm', includes it in logging for observability, and properly propagates it through to the repository layer.

wavefront/server/docker-compose.yml (1)

9-11: Hardcoded credentials are acceptable for local development.

The hardcoded PostgreSQL credentials (postgres/postgres) are suitable for local development but should never be used in production environments. Consider documenting this in a README or adding a comment in the file.

wavefront/server/background_jobs/rag_ingestion/rag_ingestion/service/kb_rag_storage.py (2)

6-9: LGTM! Import additions are properly used.

All newly added imports (ast, numpy, logger, datetime, EmbeddingFunc, DocumentType) are appropriately utilized throughout the file.

Also applies to: 17-19


22-28: LGTM! Type safety improvement with DocumentType enum.

The change from str to DocumentType for the file_type field provides better type safety and aligns with the PR's refactoring objectives.

wavefront/server/background_jobs/rag_ingestion/rag_ingestion/models/doc_content.py (1)

3-11: LGTM! Improved field naming and type safety.

The changes enhance both clarity and type safety:

  • Renaming parse_type to document_type is more descriptive
  • Using DocumentType enum instead of str prevents invalid values
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (2)

9-12: LGTM! Well-defined DocumentType enum.

The enum provides a clear, type-safe way to classify document types throughout the RAG pipeline.


15-25: LGTM! Clean handling for TEXT and IMAGE types.

The method signature change and the handling logic for TEXT (decode to UTF-8) and IMAGE (raw bytes) are correct and straightforward.
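A rough sketch of the TEXT and IMAGE branches described above (the real process_file takes a MIME string and derives the type; here the DocumentType is passed in directly, and the enum's string values are assumed):

```python
from enum import Enum
from typing import Tuple, Union

class DocumentType(str, Enum):
    PDF = 'pdf'
    IMAGE = 'image'
    TEXT = 'text'

def process_file(
    file_content: bytes, document_type: DocumentType
) -> Tuple[Union[str, bytes], DocumentType]:
    if document_type is DocumentType.TEXT:
        # TEXT: decode the raw bytes to a UTF-8 string.
        return file_content.decode('utf-8'), DocumentType.TEXT
    if document_type is DocumentType.IMAGE:
        # IMAGE: pass the raw bytes through unchanged.
        return file_content, DocumentType.IMAGE
    raise ValueError(f'Unhandled document type: {document_type}')

content, dtype = process_file(b'hello world', DocumentType.TEXT)
```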

wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py (3)

3-3: LGTM! Consistent DocumentType integration.

The import additions and type changes in KbStorageInsights are consistent with the PR's refactoring objectives.

Also applies to: 11-11, 18-23


57-104: LGTM! Correct DocumentType enum usage throughout.

The method correctly uses DocumentType enum values for conditional logic and properly passes document_type to EmbeddingsToStore.


106-138: LGTM! Proper integration of refactored method.

The process method correctly uses the renamed _extract_content method and properly extracts document_type from the returned DocContent.

Comment thread: wavefront/server/docker-compose.yml

@coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
wavefront/client/src/types/llm-inference-config.ts (1)

10-10: Consider using a union type for model_type to match validation constraints.

The server validates model_type as "llm" | "embedding", and the UI schemas use z.enum(['llm', 'embedding']). Using string here loses that type safety at the interface boundary.

+export type ModelType = 'llm' | 'embedding';
+
 export interface LLMInferenceConfig {
   id: string;
   llm_model: string;
   display_name: string;
   type: InferenceEngineType;
-  model_type?: string;
+  model_type?: ModelType;
   base_url?: string;
   ...

Apply the same change to CreateLLMConfigRequest (line 23) and UpdateLLMConfigRequest (line 33).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cba19c2 and ae7b1de.

📒 Files selected for processing (4)
  • wavefront/client/src/pages/apps/[appId]/llm-inference/CreateLLMInferenceDialog.tsx (5 hunks)
  • wavefront/client/src/pages/apps/[appId]/llm-inference/[configId].tsx (6 hunks)
  • wavefront/client/src/types/llm-inference-config.ts (3 hunks)
  • wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
wavefront/client/src/pages/apps/[appId]/llm-inference/CreateLLMInferenceDialog.tsx (1)
wavefront/client/src/components/ui/select.tsx (5)
  • Select (133-133)
  • SelectTrigger (136-136)
  • SelectValue (135-135)
  • SelectContent (137-137)
  • SelectItem (139-139)
🔇 Additional comments (12)
wavefront/server/modules/llm_inference_config_module/llm_inference_config_module/controllers/llm_inference_config_controller.py (1)

225-240: LGTM!

The validation logic for model_type in the update endpoint is well-implemented, following the same pattern as the type field validation. Error messages are clear and the allowed values are explicitly checked.

wavefront/client/src/pages/apps/[appId]/llm-inference/CreateLLMInferenceDialog.tsx (5)

74-74: LGTM!

The schema correctly uses z.enum(['llm', 'embedding']) for strict validation, matching the server-side allowed values.


103-111: LGTM!

Default values correctly initialize modelType to 'llm', which is a sensible default for backward compatibility.


132-142: LGTM!

The reset behavior correctly resets modelType to 'llm' when the dialog closes, maintaining consistency with the default values.


172-180: LGTM!

The payload correctly maps data.modelType to model_type in the API request, following the codebase's convention of camelCase in the client and snake_case for the API.


373-397: LGTM!

The Model Type field UI is well-implemented:

  • Consistent styling with other form fields
  • Clear labeling with required indicator
  • Helpful description explaining the purpose
  • Select options match the schema enum values
wavefront/client/src/pages/apps/[appId]/llm-inference/[configId].tsx (6)

51-51: LGTM!

Schema correctly validates model_type as an enum of 'llm' or 'embedding', consistent with the create dialog.


74-83: LGTM!

Default values correctly initialize model_type to 'llm'.


92-100: LGTM!

The form reset correctly handles model_type with a fallback to 'llm' for backward compatibility with existing records that may not have this field set.


136-142: LGTM!

The model_type is correctly included in the update payload.


245-257: LGTM!

The cancel handler correctly resets model_type from the config with the same fallback pattern used in the initial load, ensuring consistency.


377-398: LGTM!

The Model Type field UI is well-implemented:

  • Consistent with the create dialog implementation
  • Properly disabled when not in edit mode
  • Select options match the schema enum values

Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

@coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)

51-59: Critical: The `application/*` MIME mapping issue from the previous review remains unaddressed.

As flagged in the previous review, mapping all application/* MIME types to DocumentType.PDF is problematic. Non-PDF formats like application/json, application/xml, application/zip, etc. will fail during PDF text extraction, causing runtime errors.

The previous review provided a specific fix that should be applied:

     def extract_document_type(self, file_type: str) -> DocumentType:
         if file_type.startswith('text/plain'):
             return DocumentType.TEXT
         if file_type.startswith('image/'):
             return DocumentType.IMAGE
-        if file_type.startswith('application/'):
+        if file_type in ('application/pdf', 'application/x-pdf'):
             return DocumentType.PDF
         else:
             raise ValueError(f'Unsupported file type: {file_type}')

This ensures only actual PDF MIME types are processed as PDFs, while other application/* types are properly rejected with a clear error message.

🧹 Nitpick comments (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)

41-43: Refine exception handling for better error context.

The current exception handling could be improved to preserve the exception chain and be more specific:

Apply this diff to improve exception handling:

-            except Exception as e:
+            except (textract.exceptions.ShellError, UnicodeDecodeError) as e:
                 logger.error(f'Text extraction failed for {mime_type}: {e}')
-                raise RuntimeError(f'Text extraction failed for {mime_type}: {e}')
+                raise RuntimeError(f'Text extraction failed for {mime_type}') from e

This change:

  • Catches specific exceptions instead of blind Exception
  • Uses raise...from to preserve the exception chain for better debugging
  • Avoids duplicating error details in the message (they're in the chained exception)
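What raise ... from buys you, in a self-contained example (a decode failure stands in for a textract error; the mime_type value is illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def extract_text(data: bytes, mime_type: str) -> str:
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError as e:
        logger.error('Text extraction failed for %s: %s', mime_type, e)
        # 'from e' records the original error as __cause__, so the full
        # traceback chain survives for debugging.
        raise RuntimeError(f'Text extraction failed for {mime_type}') from e

try:
    extract_text(b'\xff\xfe', 'application/pdf')
except RuntimeError as err:
    cause = err.__cause__  # the original UnicodeDecodeError
```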
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae7b1de and 6ce989a.

📒 Files selected for processing (1)
  • wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py

39-39: Consider moving this statement to an else block

(TRY300)


41-41: Do not catch blind exception: Exception

(BLE001)


43-43: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


43-43: Avoid specifying long messages outside the exception class

(TRY003)


49-49: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (3)

4-6: LGTM!

The new imports are appropriate and properly utilized throughout the file.


9-12: LGTM!

The DocumentType enum provides clear type safety and improves code maintainability. The string values align well with document processing needs.


16-49: Good defensive programming with the explicit raise.

The explicit raise at line 48-49 addresses the past review concern about mixing implicit and explicit returns. While technically unreachable (all DocumentType enum values are handled), it serves as a safety net against future enum additions.


@coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)

19-20: Remove redundant variable assignment.

Line 19 assigns mime_type = file_type but then only uses mime_type once on line 20. The file_type parameter can be passed directly to extract_document_type.

Apply this diff:

-        mime_type = file_type
-        document_type = self.extract_document_type(mime_type)
+        document_type = self.extract_document_type(file_type)

Also update the error messages on lines 42 and 43 to use file_type instead of mime_type.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ce989a and 9fab565.

📒 Files selected for processing (2)
  • wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1 hunks)
  • wavefront/server/modules/db_repo_module/db_repo_module/models/llm_inference_config.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wavefront/server/modules/db_repo_module/db_repo_module/models/llm_inference_config.py
🧰 Additional context used
🧬 Code graph analysis (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (1)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/kb_storage_processor.py (1)
  • process (106-138)
🪛 Ruff (0.14.8)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py

39-39: Consider moving this statement to an else block

(TRY300)


41-41: Do not catch blind exception: Exception

(BLE001)


43-43: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


43-43: Avoid specifying long messages outside the exception class

(TRY003)


49-49: Avoid specifying long messages outside the exception class

(TRY003)


59-59: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (3)
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py (3)

9-12: LGTM! Clean enum design for document type safety.

The DocumentType enum provides clear type safety for document processing and aligns well with the typed approach mentioned in the PR objectives.


48-49: Good defensive programming - past review addressed.

The explicit raise at the end properly addresses the previous review concern about mixing explicit and implicit returns. While technically unreachable under the current enum values, it serves as a safety net if DocumentType is extended in the future.


56-57: Past review properly addressed.

The refined MIME type check now specifically targets PDF types (application/pdf and application/x-pdf) rather than all application/* types, correctly addressing the previous review concern.

Comment on lines 41 to 43

            except Exception as e:
                # Re-raise processing errors
                logger.error(f'Text extraction failed for {mime_type}: {e}')
                raise RuntimeError(f'Text extraction failed for {mime_type}: {e}')

⚠️ Potential issue | 🟠 Major

Improve exception handling with chaining and specific types.

The exception handling has two issues:

  1. Catching blind Exception is too broad and may mask unexpected errors
  2. The re-raised RuntimeError lacks from e, which loses the original traceback

Apply this diff:

-            except Exception as e:
+            except (textract.exceptions.ShellError, UnicodeDecodeError, OSError) as e:
                 logger.error(f'Text extraction failed for {mime_type}: {e}')
-                raise RuntimeError(f'Text extraction failed for {mime_type}: {e}')
+                raise RuntimeError(f'Text extraction failed for {mime_type}: {e}') from e

Note: Verify the specific exception types raised by textract if textract.exceptions.ShellError is not available.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 Ruff (0.14.8)

41-41: Do not catch blind exception: Exception

(BLE001)


43-43: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


43-43: Avoid specifying long messages outside the exception class

(TRY003)

Comment on lines +51 to +59

     def extract_document_type(self, file_type: str) -> DocumentType:
         if file_type.startswith('text/plain'):
             return DocumentType.TEXT
         if file_type.startswith('image/'):
             return DocumentType.IMAGE
         if file_type in ('application/pdf', 'application/x-pdf'):
             return DocumentType.PDF
         else:
-            raise ValueError(f'Unsupported file type: {mime_type}')
+            raise ValueError(f'Unsupported file type: {file_type}')

⚠️ Potential issue | 🟠 Major

Implement case-insensitive MIME type matching.

MIME types are case-insensitive per RFC 2045, but the current implementation uses case-sensitive startswith() and in checks. This could reject valid uploads if MIME types arrive with different casing (e.g., Text/Plain, IMAGE/PNG, Application/PDF).

Apply this diff:

     def extract_document_type(self, file_type: str) -> DocumentType:
-        if file_type.startswith('text/plain'):
+        file_type_lower = file_type.lower()
+        if file_type_lower.startswith('text/plain'):
             return DocumentType.TEXT
-        if file_type.startswith('image/'):
+        if file_type_lower.startswith('image/'):
             return DocumentType.IMAGE
-        if file_type in ('application/pdf', 'application/x-pdf'):
+        if file_type_lower in ('application/pdf', 'application/x-pdf'):
             return DocumentType.PDF
         else:
             raise ValueError(f'Unsupported file type: {file_type}')
🧰 Tools
🪛 Ruff (0.14.8)

59-59: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In
wavefront/server/background_jobs/rag_ingestion/rag_ingestion/processors/file_processor.py
around lines 51 to 59, the MIME type checks are case-sensitive and can fail for
differently-cased values; normalize the incoming file_type to a canonical
lower-case string (and defensively handle None or empty values) before
performing startswith() or in checks, then use the same lower-case comparisons
for 'text/plain', 'image/', and the PDF MIME variants so matching is
case-insensitive.
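Putting the two review suggestions together (PDF-only matching of application/* types plus case-insensitive comparison), a runnable sketch of where extract_document_type lands (the enum's string values are assumed):

```python
from enum import Enum

class DocumentType(str, Enum):
    PDF = 'pdf'
    IMAGE = 'image'
    TEXT = 'text'

def extract_document_type(file_type: str) -> DocumentType:
    # MIME types are case-insensitive per RFC 2045, so normalize first;
    # the (file_type or '') guard also tolerates a missing value.
    ft = (file_type or '').lower()
    if ft.startswith('text/plain'):
        return DocumentType.TEXT
    if ft.startswith('image/'):
        return DocumentType.IMAGE
    if ft in ('application/pdf', 'application/x-pdf'):
        return DocumentType.PDF
    raise ValueError(f'Unsupported file type: {file_type}')
```

Note that the membership check will not match a PDF MIME type carrying parameters (e.g. a trailing '; qs=0.8'); a startswith check on the two PDF types would cover that case if it arises.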

@vishnurk6247 vishnurk6247 merged commit 20ee522 into develop Dec 19, 2025
8 checks passed
@vishnurk6247 vishnurk6247 deleted the feat/rag_improvements branch December 19, 2025 07:19
thomastomy5 pushed a commit that referenced this pull request Apr 27, 2026
* feat(rag): minor cleanup on rag processing worker

* Add support for gemini embeddings

* fix for frontend to support model types

* Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* fix review comments

---------

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>