A C# ASP.NET Core backend with Blazor frontend for document ingestion, processing, and RAG (Retrieval-Augmented Generation) functionality using ChromaDB and Azure OpenAI.
- Backend: ASP.NET Core Web API
- Frontend: Blazor Server
- Vector Database: ChromaDB (via network HTTP service)
- Embeddings: Azure OpenAI text-embedding-ada-002
- Chat Model: Azure OpenAI GPT-4o (via L3Harris corporate gateway)
- Document Processing: PDF and Word document extraction
- ✅ PDF text extraction (
PdfTextExtractor.cs) - ✅ Word document text extraction (
WordTextExtractor.cs) - ✅ Text chunking service with configurable options (
TextChunkingService.cs) - ✅ Document factory pattern for processor selection (
DocumentProcessorFactory.cs)
- ✅ Azure OpenAI integration (
OpenAIEmbeddingService.cs) - ✅ Batch processing for embeddings (16 chunks per request)
- ✅ Rate limiting and error handling
- ✅ 1536-dimensional embeddings generated successfully
- ✅ Network HTTP service for ChromaDB operations (
chroma_http_service.py) - ✅ Fast HTTP client for eliminating Python startup overhead (
ChromaDbHttpClient.cs) - ✅ Collection management (create, add, query)
- ✅ Document upsert with embeddings and metadata
- ✅ Legacy Python bridge fallback (
ChromaDbClient.cs,chroma_bridge.py)
- ✅
/api/document/ingest-chunksendpoint - ✅ Processes existing chunks from
chunks_test.json - ✅ Returns detailed ingestion results with timing
Problem: ChromaDB Python bridge failing with "filename or extension is too long" error on Windows with OneDrive paths.
Root Cause:
- Very long OneDrive path:
C:\Users\...\OneDrive - L3Harris - GCCHigh\Documents\Code\chatbot_workingPDFextractionCopy\Backend - Windows limitations on both working directory path length and command line argument length
Solution Implemented:
- Short Path Strategy: Copy Python script to
C:\temp\cb.py - Short Working Directory: Use
C:\tempinstead of long project path - File-Based Data Transfer: Write JSON data to
C:\temp\data.jsoninstead of command line arguments - Python Bridge Updates: Added
--data-fileparameter support - Proper Cleanup: Temporary file cleanup after execution
Code Changes:
- Modified
ChromaDbClient.csto use temporary file approach - Updated
chroma_bridge.pyto handle--data-fileparameter - Added proper exception handling and cleanup
Result:
- ✅ 76/76 chunks successfully ingested
- ✅ No path length errors
- ✅ Complete document ingestion pipeline working
- Endpoint:
https://api-lhxgpt.ai.l3harris.com/cgp - Discovery: Corporate gateway ignores API version parameters and handles routing internally
-
GPT-4o ✅ - Currently deployed and working
- Deployment name:
gpt-4o - ~128k token context window
- Fast response times (300-600ms average)
- Deployment name:
-
GPT-3.5-turbo ✅ - Previously working (switched from)
- Deployment name:
gpt-35-turbo - ~4k token context window
- Very fast response times
- Deployment name:
-
text-embedding-ada-002 ✅ - Embeddings working
- 1536-dimensional embeddings
- Batch processing (16 chunks per request)
✅ CONFIRMED WORKING:
2023-12-01-preview✅ - Worked with gpt-35-turbo2024-02-01✅ - Worked with gpt-4o2024-06-01✅ - Worked with gpt-4o2024-10-01-preview✅ - Worked with gpt-4o2024-12-01-preview✅ - LATEST WORKING (current)
❌ CONFIRMED FAILING:
2024-05-13❌ - Failed (oddly, despite being between working versions)2024-11-20❌ - Failed (too new for Azure Gov)2024-12-15-preview❌ - Failed2025-01-15-preview❌ - Failed (2025 versions not available)
🤔 ODDITIES DISCOVERED:
-
Corporate Gateway Behavior: API versions are ignored by L3Harris gateway
- Can set ANY version (even
2024-1003450-01-preview) and it still works? think I just needed to rebuild instead of rerun. - Gateway possibly uses fixed internal API version?
- Can set ANY version (even
-
Version Gaps:
2024-05-13failed despite being between working2024-02-01and2024-06-01- Suggests either invalid version format or gateway validation quirks
{
"ChatDeploymentName": "gpt-4o",
"ApiVersion": "2024-12-01-preview",
"DeploymentName": "text-embedding-ada-002"
}- Document Ingestion: ✅ Multi-document support with 10 technical documents
- Embedding Generation: ✅ Azure OpenAI text-embedding-ada-002 (1536-dimensional)
- ChromaDB Storage: ✅ Vector search across collections with MMR and deduplication
- Chat Completions: ✅ Azure OpenAI GPT-4o integration (simplified single-call)
- Search API: ✅ Multi-document semantic search with source attribution
- Chat Interface: ✅ Full Blazor interactive UI with comprehensive debug transparency
- End-to-End Flow: ✅ Streamlined user experience with improved AI intelligence
- Backend APIs: ✅ Simplified endpoints with comprehensive cleanup and optimizations
- Code Quality: ✅ Removed unused code, eliminated hardcoded patterns, scalable metadata
- Architecture: ✅ Clean, maintainable single-stage RAG pipeline with pre-retrieval enhancements
- Document:
extracted.txt(76 chunks) - Embedding generation: 4034ms
- All 76 chunks successfully ingested
- Collection:
doc_extracted_8ef5ddf4
- ✅ Multi-collection search across documents (
QueryAllCollectionsAsync) - ✅ TPS2-100 Cable Assembly document (76 chunks) ingested
- ✅ TPS4-14 Labels/Markers/Decals document (57 chunks) ingested
- ✅
/api/document/searchendpoint with similarity ranking - ✅ Comprehensive metadata with source document tracking
- ✅ Azure OpenAI chat completion service (
OpenAIEmbeddingService.cs) - ✅
/api/document/chatendpoint for conversational AI - ✅ Conversation history management and context handling
- ✅ Configurable temperature and token limits
- ✅ Interactive chat UI with real-time messaging
- ✅ Modern chat bubble design with timestamps
- ✅ Input validation and loading states
- ✅ Settings panel for AI parameters
- ✅ Full conversation flow working end-to-end
Problem: No button clicks, key presses, or @bind directives worked in Blazor frontend. Users could not interact with the interface at all.
Root Cause: Missing @rendermode InteractiveServer directive in Blazor components. In .NET 9, Blazor components default to static server-side rendering without explicit interactivity mode.
Solution:
@rendermode InteractiveServerAdded to Home.razor component header.
Result: All event handlers, binding, and interactivity immediately started working.
Problem: Frontend consistently failed to connect to backend with "No connection could be made because the target machine actively refused it. (localhost:7001)" error.
Root Cause: Multiple configuration layers causing incorrect backend URL:
Program.csdefault:localhost:7001appsettings.jsonoverride:localhost:7001- Actual backend running on:
localhost:7000 - HttpClient factory caching old configuration
Solutions Applied:
- Fixed
Program.csdefault URL ❌ (still used 7001) - Fixed
appsettings.jsonconfiguration ❌ (still used 7001) - Final Fix: Hardcoded URL in
DocumentService.csconstructor ✅
_httpClient.BaseAddress = new Uri("http://localhost:7000");Result: Frontend successfully connects to backend on correct port.
- ✅ Per-document sampling ensures balanced representation from all documents
- ✅ Deduplication removes redundant content using text similarity analysis
- ✅ Configurable sampling strategies: "balanced" (min per doc + fill remaining) or "fixed" (equal allocation)
- ✅ Configurable deduplication threshold (0.0-1.0, default: 0.7)
- ✅ API parameters added to both search and RAG chat endpoints
- ✅ Python bridge enhanced with sampling statistics logging
Problem Solved: System was showing "source documents: 1" because all top search results came from the same document due to global similarity ranking.
Solution: Implemented minimum allocation per document (ensuring each document gets at least one chunk) plus intelligent remaining slot filling, combined with text-similarity-based deduplication to remove redundant content.
Result: RAG responses now consistently show "source documents: 2+" with more diverse, information-rich context from multiple documents.
- ✅ Implemented classic MMR algorithm:
MMR = λ * relevance - (1-λ) * max_similarity_to_selected - ✅ Two-phase MMR application: during per-document sampling AND final selection
- ✅ Iterative selection for optimal diversity progression (not batch MMR)
- ✅ Configurable λ parameter: 0.0=max diversity, 1.0=max relevance, 0.5=balanced (default)
- ✅ API integration:
EnableMmrandMmrLambdaparameters in search and RAG chat endpoints - ✅ Embedding-based semantic diversity calculation using cosine similarity
- ✅ Graceful fallback when embeddings unavailable
Problem Solved: Search results were returning multiple similar chunks with redundant information despite high relevance.
Solution: MMR balances relevance vs diversity, ensuring comprehensive non-redundant context. Example: query "How to test cables?" now returns testing procedure + safety requirements + quality standards + documentation + troubleshooting (instead of 5 similar testing chunks).
Result: Significantly enhanced RAG response quality with diverse, comprehensive information coverage.
Purpose: Simplified architecture for testing and improved maintainability
- Current Flow: User Query → Embedding → ChromaDB Search → Single GPT Call → Response
- Service Pattern: Streamlined interface-based DI services (
IEmbeddingService,IDocumentProcessor) - API Structure: Simplified
RagChatRequest/RagChatResponsewith essential parameters - Search Pipeline: Direct
QueryAllCollectionsAsyncwith configurable MMR, sampling, deduplication - Models: Clean request/response objects focused on core functionality
Rationale: Complex multi-stage pipeline was difficult to evaluate for incremental improvements Changes Made:
-
Removed Multi-Stage Services: Eliminated 5 complex AI services for testing clarity
OpenAIQueryExpansionService(Query expansion with synonyms)OpenAIGPTRerankingService(AI-powered result re-ranking)OpenAIQueryDecompositionService(Query complexity analysis and decomposition)OpenAIMultiStageRetrievalService(Parallel sub-question search with gap analysis)OpenAIMultiStageAnswerSynthesisService(Multi-stage result synthesis)
-
Removed Interfaces: Cleaned up unused interface definitions
IQueryExpansionService,IGPTRerankingService,IQueryDecompositionServiceIMultiStageRetrievalService,IMultiStageAnswerSynthesisService
-
Simplified Controller: Replaced complex multi-stage logic with straightforward RAG flow
- Single embedding generation for user query
- Direct ChromaDB vector search with MMR and deduplication
- Single GPT call with retrieved context
- Clean response formatting with source citations
-
Cleaned Request Models: Removed multi-stage parameters from
RagChatRequest- Removed:
EnableQueryExpansion,ExpansionStrategy,EnableGPTReranking - Removed:
EnableMultiStageRetrieval,MaxSubQuestions,EnableIterativeSearch - Kept: Core parameters for similarity, context chunks, temperature, tokens
- Removed:
Purpose: Remove unused code and maintain clean, focused codebase Actions Taken:
-
Removed Unused Methods: Eliminated 4 unused methods from DocumentController
GetApproximateChunkCountAsync(unused collection size estimation)ExtractDocumentDescriptionAsync(unused AI-based description extraction)GetSimpleDocumentDescription(unused hardcoded document pattern matching)CleanupFileName(unused filename normalization)
-
Removed Test Endpoints: Cleaned up legacy testing code
TestEmbeddingsendpoint (leftover embedding testing code)ProcessChunksendpoint (deprecated legacy ingestion method)
-
Removed Caching System: Eliminated unused document inventory caching
- Static cache fields and cache validation logic
GetCachedDocumentInventoryAsyncmethod- Associated cache management overhead
Purpose: Follow AI best practices of letting GPT handle natural language processing Changes Made:
-
Removed Name Extraction Logic: Eliminated hardcoded regex pattern for user names
- Previously: Regex pattern
@"my name is ([a-zA-Z]+)"with manual extraction - Now: Natural GPT-4o conversation context handling through conversation history
- Impact: No functional change - GPT-4o handles name recognition naturally
- Benefit: Cleaner code, more flexible name handling, eliminates hardcoded patterns
- Previously: Regex pattern
-
Dependency Injection Cleanup: Removed unused service registrations from Program.cs
- Cleaned up multi-stage service DI registrations
- Maintained core services: DocumentService, EmbeddingService, ChromaDbClient, etc.
User Query Input
↓
Query Embedding (OpenAI text-embedding-ada-002)
↓
ChromaDB Vector Search (with MMR and deduplication)
↓
Context Assembly (similarity filtering, source tracking)
↓
Single GPT Call (GPT-4o with context and conversation history)
↓
Response with Source Citations
Benefits of Simplified Architecture:
- Easier Testing: Clear cause-and-effect relationship between changes and results
- Improved Performance: Fewer API calls and processing steps
- Better Maintainability: Simpler codebase with focused functionality
- Cost Efficiency: Reduced OpenAI API usage while maintaining quality
- Cleaner Code: Eliminated unused methods and hardcoded logic patterns
- ✅ Functional: All endpoints working correctly after simplification
- ✅ Performance: Reduced API calls improve response times
- ✅ Code Quality: No compilation errors, cleaner codebase structure
- ✅ Maintainability: Easier to understand and modify individual components
When complexity can be justified through measurable improvements:
- Incremental Addition: Add features one at a time with A/B testing
- Performance Monitoring: Measure actual impact of each enhancement
- Cost Analysis: Track OpenAI API usage vs quality improvements
- User Feedback: Gather real-world usage data before architectural changes
Philosophy: Default to simplicity, add complexity only when benefits are clearly demonstrated.
Purpose: Provide complete visibility into RAG pipeline operations for debugging and optimization
Implementation:
- 4 Collapsible Debug Panels in chat interface:
- Pipeline Steps (8 steps): Complete workflow from query to response with timing
- Search Process: ChromaDB parameters, raw results, similarity scores
- GPT Input/Output: System prompt inspection, token usage, response analysis
- Performance Metrics: Step-by-step timing breakdown and duration analysis
Data Collected:
- Step-by-step timing (Query Processing, Query Refinement, Embedding, Vector Search, etc.)
- Search parameters and results (similarity thresholds, MMR settings, raw ChromaDB output)
- Token usage and model information (GPT-4o, text-embedding-ada-002 usage)
- Context assembly details (chunks selected, source documents, filtering results)
Benefits:
- Debugging: Easy identification of performance bottlenecks
- Optimization: Data-driven decision making for parameter tuning
- Transparency: Users can see exactly how the AI processes their queries
- Education: Clear visualization of RAG pipeline workflow
Purpose: Improve vector search effectiveness by enhancing user queries before embedding
Technical Implementation:
- AI-Powered Enhancement: GPT-4o adds 2-3 relevant technical synonyms to user queries
- Conversation Context: Includes conversation history for contextually aware refinement
- Token Optimization: Minimal usage (30 tokens max, temperature 0.1)
- Frontend Toggle: User-controllable enable/disable with visual indicator
Examples:
- "power issues" → "power issues electrical problems voltage failures"
- "navigation errors" → "navigation errors GPS malfunctions positioning failures"
Performance Impact:
- Additional Time: ~400ms per query (when enabled)
- Improved Results: Better semantic matching and document retrieval
- Cost: Minimal token usage per query
Problem Solved: Previous approach parsed all document names for categories (O(n) complexity, slow for 1000+ docs)
Solution Implemented:
- Fast Collection Count: O(1) operation using ChromaDB's
ListCollectionsAsync() - Eliminated Name Parsing: Removed expensive
ExtractCategoryFromDocumentName()method - Lightweight Metadata: Simple format providing essential context to AI
Before vs After:
// BEFORE: Expensive O(n) operation
var categories = allCollections?
.Select(name => ExtractCategoryFromDocumentName(name)) // Parse every name
.Where(cat => !string.IsNullOrEmpty(cat))
.Distinct().Take(5).ToList();
// AFTER: Fast O(1) operation
var documentMetadata = $"You have access to {totalDocCount} technical documents.
Current search found {searchResultCount} relevant chunks.";Performance Results:
- Timing: 2-3ms (down from potential hundreds of milliseconds)
- Scalability: Works efficiently with 10 or 1000+ documents
- Memory: Minimal memory usage regardless of collection size
Problem Solved: AI was incorrectly counting search result chunks instead of using total document metadata
Solution Implemented:
- Enhanced System Prompt: Explicit instructions for inventory questions
- Metadata Integration: AI uses total document count from metadata, not search results
- Query Type Classification: AI distinguishes between inventory vs content questions
Enhanced Prompt Addition:
IMPORTANT: When users ask about document inventory, use the total document count provided above,
not just the chunks returned from search.
Before you answer, think through your reasoning step by step:
1. Identify the question type (general overview vs. specific technical vs. inventory).
2. For inventory questions: Use the total document count from the metadata above.
3. For content questions: List which document chunks or sections are most relevant and why.
Results:
- Before: "I have access to 5 documents" (incorrect, counting search results)
- After: "I have access to 10 technical documents" (correct, using metadata)
Problem Solved: Frontend was hardcoding parameter values instead of reflecting backend defaults
Changes Made:
- Backend Defaults Updated: Temperature to 0.7 (not 0.1), MaxTokens to 4000 (not 1000)
- Frontend Null Parameters: All parameters now pass
nullto use backend defaults - Dynamic Display: Debug panels show actual backend values, not hardcoded frontend values
- Removed Hardcoded Defaults: Eliminated static values from
ChatRequestmodel
Impact:
- Consistency: Frontend always reflects actual backend behavior
- Flexibility: Backend changes automatically propagate to frontend
- Accuracy: Debug information shows true system parameters
Problem Solved: AI responses displayed raw Markdown characters instead of formatted text
Solution Implemented:
- Markdig Integration: Added Markdig NuGet package for Markdown processing
- CSS Optimization: Fixed spacing issues (
white-space: normalinstead ofpre-wrap) - HTML Rendering:
@((MarkupString)Markdown.ToHtml(message.Content)) - Typography: Added CSS rules for proper heading, list, and emphasis formatting
Before vs After:
- Before:
# Heading **bold** *italic*(raw characters displayed) - After: Properly formatted headings, bold text, italic text, lists
Problem Solved: API timeouts were too short for longer AI operations (30 seconds)
Solution Implemented:
- Standardized Timeouts: All OpenAI API calls now use 60-second timeout
- ChromaDB Timeout: Extended ChromaDB operations to 60 seconds
- Consistent Configuration: Unified timeout handling across all services
Services Updated:
OpenAIEmbeddingService: 60-second timeout for chat completion and embedding requestsChromaDbClient: 60-second timeout for vector search operationsChromaDbHttpClient: Network request timeout optimization
Problem Solved: Custom "balanced" vs "fixed" sampling logic was inefficient and unnecessary
Solution Implemented:
- Removed Custom Logic: Eliminated custom sampling strategy code from
chroma_bridge.py - ChromaDB Default: Use ChromaDB's optimized default global search
- Maintained Features: Kept deduplication and MMR for quality results
- Simplified Configuration: Removed unnecessary sampling parameters
Performance Impact:
- Faster Queries: Eliminated custom processing overhead
- Better Scaling: ChromaDB's optimized search handles large document collections efficiently
- Cleaner Code: Removed ~100 lines of custom sampling logic
| Metric | Before (Multi-Stage) | After (Simplified) | After (Optimized) | Total Improvement |
|---|---|---|---|---|
| Total Response Time | 25+ seconds | 15-18 seconds | 3-8 seconds | 17-22s faster |
| API Calls per Query | 6-8 calls | 2 calls | 3 calls (with refinement) | 3-5 fewer calls |
| Metadata Collection | Not implemented | 2,300ms | 2-3ms | 2.3s faster |
| Document Inventory | Incorrect results | Incorrect results | Correct results | Functionality fixed |
| Debug Visibility | None | Basic logging | Complete transparency | Full pipeline visibility |
| Code Complexity | High (multi-stage) | Medium | Low (optimized) | Significantly cleaner |
| Scalability | Limited | Moderate | 1000+ documents | Unlimited scaling |
The RAG system now represents an optimized balance of:
- Intelligence: Pre-retrieval enhancements and improved AI prompting
- Performance: Sub-10-second responses with scalable architecture
- Transparency: Complete pipeline visibility for debugging and optimization
- Maintainability: Clean, focused codebase with eliminated technical debt
- User Experience: Proper formatting, reliable timeouts, and intuitive controls
Problem: RAG chat responses were taking 25+ seconds, causing poor user experience.
Root Cause Investigation:
- Added timing analysis to identify bottlenecks
- Discovery: ChromaDB Python bridge was consuming 70% of response time (12 seconds out of 17-second backend)
- Each request spawned new Python process with startup overhead:
- Python interpreter startup: ~2-3 seconds
- ChromaDB library import: ~2-3 seconds
- Database connection: ~1-2 seconds
- Process cleanup: ~1 second
- Total overhead per request: 6-9 seconds
Implementation:
-
Persistent Python Service (
chroma_http_service.py)- Flask HTTP service running on network location (138.254.160.169:8001)
- ChromaDB client initialized once at startup (no per-request overhead)
- RESTful endpoints:
/query-all,/list,/health
-
Fast HTTP Client (
ChromaDbHttpClient.cs)- Direct HTTP calls to persistent Python service
- Eliminates process spawning completely
- Same interface as legacy Python bridge for seamless migration
-
Network Deployment:
- ChromaDB server: Network drive Z:\chromaDB (port 8000)
- Python HTTP service: Network drive Z:\chroma-service (port 8001)
- Updated configuration to use network IP addresses
| Metric | Before (Python Bridge) | After (Network HTTP) | After Optimizations | Total Improvement |
|---|---|---|---|---|
| Document Inventory | 2,300ms (2.3s) | 2,300ms (2.3s) | 0ms (cached) ✅ | 2.3s saved |
| ChromaDB Query | 12,000ms (12.0s) | 8,770ms (8.8s) | ~7,000ms (fewer chunks) ✅ | 5.0s saved |
| Total Backend | 17,000ms (17.0s) | 15,600ms (15.6s) | ~11,300ms ✅ | 5.7s saved |
| Total Response | 25,000ms (25.0s) | 15,800ms (15.8s) | ~10,500ms ✅ | 14.5s saved |
| Estimated New Total | - | - | ~10.5 seconds | 58% faster |
Current Performance (Network HTTP Service):
- Document Inventory: 2,281ms (2.3s) - Network collection listing
- Query Expansion: 590ms (0.6s) - OpenAI API
- Query Embedding: 380ms (0.4s) - OpenAI API
- ChromaDB Vector Search: 8,770ms (8.8s) - Network HTTP service
- GPT Re-ranking: 1,716ms (1.7s) - OpenAI API
- RAG Response: 1,842ms (1.8s) - OpenAI API
Total: ~15.8 seconds (down from 25+ seconds)
Short-term optimizations (2-5 second savings):
- Document Inventory Caching: Cache collection list for 30 minutes (save 2.3s)
- ChromaDB Query Optimization: Tune search parameters and indexing (save 2-3s)
- Parallel Processing: Run re-ranking and embedding in parallel where possible (save 1-2s)
Medium-term optimizations (5-10 second savings):
- Response Streaming: Stream responses as they're generated instead of waiting for completion
- Smart Caching: Cache query embeddings and search results for repeated queries
- Native .NET ChromaDB Client: Eliminate Python dependency entirely
Production-scale optimizations:
- Dedicated ChromaDB Server: Move to high-performance server for 1000+ users
- Azure AI Search Migration: Use managed vector database service for enterprise scale
- CDN & Edge Caching: Distribute responses geographically for global users
Target Performance Goals:
- Current: 15.8 seconds
- Short-term target: 8-10 seconds
- Production target: 3-5 seconds
- Enterprise target: 1-2 seconds with streaming
The system implements a streamlined, high-performance RAG pipeline with intelligent pre-retrieval enhancements, comprehensive debugging transparency, and scalable metadata collection.
Input: User query text with conversation history Process: Input validation and query preparation Performance: <1ms (immediate processing)
Purpose: Optimize vector search effectiveness by enhancing user queries Process:
- AI-Powered Refinement: GPT-4o analyzes user query and adds 2-3 relevant technical synonyms
- Configurable: Frontend toggle enables/disables query refinement (default: enabled)
- Context-Aware: Includes conversation history for better refinement decisions
- Efficient: Minimal token usage (30 tokens max, temperature 0.1) Example: "power issues" → "power issues electrical problems voltage failures" Performance: ~400ms additional processing time Benefits: More precise document retrieval, better semantic matching
Process:
- Generate embedding using Azure OpenAI text-embedding-ada-002
- Uses refined query (if enabled) instead of original query for better vector search
- 1536-dimensional vector representation Performance: ~120ms average response time
Process:
- ChromaDB Search: HTTP POST to
/query-allendpoint - MMR Application: Maximal Marginal Relevance for diversity (λ=0.5)
- Deduplication: Text similarity filtering (threshold=0.7)
- Global Search: Efficient search across all document collections Parameters:
- Similarity threshold: 0.4 (40% minimum cosine similarity)
- Max context chunks: 5 (configurable, optimized for hundreds of documents)
- Default MMR enabled for relevance-diversity balance
Purpose: Provide AI with document inventory context without performance penalty Process:
- Fast Collection Count: O(1) operation to get total document count from ChromaDB
- No Name Parsing: Eliminated expensive document name analysis (scales to 1000+ docs)
- Lightweight Metadata: Simple format - "You have access to X technical documents. Current search found Y relevant chunks." Performance: ~3ms (down from potential hundreds of milliseconds for large collections) Scalability: Efficient for any number of documents (10 to 1000+)
Process:
- Filter results by similarity threshold
- Extract metadata (source documents, chunk IDs)
- Format context with source citations
- Apply conversation history if provided
Process:
- Enhanced System Prompt: Explicit instructions for inventory vs content questions
- Metadata-Aware: "IMPORTANT: When users ask about document inventory, use the total document count provided above, not just the chunks returned from search."
- Context Integration: Retrieved chunks with source attribution + document metadata
- Single API Call: GPT-4o generates final response with improved intelligence
- Citation Format: [Source: DocumentName] style references Performance: ~650ms average response time (optimized)
- Total Response Time: ~3-8 seconds (optimized from 25+ seconds)
- API Calls per Query: 3 calls with refinement (2 calls without)
- Cost Efficiency: Significant reduction in OpenAI API usage with targeted token usage
- Reliability: Simplified error handling and comprehensive debugging
- Scalability: Optimized for 1000+ documents without performance degradation
User Query → [Query Refinement] → Embedding Generation → ChromaDB Search (MMR) →
Metadata Collection → Context Assembly → Intelligent GPT Response → User
- ✅ Multi-Document Search: Efficient search across all document collections
- ✅ Pre-Retrieval Optimization: AI-powered query enhancement for better results
- ✅ Source Attribution: Clear citation of source documents
- ✅ Conversation History: Maintains context across chat turns
- ✅ Configurable Parameters: Query refinement toggle, similarity thresholds, context limits, temperature
- ✅ Error Handling: Graceful fallbacks for API failures
- ✅ Comprehensive Debug Transparency: 8-step pipeline visibility with timing metrics
- ✅ Intelligent Document Inventory: AI correctly handles "how many documents" questions
- ✅ Scalable Metadata: Efficient collection info for any document volume
- ✅ Frontend Parameter Sync: Dynamic display of backend defaults (no hardcoding)
- Phase 3: Document routing for scaling to hundreds of documents (low priority, later)
- Implement document upload functionality (low priority, later)
- Add authentication and user management (low priority, later)
- Performance optimizations and caching (low priority, later)
The entire AI pipeline codebase has been comprehensively documented using a beginner-friendly commenting style designed to make RAG concepts accessible to developers at all levels.
Comment Structure:
// ================================
// MAJOR SECTION: DESCRIPTIVE PURPOSE
// ================================
// Plain English explanation of what this section does and why it matters
// Uses analogies and real-world examples for complex AI concepts
var result = SomeFunction(); // Inline explanation of specific operationsEducational Approach:
- For Beginners: Explains AI concepts like embeddings ("converting text to searchable numbers"), vector databases ("Google for your documents"), and RAG workflow
- For Developers: Documents exact data flow, API interactions, error handling, and performance considerations
- For Business: Clarifies user experience, citation functionality, and smart vs regular chat modes
Files Documented:
Backend/Controllers/DocumentController.cs- Complete RAG workflow (8 major steps)Backend/Services/OpenAIEmbeddingService.cs- AI chat and embedding generationBackend/Services/ChromaDbClient.cs- Vector database search operationsFrontend/Services/DocumentService.cs- Frontend-backend RAG communicationFrontend/Components/Pages/Home.razor- UI mode switching and response displayFrontend/Models/ChatModels.cs- RAG data structures and response models
This documentation style transforms complex AI pipeline code into an educational resource that explains both the "what" and "why" of every major component.
- Developed on Windows 10 with OneDrive Business paths
- .NET 9.0
- Python 3.x with ChromaDB client
- Azure OpenAI API integration
Last Updated: December 2024 - Optimized RAG Architecture with Enhanced Intelligence and Debug Transparency