Chunker v2.1.33 - Enterprise RAG-Powered File Processing System

For AI/Claude project context and workflow guidance, see Claude.md.

What this repo does

Four concerns, now mapped to named sub-packages so the scope boundaries are legible without relocating files:

| Purpose | Sub-package | Key modules |
| --- | --- | --- |
| 1. Ingest AI chat logs + chunk for archival | ingest/ | watcher_splitter.py, file_processors.py, metadata_enrichment.py |
| 2. Feed prior conversation context back to AI | search/ | ask.py, kb_ask_ollama.py, rag_integration.py |
| 3. AI-interactive file search | search/ | gui_app.py (Streamlit), api_server.py (FastAPI), rag_search.py |
| 4. Knowledge base management | kb/ | backfill_knowledge_base.py, chromadb_crud.py, deduplication.py, backup_manager.py |

Shared plumbing (SQLite pool, monitoring, watchdog debounce, job integrity, query cache) lives in infra/. The optional RAG evaluation harness lives in evaluation/ (install with pip install -r requirements-eval.txt).

The sub-package __init__.py files re-export the existing root modules; this is a soft boundary layer, not a rewrite. Existing scripts continue to work.
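
For illustration, a sub-package __init__.py under this scheme might look like the sketch below; apart from module names that appear elsewhere in this README, the re-exported names are assumptions.

# ingest/__init__.py - an illustrative soft boundary layer, not the actual file
import watcher_splitter                      # re-export the root module wholesale
import file_processors
from metadata_enrichment import enrich_metadata

__all__ = ["watcher_splitter", "file_processors", "enrich_metadata"]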

Installing

pip install -r requirements.txt          # minimal core (watcher + tests)
pip install -r requirements-app.txt      # full RAG stack + search UIs
pip install -r requirements-eval.txt     # optional RAG evaluation harness
pip install -r requirements-dev.txt      # linters + test extras

Running the search surfaces

streamlit run gui_app.py           # browser UI
uvicorn api_server:app --reload    # REST API (docs at /docs)
python ask.py "your query"         # CLI

KB maintenance tools

Tag upkeep:

  • tools/audit_tags.py — audit ChromaDB tag metadata against the canonical taxonomy.
  • tools/retag_chromadb.py — re-run metadata_enrichment.enrich_metadata and upsert tags (dry-run by default; pass --apply).

KB cleanup — directory-level (primary pipeline; one conversation = one directory):

  1. tools/kb_inventory.py --root "%OneDriveCommercial%\KB_Shared\04_output" --skip-hash — walk the KB, emit file-level CSV including a conversation_dir column.
  2. tools/kb_directory_staleness.py — group by base project slug across conversation_dirs, flag older duplicates. Marks safe_archive when every file hash in the older dir appears in the newer one (see the sketch after this list); otherwise review.
  3. tools/kb_archive_dirs.py --root <root> — dry-run first; add --apply to move whole directories to <root>/_archive/YYYY-MM-DD-dir-cleanup/ and delete their chunk ids from ChromaDB. Rollback manifest written next to the archive.
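
The safe_archive decision in step 2 reduces to a subset test over content hashes; a minimal sketch of that rule:

def classify_older_dir(older_hashes, newer_hashes):
    # safe_archive: every file hash in the older conversation_dir also
    # appears in the newer one, so nothing unique would be lost.
    return "safe_archive" if set(older_hashes) <= set(newer_hashes) else "review"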

KB cleanup — file-level (edge cases where dirs don't apply):

  • tools/kb_staleness_report.py and tools/kb_archive.py — same pipeline but at file granularity. Use when conversation_dirs are flat.

Production KB lives at %OneDriveCommercial%\KB_Shared\04_output (see config.json). Sidecar JSON generation is now disabled by default ("enable_json_sidecar": false); ChromaDB metadata is the source of truth.

What changed in v2.1.33

  • Claude Code chunk-chat Skill: New skill at .claude/skills/chunk-chat/ replicates the core chunker pipeline directly inside a Claude Code session, eliminating the manual workflow of exporting chat logs, staging them in 02_data/, and copying output back.
    • .claude/scripts/chat_chunker.py: Standalone Python chunker (stdlib only, zero external dependencies) with sentence-based splitting, overlap, tag detection, key term extraction, and full output artifact generation (chunks, transcript, sidecar JSON, origin manifest).
    • .claude/skills/chunk-chat/SKILL.md: Skill definition that captures conversation context and runs the chunker inline. Invoke with /chunk-chat or say "chunk this chat".
    • Output lands in ./chunked_chat/ in the current working directory -- no staging or copying required.

What changed in v2.1.32

See CHANGELOG.md for full details.

  • Logging layout: All watcher-related logs under logs\ (including Task Scheduler / silent start). Use scripts\Archive-ChunkerLogs.ps1 to move legacy 05_logs\ or rotated watcher_archive_*.log into logs\archive\.
  • KB-PathSafetyReport.ps1: Fixed watcher PID check (do not assign to $pid).

What changed in v2.1.31

See CHANGELOG.md for full details.

  • SQLite logging: chunker_db.log_processing() commits processing_history before updating department_stats, avoiding nested-connection lock contention that could delay or hang manual_process_files.py after a successful job.
  • Documentation: Version sync across core docs; clarified manual_process_files.py --auto (scans watch_folder from config.json); troubleshooting for "Department stats update locked" / apparent manual-run hangs.

What changed in v2.1.30

See CHANGELOG.md for full details.

  • Documentation: Watcher startup docs updated for Start-Watcher.ps1 (KB_Shared detection from config.json + OneDrive env fallbacks, python / py -3 launcher), deduplication/ChromaDB import resilience notes, corrected log paths in quick reference, troubleshooting addendum in 07_docs/File_Processing_Investigation_Report.md.

What changed in v2.1.29

See CHANGELOG.md for full details.

  • Documentation Updates: Aligned all project documentation with current code and configuration. Fixed broken doc references in Claude.md, updated supported extensions and query_cache config in README, corrected config settings in SUMMARY.

What changed in v2.1.28

See CHANGELOG.md for full details.

  • Incremental Updates Skip Logic: Unchanged files are now skipped when incremental updates are enabled. VersionTracker.has_changed() is consulted before reprocessing; matching content hash returns early.
  • Test Suite Fixes: All 76 tests pass. Fixed test_incremental_updates_skip_reprocessing (recreate file after archiving) and job integrity fixtures (artifact sizes ≥ 50 bytes).

What changed in v2.1.27

See CHANGELOG.md for full details.

  • Pydantic-Core Compatibility Fix: Resolved watcher startup crash caused by pydantic-core 2.42.0 incompatibility with pydantic 2.12.x. ChromaDB/deduplication requires pydantic-core==2.41.5. Fix: pip install pydantic-core==2.41.5
  • Documentation Updates: Corrected log paths across project docs (logs/ for watcher.log and watcher_start.log; 05_logs/ for silent start). Updated troubleshooting for stale watcher PID, pydantic crashes, and file processing issues.
  • KB-Health Script: Health check reports Watcher status, OneDrive paths, ChromaDB presence, and directory accessibility. Output/archive paths read from config.json.

What changed in v2.1.26

See CHANGELOG.md for full details.

  • Job Integrity Validation System: Comprehensive post-processing validation prevents incomplete outputs from being archived as successful jobs
    • Validates all required artifacts (chunks, transcript, sidecar, manifest) with size thresholds
    • File stability checks prevent OneDrive sync race conditions (2s window, 30s timeout)
    • Intelligent retry logic for timing-related failures (15s delay, max 1 retry)
    • Detailed failure reports with full diagnostics in {job_id}.integrity_fail.log
    • Artifact quarantine moves failed job files to failed/ folder
    • Multi-channel notifications (Windows toast, log markers, console)
    • 19 unit tests (all passing), comprehensive documentation
    • 100% backward compatible - can be disabled via config
  • Enhanced Logging: Start-Watcher.ps1 now shows startup banner with all monitored paths
  • ChromaDB Metadata Fix: Fixed validation error by converting lists to strings and handling None values

What changed in v2.1.25

See CHANGELOG.md for full details.

  • Orphaned Manifest File Fix: Fixed bug where .origin.json manifest files were left in 02_data after processing. Both archive and quarantine functions now properly move manifest files.
  • Cleanup Script: Added Cleanup-Orphaned-Manifests.ps1 to detect and archive orphaned manifests with dry-run and verbose modes.

What changed in v2.1.24

See CHANGELOG.md for full details.

  • KB Consolidation Complete: Backfilled 8,136 chunks to ChromaDB knowledge base (8,142 total). Full semantic search now operational with verified query results.
  • Pydantic 2.x Compatibility Fix: Fixed rag_integration.py to work with pydantic 2.12+ by manually injecting private attributes into ChromaDB Collection objects.
  • Output Consolidation Script: Added Move-Unique-Local-Output-To-OneDrive.ps1 for consolidating local output folders to OneDrive KB_Shared.
  • Documentation Consolidation: Archived duplicate docs from 07_docs/ to 07_docs/archived_20260130/. Root directory now has single canonical versions of all key documentation files.
  • Config.json Enhanced: Merged additional settings from 07_docs config - now supports 15 file types (added pdf, docx, xlsx, yaml, etc.), auto KB insertion, Ollama embedding config, and performance tuning options.
  • Completion Report: Added 07_docs/KB_CONSOLIDATION_COMPLETION_REPORT.md and 07_docs/DOC_CONSOLIDATION_REPORT_20260130.md documenting consolidation processes.

What changed in v2.1.23

See CHANGELOG.md for full details.

  • ChromaDB backfill and verify: Pydantic 2.12+ and ChromaDB 0.3.x support. Dummy-embedding fallback when sentence_transformers unavailable. Backfill and verify scripts updated.
  • Chunk ID sanitization: Backfill sanitizes chunk IDs for ChromaDB/DuckDB so quotes and special chars do not break SQL.
  • Ollama RAG: Ollama integration tolerates missing deps with clear install message. Scripts to pull nomic-embed-text and run backfill with real embeddings (Python 3.11 venv).
  • Department migration: 20+ departments and priority-based detection from laptop source. Department as fallback for archive (flat 03_archive when default).
  • ChromaDB guide: Optional real embeddings, RAG search (Ollama + FAISS), chunk ID sanitization, Ollama on Windows.

What changed in v2.1.22

See CHANGELOG.md for full details.

  • Unicode Encoding Fix: Fixed Unicode encoding errors in manual_process_files.py when running on Windows console
    • Replaced Unicode checkmark/cross characters with ASCII-safe alternatives
    • Added UTF-8 encoding handling for Windows console output
    • Fixed file existence check to prevent errors during batch processing

What changed in v2.1.21

See CHANGELOG.md for full details.

  • File Processing Investigation Report: Added comprehensive troubleshooting guide for unprocessed files
    • Root cause analysis for files not being processed by the chunker
    • Configuration verification procedures
    • Manual processing solutions using manual_process_files.py
    • Watcher status checking and restart procedures
    • Documentation at 07_docs/File_Processing_Investigation_Report.md

What changed in v2.1.20

See CHANGELOG.md for full details.

  • OneDrive SYNC Folder Complete Removal: Successfully removed SYNC folder from both local and cloud
    • Created Move-SYNC-To-Temp-And-Delete.ps1 - Moves SYNC contents to C:\TEMP and deletes empty SYNC folder
    • Successfully moved 35 items (41.85 MB) to backup and deleted SYNC folder locally
    • User deleted SYNC folder from web interface, triggering OneDrive deletion sync
    • OneDrive processing 178,824 file deletions (30-60 minutes estimated)
    • Additional troubleshooting scripts for stuck sync operations and folder blocking issues

What changed in v2.1.19

See CHANGELOG.md for full details.

  • OneDrive SYNC Directory Removal: Added automated scripts to remove problematic SYNC directories causing recurring "path is too long" sync errors
    • Safely removes 4 directories (565,962 files, ~165 GB) from OneDrive sync
    • Uses robocopy for better OneDrive cloud file handling, then deletes directories
    • Files moved to backup location initially, then deleted after verification
    • Complete documentation and troubleshooting guides included
    • Connection issue resolution scripts and monitoring tools added
    • Cloud sync status checking and laptop sync guidance provided

What changed in v2.1.18

See CHANGELOG.md for full details.

  • OneDrive Desktop Auto-Repair: Added comprehensive PowerShell scripts to detect and fix Desktop misalignment between Windows and OneDrive, resolving "dual desktop" sync issues
  • OneDrive Desktop Post-Verification Monitor: Added continuous monitoring script (OneDrive_Desktop_PostVerify_Monitor.ps1) that checks every 30 seconds until all Desktop alignment checks pass, with color-coded dashboard, sync status analysis, status histogram, and auto-exit on success
  • Desktop Path Detection Fix: Fixed false negative in monitor script by using Windows Known Folder API to correctly detect OneDrive-redirected Desktop paths
  • Sync Status Improvements: Enhanced sync status reporting to properly identify folders vs files and provide status breakdown histogram

What changed in v2.1.17

See CHANGELOG.md for full details.

  • Fixed 85+ recursive paths causing OneDrive sync failures, including SB_160116 hall of mirrors issue

Version 2.1.19 - OneDrive SYNC directory removal scripts to fix recurring path length sync errors, with robocopy-based file handling and comprehensive documentation.

Version 2.1.18 - OneDrive Desktop Auto-Repair scripts with continuous monitoring, Known Folder API path detection fix, and enhanced sync status reporting.

Version 2.1.17 - ChromaDB rebuilt with compatibility fixes, streamlined release automation, refreshed documentation, comprehensive regression coverage, plus watcher stability and SQLite hardening.

What's New in v2.1.8+

Recent Improvements (Post-v2.1.8)

  • Tiny File Archiving: Files under 100 bytes are automatically parked in 03_archive/skipped_files/ with their manifests to eliminate endless "too small" retries.
  • Manifest & Hash Safety: Watcher now skips any file containing .origin.json in its name and recomputes content hashes when the manifest is missing a checksum so incremental tracking remains intact.
  • Chunk Writer Hardening: Consolidated write_chunk_files() helper creates the directory once, writes UTF-8 chunks with defensive logging, and copy_manifest_sidecar() guarantees parent folders exist before copying manifests.
  • Parallel Queue Handling: Added optional multiprocessing.Pool batches for queues ≥32 files (config flag), plus automatic pruning of the processed_files set to prevent long-running watcher stalls.
  • Tokenizer & Metrics Optimizations: Sentence tokenization is LRU-cached, system metrics run on a background executor, and notification bursts are throttled with a 60-second rate limiter per alert key (both patterns are sketched after this list).
  • SQLite Resilience: Centralized _conn() helper sets 60 s timeouts, log_error() now understands both legacy signatures and retries lock errors, and run_integrity_check() validates the DB at startup.
  • Test Coverage & Pytest Guardrails: Root conftest.py skips bulky 99_doc/legacy suites and tests/test_db.py smoke-tests the new retry path to ensure future regressions fail fast.
  • Database Lock Monitoring: MONITOR_DB_LOCKS.md documents command-line checks, baseline metrics (1.5 errors/min), and alert thresholds (3 errors/min = 2x baseline).
  • Watcher Bridge Support: watcher_splitter.py understands .part staging files, waits for optional .ready signals, retries processing up to three times, and quarantines stubborn failures to 03_archive/failed/.
  • Batched Chroma Ingest: ChromaRAG.add_chunks_bulk() honours batch.size, skips null embeddings, and refreshes hnsw:search_ef from config.json so the vector store keeps pace with high-volume ingest.
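
A hedged sketch of the two throttling patterns from the tokenizer/notification item above; the naive sentence split and all names are illustrative only:

import time
from functools import lru_cache

@lru_cache(maxsize=4096)
def tokenize_sentences(text: str) -> tuple:
    # Cache sentence splits so repeated chunks skip re-tokenization.
    return tuple(s.strip() for s in text.split(".") if s.strip())

_last_alert = {}

def should_notify(alert_key: str, window_s: float = 60.0) -> bool:
    # Allow at most one notification per alert key per 60-second window.
    now = time.monotonic()
    if now - _last_alert.get(alert_key, 0.0) >= window_s:
        _last_alert[alert_key] = now
        return True
    return False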

v2.1.8 Release (2025-11-07)

  • ChromaDB Rebuild: Upgraded to chromadb 1.3.4, recreated the collection, and re-ran the backfill so 2,907 enriched chunks are in sync with the latest pipeline.
  • Dedup Reliability: deduplication.py now ships with hnswlib compatibility shims, letting python deduplication.py --auto-remove complete without legacy metadata errors.
  • Release Helper: scripts/release_commit_and_tag.bat automates doc staging, backups, commit/tag creation, and pushes while rotating logs; the 2025-11-07 dry run and live validation are logged in docs/RELEASE_WORKFLOW.md.
  • Regression Tests: Replaced placeholder suites with 52-case pytest coverage for query caching, incremental updates, backup management, and monitoring to mirror the production APIs.
  • Watcher & DB Resilience (Nov 2025): Skips manifests/archives/output files, sanitises output folder names, replaces Unicode logging arrows, adds safe archive moves, and introduces exponential-backoff SQLite retries to squash recursion, path-length, and "database locked" errors.

What changed in v2.1.8? See the [changelog entry](./CHANGELOG.md#v2117---2025-11-07---chromadb-rebuild--release-automation).

🚀 What's New in v2.1.6

🚀 RAG Backfill Optimization

  • Multiprocessing: Parallel file processing and ChromaDB inserts with 4-8 workers (see the sketch after this list)
  • Performance: 20x faster backfill (100-200 chunks/second vs 5 chunks/second)
  • Batch Optimization: Optimized batch sizes (500-1000 chunks) for ChromaDB efficiency
  • HNSW Tuning: Proper vector index configuration (M=32, ef_construction=512, ef_search=200)
  • CPU Monitoring: Real-time CPU tracking with saturation alerts
  • Duplicate Detection: Pre-insertion verification prevents duplicate chunks
  • Verification Tools: Comprehensive scripts to verify backfill completeness
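
A rough sketch of the worker/batch pattern described above; the embedding step and the ChromaDB insert are placeholders, not the backfill script's actual API:

import multiprocessing as mp

def embed_batch(chunks):
    # Placeholder: embed one batch of chunk texts in a worker process.
    return [(c, [0.0]) for c in chunks]

def backfill(all_chunks, batch_size=500, workers=6):
    batches = [all_chunks[i:i + batch_size]
               for i in range(0, len(all_chunks), batch_size)]
    with mp.Pool(processes=workers) as pool:
        for embedded in pool.imap_unordered(embed_batch, batches):
            pass  # insert `embedded` into ChromaDB here, one batch per call

if __name__ == "__main__":
    backfill([f"chunk {i}" for i in range(2000)])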

📊 Verification & Validation

  • Empty Folder Logging: Identifies folders without chunk files
  • Count Discrepancy Alerts: Warns when expected vs actual counts differ
  • Chunk Completeness Verification: Validates all chunks from all folders are in KB
  • Performance Metrics: Detailed throughput, memory, and CPU statistics

🚀 What's New in v2.1.5

📦 Move-Based Workflow (Grok Recommendations)

  • ⚡ Storage Optimization: Reduced storage overhead by 50-60% via MOVE operations instead of COPY
  • 🔗 OneDrive Sync Elimination: 100% reduction in sync overhead by moving files out of OneDrive
  • 📋 Manifest Tracking: Complete origin tracking with .origin.json files (see the sketch after this list)
  • 🔄 Enhanced Archive: MOVE with 3 retry attempts and graceful fallback to COPY
  • 🎯 Department as fallback: Archive uses department subfolders only when department is explicit (path or enrichment); default department files go to flat 03_archive/
  • 🔁 Smart Retry Logic: Handles Windows permission issues with automatic retries

🚀 What's New in v2.1.2

🚨 Critical Performance Fixes

  • ⚡ Processing Loop Resolution: Fixed infinite loops that caused system hangs
  • 📁 Smart File Archiving: Failed files automatically moved to organized archive folders
  • 🔒 Database Stability: Eliminated "database is locked" errors with batch operations
  • ⚡ 8-12x Speed Improvement: Dynamic parallel workers and optimized processing

🚀 Performance Enhancements

  • πŸ” Advanced RAG System: Ollama + FAISS for local embeddings and semantic search
  • πŸ“Š Comprehensive Evaluation: Precision@K, Recall@K, MRR, ROUGE, BLEU, Faithfulness scoring
  • πŸ”— LangSmith Integration: Tracing, evaluation, and feedback collection
  • ⚑ Real-time Monitoring: Watchdog-based file system monitoring with debouncing
  • πŸ€– Hybrid Search: Combines semantic similarity with keyword matching
  • πŸ“ˆ Automated Evaluation: Scheduled testing with regression detection
  • πŸ›‘οΈ Production Ready: Graceful degradation, error handling, and monitoring
  • πŸ“‚ Source Folder Copying: Configurable copying of processed files back to source locations

Directory Structure

  • C:/_chunker - Main project directory with scripts
  • .claude/ - Claude Code skill and scripts (chunk-chat skill, chat_chunker.py)
  • 02_data/ - Input files to be processed (watch folder)
  • 03_archive/ - Archived original files (in OneDrive KB_Shared when configured)
  • 03_archive/skipped_files/ - Files too small to process (< 100 bytes) - automatically archived
  • 04_output/ - Generated chunks and transcripts (in OneDrive KB_Shared when configured)
  • logs/ - All runtime logs (watcher.log, watcher_start.log, watcher_start_silent.log, manual_process.log); older material under logs/archive/ after scripts\Archive-ChunkerLogs.ps1
  • 06_config/ - Configuration files
  • 99_doc/legacy/ - Consolidated legacy docs (latest snapshot per project)
  • 06_config/legacy/ - Consolidated legacy config (latest snapshot per project)
  • logs/archive/ - Migrated or rotated logs (optional; created by archive script)
  • 03_archive/legacy/ - Consolidated legacy db/backups (latest snapshot per project)
  • chroma_db/ - ChromaDB vector database storage
  • faiss_index/ - FAISS vector database storage
  • evaluations/ - RAG evaluation results
  • reports/ - Automated evaluation reports

🚀 Quick Start

Basic Usage (Core Chunking)

  1. Place files to process in 02_data/ folder
  2. Run the watcher: python watcher_splitter.py
  3. Check 04_output/ for processed chunks and transcripts
  4. Original files are moved to 03_archive/ after processing

Advanced Usage (RAG-Enabled)

  1. Install RAG dependencies: python install_rag_dependencies.py
  2. Install Ollama and pull model: ollama pull nomic-embed-text
  3. Enable RAG in config.json: Set "rag_enabled": true
  4. Run the watcher: python watcher_splitter.py
  5. Search knowledge base: python rag_search.py

Advanced Usage (Celery-Enabled)

For high-volume processing and advanced task management:

  1. Install Celery Dependencies:

    pip install celery redis flower
  2. Start Redis Server:

    # Windows: Download from https://github.com/microsoftarchive/redis/releases
    redis-server
    
    # Linux: sudo apt-get install redis-server
    # macOS: brew install redis
  3. Start Celery Services:

    # Option A: Use orchestrator (recommended)
    python orchestrator.py
    
    # Option B: Start manually
    celery -A celery_tasks worker --loglevel=info --concurrency=4
    celery -A celery_tasks beat --loglevel=info
    celery -A celery_tasks flower --port=5555
    python enhanced_watchdog.py
  4. Monitor Tasks:

    • Flower Dashboard: http://localhost:5555 (with authentication)
    • Celery CLI: celery -A celery_tasks inspect active
    • Logs: Check logs/watcher.log
  5. Security & Priority Features:

    • Flower Authentication: Default credentials logged on startup
    • Priority Queues: High-priority processing for legal/police files
    • Redis Fallback: Automatic fallback to direct processing if Redis fails
    • Task Timeouts: 300s hard limit with graceful handling
  6. Configuration:

    {
      "celery_enabled": true,
      "celery_broker": "redis://localhost:6379/0",
      "celery_task_time_limit": 300,
      "celery_worker_concurrency": 4,
      "priority_departments": ["legal", "police"]
    }
  7. Environment Variables (Optional):

    export FLOWER_USERNAME="your_username"
    export FLOWER_PASSWORD="your_secure_password"

βš™οΈ Feature Toggles & Setup

All new subsystems ship disabled by default so existing deployments behave exactly as before. Enable individual features by updating config.json in the project root.

Metadata Enrichment (metadata_enrichment)

  • Adds semantic tags, key terms, summaries, and source metadata to chunk sidecars and manifests.
  • Output schema documented in docs/METADATA_SCHEMA.md.
  • Enable with:
    "metadata_enrichment": { "enabled": true }

Monitoring & Health Checks (monitoring)

  • Background thread performs disk, throughput, and ChromaDB checks and escalates via the notification system.
  • Configure thresholds in config.json under the monitoring section; default recipients come from notification_system.py.
  • Start by setting "monitoring": { "enabled": true }.

Database Lock Monitoring

  • Current Performance: 1.5 database lock errors/minute baseline (68% reduction from previous)
  • Monitoring Documentation: See MONITOR_DB_LOCKS.md for comprehensive monitoring commands and alert thresholds
  • Real-time Monitoring:
    # Watch for lock errors in real-time
    powershell -Command "Get-Content watcher_live.log -Wait | Select-String -Pattern 'Failed to log|database is locked'"
    
    # Check hourly error count
    powershell -Command "(Get-Content watcher_live.log | Select-String -Pattern 'Failed to log processing' | Select-Object -Last 100 | Measure-Object).Count"
  • Alert Threshold: Flag if errors exceed 3/minute (2x current baseline)
  • Review Schedule: Monitor every 8-12 hours using commands in MONITOR_DB_LOCKS.md
  • Key Findings:
    • 92% of lock errors occur in log_processing() (lacks retry wrapper)
    • 8% in _update_department_stats() (has 5-retry exponential backoff)
    • Current retry config: get_connection (3 retries), dept_stats (5 retries with 1.5x backoff); a generic sketch of this backoff pattern follows this list
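
A minimal sketch of that exponential-backoff wrapper using the parameters above; the real helpers in chunker_db differ in detail:

import sqlite3
import time

def with_retries(op, retries=5, base_delay=0.5, factor=1.5):
    # Re-run `op` when SQLite reports a locked database, backing off
    # 1.5x per attempt; re-raise anything else or the final failure.
    delay = base_delay
    for attempt in range(retries):
        try:
            return op()
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= factor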

Tiny File Handling

  • Automatic Archiving: Files under 100 bytes (default min_file_size_bytes) are automatically moved to 03_archive/skipped_files/
  • Examples: Empty files, "No measures found" messages, test placeholders
  • Behavior: Files are preserved with their .origin.json manifests for review rather than deleted or left to trigger repeated warnings
  • Configuration: Adjust threshold in department config via min_file_size_bytes parameter (default: 100)
  • Logs: Look for [INFO] File too short (X chars), archiving: filename messages

Deduplication (deduplication)

  • Prevents duplicate chunks from entering ChromaDB in both watcher and backfill flows.
  • Optionally run cleanup via python deduplication.py --auto-remove.
  • Already present in config.json; flip "enabled": true to activate.

Query Cache (query_cache)

  • Enables an in-memory LRU + TTL cache in rag_integration.py so repeat queries avoid hitting ChromaDB (behaviour sketched after the config example below).
  • Configure ttl_seconds, max_entries under the query_cache section.
  • API users can inspect runtime metrics via GET /api/cache/stats once the cache is enabled.
  • Example from config.json:
    "query_cache": {
      "enabled": true,
      "ttl_seconds": 600,
      "max_entries": 512
    }
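
Behaviourally, the cache is an ordered map with per-entry timestamps; a minimal sketch of the LRU + TTL semantics (not the actual rag_integration code):

import time
from collections import OrderedDict

class QueryCache:
    def __init__(self, ttl_seconds=600, max_entries=512):
        self.ttl, self.max = ttl_seconds, max_entries
        self._data = OrderedDict()  # query -> (timestamp, result)

    def get(self, query):
        item = self._data.get(query)
        if item is None or time.monotonic() - item[0] > self.ttl:
            self._data.pop(query, None)   # missing or expired
            return None
        self._data.move_to_end(query)     # mark as recently used
        return item[1]

    def put(self, query, result):
        self._data[query] = (time.monotonic(), result)
        self._data.move_to_end(query)
        if len(self._data) > self.max:
            self._data.popitem(last=False)  # evict least recently used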

Incremental Updates (incremental_updates)

  • Uses a shared VersionTracker to hash inputs, skip untouched files, remove old chunk IDs, and persist deterministic chunk identifiers.
  • Tracker state defaults to 06_config/file_versions.json (override with version_file).
  • Typical configuration:
    "incremental_updates": {
      "enabled": true,
      "version_file": "06_config/file_versions.json",
      "hash_algorithm": "sha256"
    }
  • After enabling, unchanged files are skipped by the watcher/backfill, while reprocessed sources clean up stale artifacts before writing new chunks; the sketch below shows the underlying hash comparison.
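
A hedged sketch of that skip decision; only has_changed() is named elsewhere in this README, the rest of the interface is assumed:

import hashlib
import json
from pathlib import Path

class VersionTracker:
    def __init__(self, version_file="06_config/file_versions.json"):
        self.path = Path(version_file)
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def has_changed(self, file_path: str) -> bool:
        digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
        if self.state.get(file_path) == digest:
            return False            # unchanged: caller skips reprocessing
        self.state[file_path] = digest
        self.path.write_text(json.dumps(self.state, indent=2))
        return True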

Backup Manager (backup)

  • Creates compressed archives of ChromaDB and critical directories on a schedule.
  • Configure destination, retention, and schedule in the backup section.
  • Manual run: python backup_manager.py --config config.json create --label on-demand.

After toggling features, restart the watcher (python watcher_splitter.py) so runtime components reinitialize with the new configuration.

✨ Features

Core Chunking

  • Organized output by source file name with timestamp prefixes
  • Multi-file type support - .txt, .md, .csv, .json, .yaml, .py, .m, .dax, .ps1, .sql, .pdf, .docx, .xlsx, .xls, .slx
  • Unicode filename support - Handles files with emojis, special characters, and symbols
  • Enhanced filename sanitization - Automatically cleans problematic characters
  • Database tracking and logging - Comprehensive activity monitoring
  • Automatic file organization - Moves processed files to archive

RAG System (v2.0)

  • Ollama Integration - Local embeddings with nomic-embed-text model
  • FAISS Vector Database - High-performance similarity search
  • Hybrid Search - Combines semantic similarity with keyword matching
  • ChromaDB Support - Alternative vector database (optional)
  • Real-time Monitoring - Watchdog-based file system monitoring
  • Debounced Processing - Prevents race conditions and duplicate processing

Performance & Scalability (v2.1.2)

  • Dynamic Parallel Processing - Up to 12 workers for large batches (50+ files)
  • Batch Processing - Configurable batch sizes with system overload protection
  • Database Optimization - Batch logging eliminates locking issues
  • Smart File Archiving - Failed files automatically moved to organized folders
  • Real-time Performance Metrics - Files/minute, avg processing time, peak CPU/memory
  • 500+ File Capability - Handles large volumes efficiently without loops or crashes
  • Source Folder Copying - Configurable copying of processed files back to source locations

Evaluation & Quality Assurance

  • Comprehensive Metrics - Precision@K, Recall@K, MRR, NDCG@K
  • Generation Quality - ROUGE-1/2/L, BLEU, BERTScore
  • Faithfulness Scoring - Evaluates answer grounding in source context
  • Context Utilization - Measures how much context is used in answers
  • Automated Evaluation - Scheduled testing with regression detection
  • LangSmith Integration - Tracing, evaluation, and feedback collection

Claude Code Integration (v2.1.33)

  • chunk-chat Skill - Process conversations inline without manual export/staging/copy workflow
  • Standalone Chunker - Zero-dependency Python script replicates core pipeline (sentence splitting, overlap, metadata enrichment)
  • Direct Output - Chunks, transcript, sidecar, and origin manifest written to working directory
  • File or Context - Process a file path (/chunk-chat ./file.txt) or capture the current conversation automatically

Production Features

  • Graceful Degradation - Continues working even if RAG components fail
  • Error Handling - Robust error recovery and logging
  • Performance Monitoring - System metrics and performance tracking
  • Security Redaction - PII masking in metadata
  • Modular Architecture - Clean separation of concerns
  • JSON Sidecar (optional) - Per-file sidecar with chunk list, metadata, and Python code blocks

Windows "Send to" (Optional Helper)

To quickly drop files into 02_data via right-click:

  1. Press Win+R → type shell:sendto → Enter
  2. Copy Chunker_MoveOptimized.bat to the SendTo folder
  3. Right-click any file → Send to → Chunker_MoveOptimized.bat

PowerShell Script: Chunker_MoveOptimized.ps1 + Chunker_MoveOptimized.bat

  • Moves files/folders from OneDrive or local folders to 02_data, preserving relative paths
  • Writes <filename>.origin.json manifest (original_full_path, times, size, sha256, optional hmac)
  • Automatically skips .origin.json manifest files to prevent processing loops
  • Handles OneDrive cloud files and reparse points using -Force parameter
  • Uses multi-method file detection for robust OneDrive compatibility
  • Watcher reads the manifest and populates sidecar origin (falls back if missing; see the sketch below)
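
A minimal sketch of the manifest read-with-fallback step, using the manifest fields listed above (the watcher's actual logic is richer):

import json
from pathlib import Path

def read_origin(data_file: Path) -> dict:
    manifest = data_file.with_name(data_file.name + ".origin.json")
    if manifest.exists():
        return json.loads(manifest.read_text(encoding="utf-8"))
    # Fallback: no manifest, so derive minimal origin info from the file.
    return {"original_full_path": str(data_file.resolve()),
            "size": data_file.stat().st_size}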

Features:

  • ✅ OneDrive Support: Detects and processes OneDrive online-only files and reparse points
  • ✅ Manifest Filtering: Automatically skips .origin.json metadata files
  • ✅ Error Handling: Retries file removal with exponential backoff for OneDrive sync issues
  • ✅ Cleanup Utility: Use cleanup_origin_files.ps1 to remove leftover manifest files from Desktop

Notes:

  • Discovery is recursive under 02_data and case-insensitive for extensions
  • Optional sidecar copy-back to source/ is enabled via copy_sidecar_to_source
  • If files remain on Desktop after "Send to", OneDrive may have restored them (check the error summary)

KB Operations (OneDrive)

Primary PC

  • Start watcher: powershell -File scripts/Start-Watcher.ps1
  • Stop watcher: powershell -File scripts/Stop-Watcher.ps1
  • Smoke test: powershell -File scripts/Smoke-Test.ps1
  • Health: powershell -File scripts/KB-Health.ps1
  • Run report: powershell -File tools/write_run_report.ps1
  • Config check: npm run kb:cfg:check
  • Analytics snapshot: npm run kb:analytics (pass -- --days 7 for weekly view)
  • Toggle dedupe: npm run kb:cfg:dedupe:on / npm run kb:cfg:dedupe:off
  • Toggle incremental updates: npm run kb:cfg:incr:on / npm run kb:cfg:incr:off
  • Consistency check: npm run kb:consistency

Secondary PC

  • Do not start the watcher.
  • Use streamlit run gui_app.py for search and answers.

Notes

  • Only one watcher process should run.
  • OneDrive folder must be set to Always keep on this device.
  • Duplicate protection is active through incremental updates and de-dup logic.
  • To auto-start on login, import scripts/KB_Watcher_StartOnLogin.xml in Task Scheduler (Action → Import Task) and confirm the action path points to C:\_chunker.
  • After import or any restart, run npm run kb:health to verify a single Running (PID=…) instance.
  • Weekly maintenance: import scripts/KB_Weekly_Dedupe.xml to schedule Monday 09:00 cleanups (dedupe + run report) or run manually with npm run kb:report.

🔄 Consolidation (2025-10-29)

  • New sidecar flags (config.json):
    • enable_json_sidecar (default at introduction: true; now false by default, see KB maintenance tools above)
    • enable_block_summary (default: true)
    • enable_grok (default: false)

Sidecar schema (high-level):

  • file, processed_at, department, type, output_folder, transcript

  • chunks[]: filename, path, size, index

  • code_blocks[] (for .py): type, name, signature, start_line, end_line, docstring

  • Older project iterations (e.g., ClaudeExportFixer, chat_log_chunker_v1, chat_watcher) were unified under C:\_chunker.

  • Historical outputs migrated to C:\_chunker\04_output\<ProjectName>_<timestamp>.

  • Legacy artifacts captured once per project (latest snapshot only):

    • Docs → 99_doc\legacy\<ProjectName>_<timestamp>
    • Config → 06_config\legacy\<ProjectName>_<timestamp>
    • Logs → logs\archive\ (and legacy 05_logs\ on older trees; use scripts\Archive-ChunkerLogs.ps1 to merge)
    • DB/Backups → 03_archive\legacy\<ProjectName>_<timestamp>
  • Script backups stored with timestamp prefixes at C:\Users\carucci_r\OneDrive - City of Hackensack\00_dev\backup_scripts\<ProjectName>\.

  • Policy: keep only the latest legacy snapshot per project (older snapshots pruned).

βš™οΈ Configuration

Edit config.json to customize:

Core Settings

  • File filter modes: all, patterns, suffix
  • Supported file extensions: .txt, .md, .csv, .json, .yaml, .py, .m, .dax, .ps1, .sql, .pdf, .docx, .xlsx, .xls, .slx
  • Chunk sizes and processing options: sentence limits, overlap settings
  • Notification settings: email alerts and summaries

RAG Settings

  • rag_enabled: Enable/disable RAG functionality
  • ollama_model: Ollama embedding model (default: nomic-embed-text)
  • faiss_persist_dir: FAISS index storage directory
  • chroma_persist_dir: ChromaDB storage directory (optional)

LangSmith Settings (Optional)

  • langsmith_api_key: Your LangSmith API key
  • langsmith_project: Project name for tracing
  • tracing_enabled: Enable/disable tracing
  • evaluation_enabled: Enable/disable evaluation

Monitoring Settings

  • debounce_window: File event debouncing time in seconds (see the sketch after this list)
  • use_ready_signal: Wait for <filename>.ready markers before processing atomic pushes
  • failed_dir: Directory for failed file processing quarantine
  • max_workers: Maximum parallel processing workers
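
For intuition on debounce_window, a small sketch of event debouncing; the project's real implementation lives in watchdog_system and differs in detail:

import time

class Debouncer:
    """Release a path only after its events have been quiet for window_s."""

    def __init__(self, window_s=1.0):
        self.window = window_s
        self._pending = {}          # path -> time of most recent event

    def record(self, path: str) -> None:
        self._pending[path] = time.monotonic()

    def due(self) -> list:
        # Return paths whose last event is older than the window.
        now = time.monotonic()
        ready = [p for p, t in self._pending.items() if now - t >= self.window]
        for p in ready:
            del self._pending[p]
        return ready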

Vector Store Settings

  • batch.size: Number of chunks to insert into ChromaDB per batch (default 500)
  • batch.flush_every: Optional flush cadence for very large ingest jobs
  • batch.mem_soft_limit_mb: Soft memory cap for batching helper
  • search.ef_search: Overrides HNSW search ef after each batch to rebalance recall vs. latency

πŸ” RAG Usage

Setup

  1. Install Dependencies: python install_rag_dependencies.py
  2. Install Ollama: Download from ollama.ai
  3. Pull Model: ollama pull nomic-embed-text
  4. Enable RAG: Set "rag_enabled": true in config.json
  5. Start Processing: python watcher_splitter.py

Search Knowledge Base

Interactive Search

python rag_search.py

Command Line Search

# Single query
python rag_search.py --query "How do I fix vlookup errors?"

# Batch search
python rag_search.py --batch queries.txt --output results.json

# Different search types
python rag_search.py --query "Excel formulas" --search-type semantic
python rag_search.py --query "vlookup excel" --search-type keyword

GUI Search

streamlit run gui_app.py

Opens a browser interface for entering queries, browsing results, and viewing knowledge-base statistics.

Programmatic Search

from ollama_integration import initialize_ollama_rag

# Initialize RAG system
rag = initialize_ollama_rag()

# Search
results = rag.hybrid_search("How do I fix vlookup errors?", top_k=5)

# Display results
for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Content: {result['content'][:100]}...")
    print(f"Source: {result['metadata']['source_file']}")

Example Output

Interactive Search Session:

RAG Search Interface
==================================================
Commands:
  search <query> - Search the knowledge base
  semantic <query> - Semantic similarity search
  keyword <query> - Keyword-based search
  stats - Show knowledge base statistics
  quit - Exit the interface

RAG> search How do I fix vlookup errors?

Search Results for: 'How do I fix vlookup errors?'
==================================================

1. Score: 0.847 (semantic)
   Source: excel_guide.md
   Type: .md
   Content: VLOOKUP is used to find values in a table. Syntax: VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). Use FALSE for exact matches...
   Keywords: vlookup, excel, formula, table

2. Score: 0.723 (semantic)
   Source: troubleshooting.xlsx
   Type: .xlsx
   Content: Common VLOOKUP errors include #N/A when lookup value not found, #REF when table array is invalid...
   Keywords: vlookup, error, troubleshooting, excel

Search completed in 0.234 seconds
Found 2 results

📊 Evaluation & Testing

Automated Evaluation

# Run comprehensive evaluation
python automated_eval.py

# Run specific tests
python rag_test.py

# Generate evaluation report
python -c "from automated_eval import AutomatedEvaluator; evaluator = AutomatedEvaluator({}); evaluator.generate_csv_report()"

Manual Evaluation

from rag_evaluation import RAGEvaluator
from rag_integration import FaithfulnessScorer

# Initialize evaluator
evaluator = RAGEvaluator()

# Evaluate retrieval quality
retrieval_metrics = evaluator.evaluate_retrieval(
    retrieved_docs=["doc1.md", "doc2.xlsx"],
    relevant_docs=["doc1.md", "doc2.xlsx", "doc3.pdf"],
    k_values=[1, 3, 5]
)

# Evaluate generation quality
generation_metrics = evaluator.evaluate_generation(
    reference="Check data types and table references",
    generated="Verify data types and table references for vlookup errors"
)

# Evaluate faithfulness
scorer = FaithfulnessScorer()
faithfulness_score = scorer.calculate_faithfulness(
    answer="VLOOKUP requires exact data types",
    context="VLOOKUP syntax requires exact data type matching"
)

print(f"Precision@5: {retrieval_metrics['precision_at_5']:.3f}")
print(f"ROUGE-1: {generation_metrics['rouge1']:.3f}")
print(f"Faithfulness: {faithfulness_score:.3f}")

LangSmith Integration

from langsmith_integration import initialize_langsmith

# Initialize LangSmith
langsmith = initialize_langsmith(
    api_key="your_api_key",
    project="chunker-rag-eval"
)

# Create evaluation dataset
test_queries = [
    {
        "query": "How do I fix vlookup errors?",
        "expected_answer": "Check data types and table references",
        "expected_sources": ["excel_guide.md", "troubleshooting.xlsx"]
    }
]

# Run evaluation
results = langsmith.run_evaluation(test_queries, rag_function)

πŸ“ Supported File Types

| Type | Extensions | Processing Method | Metadata Extracted |
| --- | --- | --- | --- |
| Text | .txt, .md | Direct text processing | Word count, sentences, keywords |
| Structured | .json, .csv, .yaml | Parsed structure | Schema, data types, samples |
| Office | .xlsx, .xls, .docx | Library extraction | Sheets, formulas, formatting |
| Code | .py, .ps1, .sql, .m, .dax | AST/parsing | Functions, classes, imports, docstrings |
| Documents | .pdf | Text extraction | Pages, metadata, text content |
| Models | .slx | Specialized extraction | Model structure, parameters |
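
The Processing Method column boils down to dispatch on the (case-insensitive) file extension; a simplified sketch:

from pathlib import Path

HANDLERS = {
    ".txt": "text", ".md": "text",
    ".json": "structured", ".csv": "structured", ".yaml": "structured",
    ".xlsx": "office", ".xls": "office", ".docx": "office",
    ".py": "code", ".ps1": "code", ".sql": "code", ".m": "code", ".dax": "code",
    ".pdf": "document", ".slx": "model",
}

def processing_method(path: str) -> str:
    # Case-insensitive extension lookup, matching the discovery rules above.
    return HANDLERS.get(Path(path).suffix.lower(), "unsupported")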

πŸ› οΈ Advanced Features

Real-time Monitoring

from watchdog_system import create_watchdog_monitor

# Initialize watchdog monitor
monitor = create_watchdog_monitor(config, process_callback)

# Start monitoring
monitor.start()

# Monitor stats
stats = monitor.get_stats()
print(f"Queue size: {stats['queue_size']}")
print(f"Processing files: {stats['processing_files']}")

Modular File Processing

from file_processors import process_excel_file, process_pdf_file

# Process specific file types
excel_content = process_excel_file("", "data.xlsx")
pdf_content = process_pdf_file("", "document.pdf")

Embedding Management

from embedding_helpers import EmbeddingManager, batch_process_files  # batch helper assumed to live alongside EmbeddingManager

# Initialize embedding manager
manager = EmbeddingManager(chunk_size=1000, chunk_overlap=200)

# Process files for embedding: file_paths is your list of input paths,
# extract_keywords_func your own keyword-extraction callable
results = batch_process_files(file_paths, manager, extract_keywords_func)

🚀 Performance & Scalability

  • Parallel Processing: Multi-threaded file processing with configurable workers
  • Streaming: Large file support with memory-efficient streaming
  • Caching: FAISS index persistence for fast startup
  • Debouncing: Prevents duplicate processing of rapidly changing files
  • Graceful Degradation: Continues working even if optional components fail

🔧 Troubleshooting

Common Issues

  1. Pydantic-Core Version Incompatible (Watcher/Manual Processing Crashes)

    # ChromaDB/deduplication requires pydantic-core 2.41.5; 2.42.0 causes SystemError on import
    pip install pydantic-core==2.41.5

    If the watcher or manual_process_files.py crashes immediately with SystemError: pydantic-core version incompatible, run the above. Then restart the watcher or re-run manual processing. Current deduplication.py also treats that import failure like "Chroma unavailable" so the process may start without dedup until versions are fixed.

  2. Start-Watcher.ps1 exits before Python runs

    • KB_Shared must resolve: either %OneDriveCommercial% (or OneDrive / OneDriveConsumer) must expand to a path that contains KB_Shared, or config.json output_dir must expand to a real folder under KB_Shared (script derives KB_Shared as the parent of 04_output).
    • Python must be on PATH as python or py (launcher uses py -3 as fallback).
  3. ChromaDB Installation Fails (Windows)

    # Use FAISS instead
    pip install faiss-cpu
    # Or install build tools
    # Or use Docker deployment
  4. Ollama Not Available

    # Install Ollama from https://ollama.ai/
    # Pull the model
    ollama pull nomic-embed-text
  5. Memory Issues with Large Files

    # Enable streaming in config
    "enable_streaming": true,
    "stream_chunk_size": 1048576  # 1MB chunks
  6. UnicodeEncodeError in PowerShell Logs (Windows)

    # Switch console to UTF-8 before starting the watcher
    chcp 65001
    Set-Item env:PYTHONIOENCODING utf-8
    python watcher_splitter.py

    This prevents logging failures when filenames contain emoji or other non-ASCII characters.

Performance Optimization

  • Chunk Size: Adjust based on content type (75 for police, 150 for admin)
  • Parallel Workers: Set based on CPU cores (default: 4)
  • Debounce Window: Increase for slow file systems (default: 1s)
  • Index Persistence: Enable for faster startup after restart

📈 Monitoring & Analytics

  • Database Tracking: SQLite database with processing statistics
  • Session Metrics: Files processed, chunks created, performance metrics
  • Error Logging: Comprehensive error tracking and notification
  • System Metrics: CPU, memory, disk usage monitoring
  • RAG Metrics: Search performance, evaluation scores, user feedback

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Ollama for local embedding models
  • FAISS for vector similarity search
  • LangChain for RAG framework
  • LangSmith for evaluation and tracing
  • Watchdog for file system monitoring

🔄 Version Control & GitHub

Git Repository

This project is version-controlled using Git and backed up to GitHub.

Remote Repository: https://github.com/racmac57/chunker_Web.git

Quick Git Commands

# Check status
git status

# Stage and commit changes
git add -A
git commit -m "Description of changes"

# Push to GitHub
git push origin main

# View commit history
git log --oneline -10

Files Excluded from Git

The following are automatically excluded via .gitignore:

  • Processed documents (99_doc/, 04_output/)
  • Archived files (03_archive/)
  • Database files (*.db, *.sqlite)
  • Log files (logs/, *.log)
  • Virtual environments (.venv/, venv/)
  • NLTK data (nltk_data/)
  • Temporary and backup files

Contributing via Git

  1. Clone the repository: git clone https://github.com/racmac57/chunker_Web.git
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and commit: git commit -m "Feature: description"
  4. Push to your fork and create a pull request

For detailed Git setup information, see GIT_SETUP_STATUS.md.

Directory Health

Last Cleanup: 2025-10-31 19:22:39
Items Scanned: 16595
Items Moved: 7
Items Deleted: 627
Snapshots Pruned: 0

Snapshot Policy: Keep only the latest legacy snapshot per project. Older snapshots are pruned during maintenance. Config backups follow the same policy.

Log Location: logs/archive/ (example historical run: logs/archive/maintenance/2025_10_31_19_16_35/)

Git Status: ✅ Repository initialized, connected to GitHub, and regularly backed up.
