Chunker v2.1.33 - Enterprise RAG-Powered File Processing System

For AI/Claude project context and workflow guidance, see Claude.md.

What this repo does

Four concerns, now mapped to named sub-packages so the scope boundaries are legible without relocating files:

| Purpose | Sub-package | Key modules |
| --- | --- | --- |
| 1. Ingest AI chat logs + chunk for archival | ingest/ | watcher_splitter.py, file_processors.py, metadata_enrichment.py |
| 2. Feed prior conversation context back to AI | search/ | ask.py, kb_ask_ollama.py, rag_integration.py |
| 3. AI-interactive file search | search/ | gui_app.py (Streamlit), api_server.py (FastAPI), rag_search.py |
| 4. Knowledge base management | kb/ | backfill_knowledge_base.py, chromadb_crud.py, deduplication.py, backup_manager.py |

Shared plumbing (SQLite pool, monitoring, watchdog debounce, job integrity, query cache) lives in infra/. The optional RAG evaluation harness lives in evaluation/ (install with pip install -r requirements-eval.txt).

The sub-package __init__.py files re-export the existing root modules; this is a soft boundary layer, not a rewrite. Existing scripts continue to work.
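
For illustration, a sub-package __init__.py under this scheme might look like the sketch below; apart from module names that appear elsewhere in this README, the re-exported names are assumptions.

# ingest/__init__.py - an illustrative soft boundary layer, not the actual file
import watcher_splitter                      # re-export the root module wholesale
import file_processors
from metadata_enrichment import enrich_metadata

__all__ = ["watcher_splitter", "file_processors", "enrich_metadata"]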

Installing

pip install -r requirements.txt          # minimal core (watcher + tests)
pip install -r requirements-app.txt      # full RAG stack + search UIs
pip install -r requirements-eval.txt     # optional RAG evaluation harness
pip install -r requirements-dev.txt      # linters + test extras

Running the search surfaces

streamlit run gui_app.py           # browser UI
uvicorn api_server:app --reload    # REST API (docs at /docs)
python ask.py "your query"         # CLI

KB maintenance tools

Tag upkeep:

  • tools/audit_tags.py — audit ChromaDB tag metadata against the canonical taxonomy.
  • tools/retag_chromadb.py — re-run metadata_enrichment.enrich_metadata and upsert tags (dry-run by default; pass --apply).

KB cleanup — directory-level (primary pipeline; one conversation = one directory):

  1. tools/kb_inventory.py --root "%OneDriveCommercial%\KB_Shared\04_output" --skip-hash — walk the KB, emit file-level CSV including a conversation_dir column.
  2. tools/kb_directory_staleness.py — group by base project slug across conversation_dirs, flag older duplicates. Marks safe_archive when every file hash in the older dir appears in the newer one (see the sketch after this list); otherwise review.
  3. tools/kb_archive_dirs.py --root <root> — dry-run first; add --apply to move whole directories to <root>/_archive/YYYY-MM-DD-dir-cleanup/ and delete their chunk ids from ChromaDB. Rollback manifest written next to the archive.
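
The safe_archive decision in step 2 reduces to a subset test over content hashes; a minimal sketch of that rule:

def classify_older_dir(older_hashes, newer_hashes):
    # safe_archive: every file hash in the older conversation_dir also
    # appears in the newer one, so nothing unique would be lost.
    return "safe_archive" if set(older_hashes) <= set(newer_hashes) else "review"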

KB cleanup — file-level (edge cases where dirs don't apply):

  • tools/kb_staleness_report.py and tools/kb_archive.py — same pipeline but at file granularity. Use when conversation_dirs are flat.

Production KB lives at %OneDriveCommercial%\KB_Shared\04_output (see config.json). Sidecar JSON generation is now disabled by default ("enable_json_sidecar": false); ChromaDB metadata is the source of truth.

What changed in v2.1.33

  • Claude Code chunk-chat Skill: New skill at .claude/skills/chunk-chat/ replicates the core chunker pipeline directly inside a Claude Code session, eliminating the manual workflow of exporting chat logs, staging them in 02_data/, and copying output back.
    • .claude/scripts/chat_chunker.py: Standalone Python chunker (stdlib only, zero external dependencies) with sentence-based splitting, overlap, tag detection, key term extraction, and full output artifact generation (chunks, transcript, sidecar JSON, origin manifest).
    • .claude/skills/chunk-chat/SKILL.md: Skill definition that captures conversation context and runs the chunker inline. Invoke with /chunk-chat or say "chunk this chat".
    • Output lands in ./chunked_chat/ in the current working directory -- no staging or copying required.

What changed in v2.1.32

See CHANGELOG.md for full details.

  • Logging layout: All watcher-related logs under logs\ (including Task Scheduler / silent start). Use scripts\Archive-ChunkerLogs.ps1 to move legacy 05_logs\ or rotated watcher_archive_*.log into logs\archive\.
  • KB-PathSafetyReport.ps1: Fixed watcher PID check (do not assign to $pid).

What changed in v2.1.31

See CHANGELOG.md for full details.

  • SQLite logging: chunker_db.log_processing() commits processing_history before updating department_stats, avoiding nested-connection lock contention that could delay or hang manual_process_files.py after a successful job.
  • Documentation: Version sync across core docs; clarified manual_process_files.py --auto (scans watch_folder from config.json); troubleshooting for "Department stats update locked" / apparent manual-run hangs.

What changed in v2.1.30

See CHANGELOG.md for full details.

  • Documentation: Watcher startup docs updated for Start-Watcher.ps1 (KB_Shared detection from config.json + OneDrive env fallbacks, python / py -3 launcher), deduplication/ChromaDB import resilience notes, corrected log paths in quick reference, troubleshooting addendum in 07_docs/File_Processing_Investigation_Report.md.

What changed in v2.1.29

See CHANGELOG.md for full details.

  • Documentation Updates: Aligned all project documentation with current code and configuration. Fixed broken doc references in Claude.md, updated supported extensions and query_cache config in README, corrected config settings in SUMMARY.

What changed in v2.1.28

See CHANGELOG.md for full details.

  • Incremental Updates Skip Logic: Unchanged files are now skipped when incremental updates are enabled. VersionTracker.has_changed() is consulted before reprocessing; matching content hash returns early.
  • Test Suite Fixes: All 76 tests pass. Fixed test_incremental_updates_skip_reprocessing (recreate file after archiving) and job integrity fixtures (artifact sizes ≥ 50 bytes).

What changed in v2.1.27

See CHANGELOG.md for full details.

  • Pydantic-Core Compatibility Fix: Resolved watcher startup crash caused by pydantic-core 2.42.0 incompatibility with pydantic 2.12.x. ChromaDB/deduplication requires pydantic-core==2.41.5. Fix: pip install pydantic-core==2.41.5
  • Documentation Updates: Corrected log paths across project docs (logs/ for watcher.log and watcher_start.log; 05_logs/ for silent start). Updated troubleshooting for stale watcher PID, pydantic crashes, and file processing issues.
  • KB-Health Script: Health check reports Watcher status, OneDrive paths, ChromaDB presence, and directory accessibility. Output/archive paths read from config.json.

What changed in v2.1.26

See CHANGELOG.md for full details.

  • Job Integrity Validation System: Comprehensive post-processing validation prevents incomplete outputs from being archived as successful jobs
    • Validates all required artifacts (chunks, transcript, sidecar, manifest) with size thresholds
    • File stability checks prevent OneDrive sync race conditions (2s window, 30s timeout)
    • Intelligent retry logic for timing-related failures (15s delay, max 1 retry)
    • Detailed failure reports with full diagnostics in {job_id}.integrity_fail.log
    • Artifact quarantine moves failed job files to failed/ folder
    • Multi-channel notifications (Windows toast, log markers, console)
    • 19 unit tests (all passing), comprehensive documentation
    • 100% backward compatible - can be disabled via config
  • Enhanced Logging: Start-Watcher.ps1 now shows startup banner with all monitored paths
  • ChromaDB Metadata Fix: Fixed validation error by converting lists to strings and handling None values

What changed in v2.1.25

See CHANGELOG.md for full details.

  • Orphaned Manifest File Fix: Fixed bug where .origin.json manifest files were left in 02_data after processing. Both archive and quarantine functions now properly move manifest files.
  • Cleanup Script: Added Cleanup-Orphaned-Manifests.ps1 to detect and archive orphaned manifests with dry-run and verbose modes.

What changed in v2.1.24

See CHANGELOG.md for full details.

  • KB Consolidation Complete: Backfilled 8,136 chunks to ChromaDB knowledge base (8,142 total). Full semantic search now operational with verified query results.
  • Pydantic 2.x Compatibility Fix: Fixed rag_integration.py to work with pydantic 2.12+ by manually injecting private attributes into ChromaDB Collection objects.
  • Output Consolidation Script: Added Move-Unique-Local-Output-To-OneDrive.ps1 for consolidating local output folders to OneDrive KB_Shared.
  • Documentation Consolidation: Archived duplicate docs from 07_docs/ to 07_docs/archived_20260130/. Root directory now has single canonical versions of all key documentation files.
  • Config.json Enhanced: Merged additional settings from 07_docs config - now supports 15 file types (added pdf, docx, xlsx, yaml, etc.), auto KB insertion, Ollama embedding config, and performance tuning options.
  • Completion Report: Added 07_docs/KB_CONSOLIDATION_COMPLETION_REPORT.md and 07_docs/DOC_CONSOLIDATION_REPORT_20260130.md documenting consolidation processes.

What changed in v2.1.23

See CHANGELOG.md for full details.

  • ChromaDB backfill and verify: Pydantic 2.12+ and ChromaDB 0.3.x support. Dummy-embedding fallback when sentence_transformers unavailable. Backfill and verify scripts updated.
  • Chunk ID sanitization: Backfill sanitizes chunk IDs for ChromaDB/DuckDB so quotes and special chars do not break SQL.
  • Ollama RAG: Ollama integration tolerates missing deps with clear install message. Scripts to pull nomic-embed-text and run backfill with real embeddings (Python 3.11 venv).
  • Department migration: 20+ departments and priority-based detection from laptop source. Department as fallback for archive (flat 03_archive when default).
  • ChromaDB guide: Optional real embeddings, RAG search (Ollama + FAISS), chunk ID sanitization, Ollama on Windows.

What changed in v2.1.22

See CHANGELOG.md for full details.

  • Unicode Encoding Fix: Fixed Unicode encoding errors in manual_process_files.py when running on Windows console
    • Replaced Unicode checkmark/cross characters with ASCII-safe alternatives
    • Added UTF-8 encoding handling for Windows console output
    • Fixed file existence check to prevent errors during batch processing

What changed in v2.1.21

See CHANGELOG.md for full details.

  • File Processing Investigation Report: Added comprehensive troubleshooting guide for unprocessed files
    • Root cause analysis for files not being processed by the chunker
    • Configuration verification procedures
    • Manual processing solutions using manual_process_files.py
    • Watcher status checking and restart procedures
    • Documentation at 07_docs/File_Processing_Investigation_Report.md

What changed in v2.1.20

See CHANGELOG.md for full details.

  • OneDrive SYNC Folder Complete Removal: Successfully removed SYNC folder from both local and cloud
    • Created Move-SYNC-To-Temp-And-Delete.ps1 - Moves SYNC contents to C:\TEMP and deletes empty SYNC folder
    • Successfully moved 35 items (41.85 MB) to backup and deleted SYNC folder locally
    • User deleted SYNC folder from web interface, triggering OneDrive deletion sync
    • OneDrive processing 178,824 file deletions (30-60 minutes estimated)
    • Additional troubleshooting scripts for stuck sync operations and folder blocking issues

What changed in v2.1.19

See CHANGELOG.md for full details.

  • OneDrive SYNC Directory Removal: Added automated scripts to remove problematic SYNC directories causing recurring "path is too long" sync errors
    • Safely removes 4 directories (565,962 files, ~165 GB) from OneDrive sync
    • Uses robocopy for better OneDrive cloud file handling, then deletes directories
    • Files moved to backup location initially, then deleted after verification
    • Complete documentation and troubleshooting guides included
    • Connection issue resolution scripts and monitoring tools added
    • Cloud sync status checking and laptop sync guidance provided

What changed in v2.1.18

See CHANGELOG.md for full details.

  • OneDrive Desktop Auto-Repair: Added comprehensive PowerShell scripts to detect and fix Desktop misalignment between Windows and OneDrive, resolving "dual desktop" sync issues
  • OneDrive Desktop Post-Verification Monitor: Added continuous monitoring script (OneDrive_Desktop_PostVerify_Monitor.ps1) that checks every 30 seconds until all Desktop alignment checks pass, with color-coded dashboard, sync status analysis, status histogram, and auto-exit on success
  • Desktop Path Detection Fix: Fixed false negative in monitor script by using Windows Known Folder API to correctly detect OneDrive-redirected Desktop paths
  • Sync Status Improvements: Enhanced sync status reporting to properly identify folders vs files and provide status breakdown histogram

What changed in v2.1.17

See CHANGELOG.md for full details.

  • Fixed 85+ recursive paths causing OneDrive sync failures, including SB_160116 hall of mirrors issue

Version 2.1.19 - OneDrive SYNC directory removal scripts to fix recurring path length sync errors, with robocopy-based file handling and comprehensive documentation.

Version 2.1.18 - OneDrive Desktop Auto-Repair scripts with continuous monitoring, Known Folder API path detection fix, and enhanced sync status reporting.

Version 2.1.17 - ChromaDB rebuilt with compatibility fixes, streamlined release automation, refreshed documentation, comprehensive regression coverage, plus watcher stability and SQLite hardening.

What's New in v2.1.8+

Recent Improvements (Post-v2.1.8)

  • Tiny File Archiving: Files under 100 bytes are automatically parked in 03_archive/skipped_files/ with their manifests to eliminate endless "too small" retries.
  • Manifest & Hash Safety: Watcher now skips any file containing .origin.json in its name and recomputes content hashes when the manifest is missing a checksum so incremental tracking remains intact.
  • Chunk Writer Hardening: Consolidated write_chunk_files() helper creates the directory once, writes UTF-8 chunks with defensive logging, and copy_manifest_sidecar() guarantees parent folders exist before copying manifests.
  • Parallel Queue Handling: Added optional multiprocessing.Pool batches for queues ≥32 files (config flag), plus automatic pruning of the processed_files set to prevent long-running watcher stalls.
  • Tokenizer & Metrics Optimizations: Sentence tokenization is LRU-cached, system metrics run on a background executor, and notification bursts are throttled with a 60-second rate limiter per alert key (both patterns are sketched after this list).
  • SQLite Resilience: Centralized _conn() helper sets 60 s timeouts, log_error() now understands both legacy signatures and retries lock errors, and run_integrity_check() validates the DB at startup.
  • Test Coverage & Pytest Guardrails: Root conftest.py skips bulky 99_doc/legacy suites and tests/test_db.py smoke-tests the new retry path to ensure future regressions fail fast.
  • Database Lock Monitoring: MONITOR_DB_LOCKS.md documents command-line checks, baseline metrics (1.5 errors/min), and alert thresholds (3 errors/min = 2x baseline).
  • Watcher Bridge Support: watcher_splitter.py understands .part staging files, waits for optional .ready signals, retries processing up to three times, and quarantines stubborn failures to 03_archive/failed/.
  • Batched Chroma Ingest: ChromaRAG.add_chunks_bulk() honours batch.size, skips null embeddings, and refreshes hnsw:search_ef from config.json so the vector store keeps pace with high-volume ingest.
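
A hedged sketch of the two throttling patterns from the tokenizer/notification item above; the naive sentence split and all names are illustrative only:

import time
from functools import lru_cache

@lru_cache(maxsize=4096)
def tokenize_sentences(text: str) -> tuple:
    # Cache sentence splits so repeated chunks skip re-tokenization.
    return tuple(s.strip() for s in text.split(".") if s.strip())

_last_alert = {}

def should_notify(alert_key: str, window_s: float = 60.0) -> bool:
    # Allow at most one notification per alert key per 60-second window.
    now = time.monotonic()
    if now - _last_alert.get(alert_key, 0.0) >= window_s:
        _last_alert[alert_key] = now
        return True
    return False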

v2.1.8 Release (2025-11-07)

  • ChromaDB Rebuild: Upgraded to chromadb 1.3.4, recreated the collection, and re-ran the backfill so 2,907 enriched chunks are in sync with the latest pipeline.
  • Dedup Reliability: deduplication.py now ships with hnswlib compatibility shims, letting python deduplication.py --auto-remove complete without legacy metadata errors.
  • Release Helper: scripts/release_commit_and_tag.bat automates doc staging, backups, commit/tag creation, and pushes while rotating logs; the 2025-11-07 dry run and live validation are logged in docs/RELEASE_WORKFLOW.md.
  • Regression Tests: Replaced placeholder suites with 52-case pytest coverage for query caching, incremental updates, backup management, and monitoring to mirror the production APIs.
  • Watcher & DB Resilience (Nov 2025): Skips manifests/archives/output files, sanitises output folder names, replaces Unicode logging arrows, adds safe archive moves, and introduces exponential-backoff SQLite retries to squash recursion, path-length, and "database locked" errors.

What changed in v2.1.8? See the [changelog entry](./CHANGELOG.md#v2117---2025-11-07---chromadb-rebuild--release-automation).

🚀 What's New in v2.1.6

🚀 RAG Backfill Optimization

  • Multiprocessing: Parallel file processing and ChromaDB inserts with 4-8 workers (see the sketch after this list)
  • Performance: 20x faster backfill (100-200 chunks/second vs 5 chunks/second)
  • Batch Optimization: Optimized batch sizes (500-1000 chunks) for ChromaDB efficiency
  • HNSW Tuning: Proper vector index configuration (M=32, ef_construction=512, ef_search=200)
  • CPU Monitoring: Real-time CPU tracking with saturation alerts
  • Duplicate Detection: Pre-insertion verification prevents duplicate chunks
  • Verification Tools: Comprehensive scripts to verify backfill completeness
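
A rough sketch of the worker/batch pattern described above; the embedding step and the ChromaDB insert are placeholders, not the backfill script's actual API:

import multiprocessing as mp

def embed_batch(chunks):
    # Placeholder: embed one batch of chunk texts in a worker process.
    return [(c, [0.0]) for c in chunks]

def backfill(all_chunks, batch_size=500, workers=6):
    batches = [all_chunks[i:i + batch_size]
               for i in range(0, len(all_chunks), batch_size)]
    with mp.Pool(processes=workers) as pool:
        for embedded in pool.imap_unordered(embed_batch, batches):
            pass  # insert `embedded` into ChromaDB here, one batch per call

if __name__ == "__main__":
    backfill([f"chunk {i}" for i in range(2000)])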

📊 Verification & Validation

  • Empty Folder Logging: Identifies folders without chunk files
  • Count Discrepancy Alerts: Warns when expected vs actual counts differ
  • Chunk Completeness Verification: Validates all chunks from all folders are in KB
  • Performance Metrics: Detailed throughput, memory, and CPU statistics

🚀 What's New in v2.1.5

📦 Move-Based Workflow (Grok Recommendations)

  • ⚡ Storage Optimization: Reduced storage overhead by 50-60% via MOVE operations instead of COPY
  • 🔗 OneDrive Sync Elimination: 100% reduction in sync overhead by moving files out of OneDrive
  • 📋 Manifest Tracking: Complete origin tracking with .origin.json files (see the sketch after this list)
  • 🔄 Enhanced Archive: MOVE with 3 retry attempts and graceful fallback to COPY
  • 🎯 Department as fallback: Archive uses department subfolders only when department is explicit (path or enrichment); default department files go to flat 03_archive/
  • 🔁 Smart Retry Logic: Handles Windows permission issues with automatic retries

🚀 What's New in v2.1.2

🚨 Critical Performance Fixes

  • ⚡ Processing Loop Resolution: Fixed infinite loops that caused system hangs
  • 📁 Smart File Archiving: Failed files automatically moved to organized archive folders
  • 🔒 Database Stability: Eliminated "database is locked" errors with batch operations
  • ⚡ 8-12x Speed Improvement: Dynamic parallel workers and optimized processing

🚀 Performance Enhancements

  • πŸ” Advanced RAG System: Ollama + FAISS for local embeddings and semantic search
  • πŸ“Š Comprehensive Evaluation: Precision@K, Recall@K, MRR, ROUGE, BLEU, Faithfulness scoring
  • πŸ”— LangSmith Integration: Tracing, evaluation, and feedback collection
  • ⚑ Real-time Monitoring: Watchdog-based file system monitoring with debouncing
  • πŸ€– Hybrid Search: Combines semantic similarity with keyword matching
  • πŸ“ˆ Automated Evaluation: Scheduled testing with regression detection
  • πŸ›‘οΈ Production Ready: Graceful degradation, error handling, and monitoring
  • πŸ“‚ Source Folder Copying: Configurable copying of processed files back to source locations

Directory Structure

  • C:/_chunker - Main project directory with scripts
  • .claude/ - Claude Code skill and scripts (chunk-chat skill, chat_chunker.py)
  • 02_data/ - Input files to be processed (watch folder)
  • 03_archive/ - Archived original files (in OneDrive KB_Shared when configured)
  • 03_archive/skipped_files/ - Files too small to process (< 100 bytes) - automatically archived
  • 04_output/ - Generated chunks and transcripts (in OneDrive KB_Shared when configured)
  • logs/ - All runtime logs (watcher.log, watcher_start.log, watcher_start_silent.log, manual_process.log); older material under logs/archive/ after scripts\Archive-ChunkerLogs.ps1
  • 06_config/ - Configuration files
  • 99_doc/legacy/ - Consolidated legacy docs (latest snapshot per project)
  • 06_config/legacy/ - Consolidated legacy config (latest snapshot per project)
  • logs/archive/ - Migrated or rotated logs (optional; created by archive script)
  • 03_archive/legacy/ - Consolidated legacy db/backups (latest snapshot per project)
  • chroma_db/ - ChromaDB vector database storage
  • faiss_index/ - FAISS vector database storage
  • evaluations/ - RAG evaluation results
  • reports/ - Automated evaluation reports

🚀 Quick Start

Basic Usage (Core Chunking)

  1. Place files to process in 02_data/ folder
  2. Run the watcher: python watcher_splitter.py
  3. Check 04_output/ for processed chunks and transcripts
  4. Original files are moved to 03_archive/ after processing

Advanced Usage (RAG-Enabled)

  1. Install RAG dependencies: python install_rag_dependencies.py
  2. Install Ollama and pull model: ollama pull nomic-embed-text
  3. Enable RAG in config.json: Set "rag_enabled": true
  4. Run the watcher: python watcher_splitter.py
  5. Search knowledge base: python rag_search.py

Advanced Usage (Celery-Enabled)

For high-volume processing and advanced task management:

  1. Install Celery Dependencies:

    pip install celery redis flower
  2. Start Redis Server:

    # Windows: Download from https://github.com/microsoftarchive/redis/releases
    redis-server
    
    # Linux: sudo apt-get install redis-server
    # macOS: brew install redis
  3. Start Celery Services:

    # Option A: Use orchestrator (recommended)
    python orchestrator.py
    
    # Option B: Start manually
    celery -A celery_tasks worker --loglevel=info --concurrency=4
    celery -A celery_tasks beat --loglevel=info
    celery -A celery_tasks flower --port=5555
    python enhanced_watchdog.py
  4. Monitor Tasks:

    • Flower Dashboard: http://localhost:5555 (with authentication)
    • Celery CLI: celery -A celery_tasks inspect active
    • Logs: Check logs/watcher.log
  5. Security & Priority Features:

    • Flower Authentication: Default credentials logged on startup
    • Priority Queues: High-priority processing for legal/police files
    • Redis Fallback: Automatic fallback to direct processing if Redis fails
    • Task Timeouts: 300s hard limit with graceful handling
  6. Configuration:

    {
      "celery_enabled": true,
      "celery_broker": "redis://localhost:6379/0",
      "celery_task_time_limit": 300,
      "celery_worker_concurrency": 4,
      "priority_departments": ["legal", "police"]
    }
  7. Environment Variables (Optional):

    export FLOWER_USERNAME="your_username"
    export FLOWER_PASSWORD="your_secure_password"

βš™οΈ Feature Toggles & Setup

All new subsystems ship disabled by default so existing deployments behave exactly as before. Enable individual features by updating config.json in the project root.

Metadata Enrichment (metadata_enrichment)

  • Adds semantic tags, key terms, summaries, and source metadata to chunk sidecars and manifests.
  • Output schema documented in docs/METADATA_SCHEMA.md.
  • Enable with:
    "metadata_enrichment": { "enabled": true }

Monitoring & Health Checks (monitoring)

  • Background thread performs disk, throughput, and ChromaDB checks and escalates via the notification system.
  • Configure thresholds in config.json under the monitoring section; default recipients come from notification_system.py.
  • Start by setting "monitoring": { "enabled": true }.

Database Lock Monitoring

  • Current Performance: 1.5 database lock errors/minute baseline (68% reduction from previous)
  • Monitoring Documentation: See MONITOR_DB_LOCKS.md for comprehensive monitoring commands and alert thresholds
  • Real-time Monitoring:
    # Watch for lock errors in real-time
    powershell -Command "Get-Content watcher_live.log -Wait | Select-String -Pattern 'Failed to log|database is locked'"
    
    # Check hourly error count
    powershell -Command "(Get-Content watcher_live.log | Select-String -Pattern 'Failed to log processing' | Select-Object -Last 100 | Measure-Object).Count"
  • Alert Threshold: Flag if errors exceed 3/minute (2x current baseline)
  • Review Schedule: Monitor every 8-12 hours using commands in MONITOR_DB_LOCKS.md
  • Key Findings:
    • 92% of lock errors occur in log_processing() (lacks retry wrapper)
    • 8% in _update_department_stats() (has 5-retry exponential backoff)
    • Current retry config: get_connection (3 retries), dept_stats (5 retries with 1.5x backoff); a generic sketch of this backoff pattern follows this list
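
A minimal sketch of that exponential-backoff wrapper using the parameters above; the real helpers in chunker_db differ in detail:

import sqlite3
import time

def with_retries(op, retries=5, base_delay=0.5, factor=1.5):
    # Re-run `op` when SQLite reports a locked database, backing off
    # 1.5x per attempt; re-raise anything else or the final failure.
    delay = base_delay
    for attempt in range(retries):
        try:
            return op()
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= factor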

Tiny File Handling

  • Automatic Archiving: Files under 100 bytes (default min_file_size_bytes) are automatically moved to 03_archive/skipped_files/
  • Examples: Empty files, "No measures found" messages, test placeholders
  • Behavior: Files are preserved with their .origin.json manifests for review rather than deleted or left to trigger repeated warnings
  • Configuration: Adjust threshold in department config via min_file_size_bytes parameter (default: 100)
  • Logs: Look for [INFO] File too short (X chars), archiving: filename messages

Deduplication (deduplication)

  • Prevents duplicate chunks from entering ChromaDB in both watcher and backfill flows.
  • Optionally run cleanup via python deduplication.py --auto-remove.
  • Already present in config.json; flip "enabled": true to activate.

Query Cache (query_cache)

  • Enables an in-memory LRU + TTL cache in rag_integration.py so repeat queries avoid hitting ChromaDB (behaviour sketched after the config example below).
  • Configure ttl_seconds, max_entries under the query_cache section.
  • API users can inspect runtime metrics via GET /api/cache/stats once the cache is enabled.
  • Example from config.json:
    "query_cache": {
      "enabled": true,
      "ttl_seconds": 600,
      "max_entries": 512
    }
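
Behaviourally, the cache is an ordered map with per-entry timestamps; a minimal sketch of the LRU + TTL semantics (not the actual rag_integration code):

import time
from collections import OrderedDict

class QueryCache:
    def __init__(self, ttl_seconds=600, max_entries=512):
        self.ttl, self.max = ttl_seconds, max_entries
        self._data = OrderedDict()  # query -> (timestamp, result)

    def get(self, query):
        item = self._data.get(query)
        if item is None or time.monotonic() - item[0] > self.ttl:
            self._data.pop(query, None)   # missing or expired
            return None
        self._data.move_to_end(query)     # mark as recently used
        return item[1]

    def put(self, query, result):
        self._data[query] = (time.monotonic(), result)
        self._data.move_to_end(query)
        if len(self._data) > self.max:
            self._data.popitem(last=False)  # evict least recently used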

Incremental Updates (incremental_updates)

  • Uses a shared VersionTracker to hash inputs, skip untouched files, remove old chunk IDs, and persist deterministic chunk identifiers.
  • Tracker state defaults to 06_config/file_versions.json (override with version_file).
  • Typical configuration:
    "incremental_updates": {
      "enabled": true,
      "version_file": "06_config/file_versions.json",
      "hash_algorithm": "sha256"
    }
  • After enabling, unchanged files are skipped by the watcher/backfill, while reprocessed sources clean up stale artifacts before writing new chunks; the sketch below shows the underlying hash comparison.
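
A hedged sketch of that skip decision; only has_changed() is named elsewhere in this README, the rest of the interface is assumed:

import hashlib
import json
from pathlib import Path

class VersionTracker:
    def __init__(self, version_file="06_config/file_versions.json"):
        self.path = Path(version_file)
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def has_changed(self, file_path: str) -> bool:
        digest = hashlib.sha256(Path(file_path).read_bytes()).hexdigest()
        if self.state.get(file_path) == digest:
            return False            # unchanged: caller skips reprocessing
        self.state[file_path] = digest
        self.path.write_text(json.dumps(self.state, indent=2))
        return True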

Backup Manager (backup)

  • Creates compressed archives of ChromaDB and critical directories on a schedule.
  • Configure destination, retention, and schedule in the backup section.
  • Manual run: python backup_manager.py --config config.json create --label on-demand.

After toggling features, restart the watcher (python watcher_splitter.py) so runtime components reinitialize with the new configuration.

✨ Features

Core Chunking

  • Organized output by source file name with timestamp prefixes
  • Multi-file type support - .txt, .md, .csv, .json, .yaml, .py, .m, .dax, .ps1, .sql, .pdf, .docx, .xlsx, .xls, .slx
  • Unicode filename support - Handles files with emojis, special characters, and symbols
  • Enhanced filename sanitization - Automatically cleans problematic characters
  • Database tracking and logging - Comprehensive activity monitoring
  • Automatic file organization - Moves processed files to archive

RAG System (v2.0)

  • Ollama Integration - Local embeddings with nomic-embed-text model
  • FAISS Vector Database - High-performance similarity search
  • Hybrid Search - Combines semantic similarity with keyword matching
  • ChromaDB Support - Alternative vector database (optional)
  • Real-time Monitoring - Watchdog-based file system monitoring
  • Debounced Processing - Prevents race conditions and duplicate processing

Performance & Scalability (v2.1.2)

  • Dynamic Parallel Processing - Up to 12 workers for large batches (50+ files)
  • Batch Processing - Configurable batch sizes with system overload protection
  • Database Optimization - Batch logging eliminates locking issues
  • Smart File Archiving - Failed files automatically moved to organized folders
  • Real-time Performance Metrics - Files/minute, avg processing time, peak CPU/memory
  • 500+ File Capability - Handles large volumes efficiently without loops or crashes
  • Source Folder Copying - Configurable copying of processed files back to source locations

Evaluation & Quality Assurance

  • Comprehensive Metrics - Precision@K, Recall@K, MRR, NDCG@K
  • Generation Quality - ROUGE-1/2/L, BLEU, BERTScore
  • Faithfulness Scoring - Evaluates answer grounding in source context
  • Context Utilization - Measures how much context is used in answers
  • Automated Evaluation - Scheduled testing with regression detection
  • LangSmith Integration - Tracing, evaluation, and feedback collection

Claude Code Integration (v2.1.33)

  • chunk-chat Skill - Process conversations inline without manual export/staging/copy workflow
  • Standalone Chunker - Zero-dependency Python script replicates core pipeline (sentence splitting, overlap, metadata enrichment)
  • Direct Output - Chunks, transcript, sidecar, and origin manifest written to working directory
  • File or Context - Process a file path (/chunk-chat ./file.txt) or capture the current conversation automatically

Production Features

  • Graceful Degradation - Continues working even if RAG components fail
  • Error Handling - Robust error recovery and logging
  • Performance Monitoring - System metrics and performance tracking
  • Security Redaction - PII masking in metadata
  • Modular Architecture - Clean separation of concerns
  • JSON Sidecar (optional) - Per-file sidecar with chunk list, metadata, and Python code blocks

Windows "Send to" (Optional Helper)

To quickly drop files into 02_data via right-click:

  1. Press Win+R → type shell:sendto → Enter
  2. Copy Chunker_MoveOptimized.bat to the SendTo folder
  3. Right-click any file → Send to → Chunker_MoveOptimized.bat

PowerShell Script: Chunker_MoveOptimized.ps1 + Chunker_MoveOptimized.bat

  • Moves files/folders from OneDrive or local folders to 02_data, preserving relative paths
  • Writes <filename>.origin.json manifest (original_full_path, times, size, sha256, optional hmac)
  • Automatically skips .origin.json manifest files to prevent processing loops
  • Handles OneDrive cloud files and reparse points using -Force parameter
  • Uses multi-method file detection for robust OneDrive compatibility
  • Watcher reads the manifest and populates sidecar origin (falls back if missing; see the sketch below)
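
A minimal sketch of the manifest read-with-fallback step, using the manifest fields listed above (the watcher's actual logic is richer):

import json
from pathlib import Path

def read_origin(data_file: Path) -> dict:
    manifest = data_file.with_name(data_file.name + ".origin.json")
    if manifest.exists():
        return json.loads(manifest.read_text(encoding="utf-8"))
    # Fallback: no manifest, so derive minimal origin info from the file.
    return {"original_full_path": str(data_file.resolve()),
            "size": data_file.stat().st_size}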

Features:

  • ✅ OneDrive Support: Detects and processes OneDrive online-only files and reparse points
  • ✅ Manifest Filtering: Automatically skips .origin.json metadata files
  • ✅ Error Handling: Retries file removal with exponential backoff for OneDrive sync issues
  • ✅ Cleanup Utility: Use cleanup_origin_files.ps1 to remove leftover manifest files from Desktop

Notes:

  • Discovery is recursive under 02_data and case-insensitive for extensions
  • Optional sidecar copy-back to source/ is enabled via copy_sidecar_to_source
  • If files remain on Desktop after "Send to", OneDrive may have restored them (check the error summary)

KB Operations (OneDrive)

Primary PC

  • Start watcher: powershell -File scripts/Start-Watcher.ps1
  • Stop watcher: powershell -File scripts/Stop-Watcher.ps1
  • Smoke test: powershell -File scripts/Smoke-Test.ps1
  • Health: powershell -File scripts/KB-Health.ps1
  • Run report: powershell -File tools/write_run_report.ps1
  • Config check: npm run kb:cfg:check
  • Analytics snapshot: npm run kb:analytics (pass -- --days 7 for weekly view)
  • Toggle dedupe: npm run kb:cfg:dedupe:on / npm run kb:cfg:dedupe:off
  • Toggle incremental updates: npm run kb:cfg:incr:on / npm run kb:cfg:incr:off
  • Consistency check: npm run kb:consistency

Secondary PC

  • Do not start the watcher.
  • Use streamlit run gui_app.py for search and answers.

Notes

  • Only one watcher process should run.
  • OneDrive folder must be set to Always keep on this device.
  • Duplicate protection is active through incremental updates and de-dup logic.
  • To auto-start on login, import scripts/KB_Watcher_StartOnLogin.xml in Task Scheduler (Action → Import Task) and confirm the action path points to C:\_chunker.
  • After import or any restart, run npm run kb:health to verify a single Running (PID=…) instance.
  • Weekly maintenance: import scripts/KB_Weekly_Dedupe.xml to schedule Monday 09:00 cleanups (dedupe + run report) or run manually with npm run kb:report.

🔄 Consolidation (2025-10-29)

  • New sidecar flags (config.json):
    • enable_json_sidecar (default at introduction: true; now false by default, see KB maintenance tools above)
    • enable_block_summary (default: true)
    • enable_grok (default: false)

Sidecar schema (high-level):

  • file, processed_at, department, type, output_folder, transcript

  • chunks[]: filename, path, size, index

  • code_blocks[] (for .py): type, name, signature, start_line, end_line, docstring

  • Older project iterations (e.g., ClaudeExportFixer, chat_log_chunker_v1, chat_watcher) were unified under C:\_chunker.

  • Historical outputs migrated to C:\_chunker\04_output\<ProjectName>_<timestamp>.

  • Legacy artifacts captured once per project (latest snapshot only):

    • Docs → 99_doc\legacy\<ProjectName>_<timestamp>
    • Config → 06_config\legacy\<ProjectName>_<timestamp>
    • Logs → logs\archive\ (and legacy 05_logs\ on older trees; use scripts\Archive-ChunkerLogs.ps1 to merge)
    • DB/Backups → 03_archive\legacy\<ProjectName>_<timestamp>
  • Script backups stored with timestamp prefixes at C:\Users\carucci_r\OneDrive - City of Hackensack\00_dev\backup_scripts\<ProjectName>\.

  • Policy: keep only the latest legacy snapshot per project (older snapshots pruned).

βš™οΈ Configuration

Edit config.json to customize:

Core Settings

  • File filter modes: all, patterns, suffix
  • Supported file extensions: .txt, .md, .csv, .json, .yaml, .py, .m, .dax, .ps1, .sql, .pdf, .docx, .xlsx, .xls, .slx
  • Chunk sizes and processing options: sentence limits, overlap settings
  • Notification settings: email alerts and summaries

RAG Settings

  • rag_enabled: Enable/disable RAG functionality
  • ollama_model: Ollama embedding model (default: nomic-embed-text)
  • faiss_persist_dir: FAISS index storage directory
  • chroma_persist_dir: ChromaDB storage directory (optional)

LangSmith Settings (Optional)

  • langsmith_api_key: Your LangSmith API key
  • langsmith_project: Project name for tracing
  • tracing_enabled: Enable/disable tracing
  • evaluation_enabled: Enable/disable evaluation

Monitoring Settings

  • debounce_window: File event debouncing time in seconds (see the sketch after this list)
  • use_ready_signal: Wait for <filename>.ready markers before processing atomic pushes
  • failed_dir: Directory for failed file processing quarantine
  • max_workers: Maximum parallel processing workers
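
For intuition on debounce_window, a small sketch of event debouncing; the project's real implementation lives in watchdog_system and differs in detail:

import time

class Debouncer:
    """Release a path only after its events have been quiet for window_s."""

    def __init__(self, window_s=1.0):
        self.window = window_s
        self._pending = {}          # path -> time of most recent event

    def record(self, path: str) -> None:
        self._pending[path] = time.monotonic()

    def due(self) -> list:
        # Return paths whose last event is older than the window.
        now = time.monotonic()
        ready = [p for p, t in self._pending.items() if now - t >= self.window]
        for p in ready:
            del self._pending[p]
        return ready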

Vector Store Settings

  • batch.size: Number of chunks to insert into ChromaDB per batch (default 500)
  • batch.flush_every: Optional flush cadence for very large ingest jobs
  • batch.mem_soft_limit_mb: Soft memory cap for batching helper
  • search.ef_search: Overrides HNSW search ef after each batch to rebalance recall vs. latency

πŸ” RAG Usage

Setup

  1. Install Dependencies: python install_rag_dependencies.py
  2. Install Ollama: Download from ollama.ai
  3. Pull Model: ollama pull nomic-embed-text
  4. Enable RAG: Set "rag_enabled": true in config.json
  5. Start Processing: python watcher_splitter.py

Search Knowledge Base

Interactive Search

python rag_search.py

Command Line Search

# Single query
python rag_search.py --query "How do I fix vlookup errors?"

# Batch search
python rag_search.py --batch queries.txt --output results.json

# Different search types
python rag_search.py --query "Excel formulas" --search-type semantic
python rag_search.py --query "vlookup excel" --search-type keyword

GUI Search

streamlit run gui_app.py

Opens a browser interface for entering queries, browsing results, and viewing knowledge-base statistics.

Programmatic Search

from ollama_integration import initialize_ollama_rag

# Initialize RAG system
rag = initialize_ollama_rag()

# Search
results = rag.hybrid_search("How do I fix vlookup errors?", top_k=5)

# Display results
for result in results:
    print(f"Score: {result['score']:.3f}")
    print(f"Content: {result['content'][:100]}...")
    print(f"Source: {result['metadata']['source_file']}")

Example Output

Interactive Search Session:

RAG Search Interface
==================================================
Commands:
  search <query> - Search the knowledge base
  semantic <query> - Semantic similarity search
  keyword <query> - Keyword-based search
  stats - Show knowledge base statistics
  quit - Exit the interface

RAG> search How do I fix vlookup errors?

Search Results for: 'How do I fix vlookup errors?'
==================================================

1. Score: 0.847 (semantic)
   Source: excel_guide.md
   Type: .md
   Content: VLOOKUP is used to find values in a table. Syntax: VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). Use FALSE for exact matches...
   Keywords: vlookup, excel, formula, table

2. Score: 0.723 (semantic)
   Source: troubleshooting.xlsx
   Type: .xlsx
   Content: Common VLOOKUP errors include #N/A when lookup value not found, #REF when table array is invalid...
   Keywords: vlookup, error, troubleshooting, excel

Search completed in 0.234 seconds
Found 2 results

📊 Evaluation & Testing

Automated Evaluation

# Run comprehensive evaluation
python automated_eval.py

# Run specific tests
python rag_test.py

# Generate evaluation report
python -c "from automated_eval import AutomatedEvaluator; evaluator = AutomatedEvaluator({}); evaluator.generate_csv_report()"

Manual Evaluation

from rag_evaluation import RAGEvaluator
from rag_integration import FaithfulnessScorer

# Initialize evaluator
evaluator = RAGEvaluator()

# Evaluate retrieval quality
retrieval_metrics = evaluator.evaluate_retrieval(
    retrieved_docs=["doc1.md", "doc2.xlsx"],
    relevant_docs=["doc1.md", "doc2.xlsx", "doc3.pdf"],
    k_values=[1, 3, 5]
)

# Evaluate generation quality
generation_metrics = evaluator.evaluate_generation(
    reference="Check data types and table references",
    generated="Verify data types and table references for vlookup errors"
)

# Evaluate faithfulness
scorer = FaithfulnessScorer()
faithfulness_score = scorer.calculate_faithfulness(
    answer="VLOOKUP requires exact data types",
    context="VLOOKUP syntax requires exact data type matching"
)

print(f"Precision@5: {retrieval_metrics['precision_at_5']:.3f}")
print(f"ROUGE-1: {generation_metrics['rouge1']:.3f}")
print(f"Faithfulness: {faithfulness_score:.3f}")

LangSmith Integration

from langsmith_integration import initialize_langsmith

# Initialize LangSmith
langsmith = initialize_langsmith(
    api_key="your_api_key",
    project="chunker-rag-eval"
)

# Create evaluation dataset
test_queries = [
    {
        "query": "How do I fix vlookup errors?",
        "expected_answer": "Check data types and table references",
        "expected_sources": ["excel_guide.md", "troubleshooting.xlsx"]
    }
]

# Run evaluation
results = langsmith.run_evaluation(test_queries, rag_function)

πŸ“ Supported File Types

| Type | Extensions | Processing Method | Metadata Extracted |
| --- | --- | --- | --- |
| Text | .txt, .md | Direct text processing | Word count, sentences, keywords |
| Structured | .json, .csv, .yaml | Parsed structure | Schema, data types, samples |
| Office | .xlsx, .xls, .docx | Library extraction | Sheets, formulas, formatting |
| Code | .py, .ps1, .sql, .m, .dax | AST/parsing | Functions, classes, imports, docstrings |
| Documents | .pdf | Text extraction | Pages, metadata, text content |
| Models | .slx | Specialized extraction | Model structure, parameters |
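
The Processing Method column boils down to dispatch on the (case-insensitive) file extension; a simplified sketch:

from pathlib import Path

HANDLERS = {
    ".txt": "text", ".md": "text",
    ".json": "structured", ".csv": "structured", ".yaml": "structured",
    ".xlsx": "office", ".xls": "office", ".docx": "office",
    ".py": "code", ".ps1": "code", ".sql": "code", ".m": "code", ".dax": "code",
    ".pdf": "document", ".slx": "model",
}

def processing_method(path: str) -> str:
    # Case-insensitive extension lookup, matching the discovery rules above.
    return HANDLERS.get(Path(path).suffix.lower(), "unsupported")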

πŸ› οΈ Advanced Features

Real-time Monitoring

from watchdog_system import create_watchdog_monitor

# Initialize watchdog monitor
monitor = create_watchdog_monitor(config, process_callback)

# Start monitoring
monitor.start()

# Monitor stats
stats = monitor.get_stats()
print(f"Queue size: {stats['queue_size']}")
print(f"Processing files: {stats['processing_files']}")

Modular File Processing

from file_processors import process_excel_file, process_pdf_file

# Process specific file types
excel_content = process_excel_file("", "data.xlsx")
pdf_content = process_pdf_file("", "document.pdf")

Embedding Management

from embedding_helpers import EmbeddingManager, batch_process_files  # batch helper assumed to live alongside EmbeddingManager

# Initialize embedding manager
manager = EmbeddingManager(chunk_size=1000, chunk_overlap=200)

# Process files for embedding: file_paths is your list of input paths,
# extract_keywords_func your own keyword-extraction callable
results = batch_process_files(file_paths, manager, extract_keywords_func)

🚀 Performance & Scalability

  • Parallel Processing: Multi-threaded file processing with configurable workers
  • Streaming: Large file support with memory-efficient streaming
  • Caching: FAISS index persistence for fast startup
  • Debouncing: Prevents duplicate processing of rapidly changing files
  • Graceful Degradation: Continues working even if optional components fail

🔧 Troubleshooting

Common Issues

  1. Pydantic-Core Version Incompatible (Watcher/Manual Processing Crashes)

    # ChromaDB/deduplication requires pydantic-core 2.41.5; 2.42.0 causes SystemError on import
    pip install pydantic-core==2.41.5

    If the watcher or manual_process_files.py crashes immediately with SystemError: pydantic-core version incompatible, run the above. Then restart the watcher or re-run manual processing. Current deduplication.py also treats that import failure like "Chroma unavailable" so the process may start without dedup until versions are fixed.

  2. Start-Watcher.ps1 exits before Python runs

    • KB_Shared must resolve: either %OneDriveCommercial% (or OneDrive / OneDriveConsumer) must expand to a path that contains KB_Shared, or config.json output_dir must expand to a real folder under KB_Shared (script derives KB_Shared as the parent of 04_output).
    • Python must be on PATH as python or py (launcher uses py -3 as fallback).
  3. ChromaDB Installation Fails (Windows)

    # Use FAISS instead
    pip install faiss-cpu
    # Or install build tools
    # Or use Docker deployment
  4. Ollama Not Available

    # Install Ollama from https://ollama.ai/
    # Pull the model
    ollama pull nomic-embed-text
  5. Memory Issues with Large Files

    # Enable streaming in config
    "enable_streaming": true,
    "stream_chunk_size": 1048576  # 1MB chunks
  6. UnicodeEncodeError in PowerShell Logs (Windows)

    # Switch console to UTF-8 before starting the watcher
    chcp 65001
    Set-Item env:PYTHONIOENCODING utf-8
    python watcher_splitter.py

    This prevents logging failures when filenames contain emoji or other non-ASCII characters.

Performance Optimization

  • Chunk Size: Adjust based on content type (75 for police, 150 for admin)
  • Parallel Workers: Set based on CPU cores (default: 4)
  • Debounce Window: Increase for slow file systems (default: 1s)
  • Index Persistence: Enable for faster startup after restart

📈 Monitoring & Analytics

  • Database Tracking: SQLite database with processing statistics
  • Session Metrics: Files processed, chunks created, performance metrics
  • Error Logging: Comprehensive error tracking and notification
  • System Metrics: CPU, memory, disk usage monitoring
  • RAG Metrics: Search performance, evaluation scores, user feedback

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Ollama for local embedding models
  • FAISS for vector similarity search
  • LangChain for RAG framework
  • LangSmith for evaluation and tracing
  • Watchdog for file system monitoring

🔄 Version Control & GitHub

Git Repository

This project is version-controlled using Git and backed up to GitHub.

Remote Repository: https://github.com/racmac57/chunker_Web.git

Quick Git Commands

# Check status
git status

# Stage and commit changes
git add -A
git commit -m "Description of changes"

# Push to GitHub
git push origin main

# View commit history
git log --oneline -10

Files Excluded from Git

The following are automatically excluded via .gitignore:

  • Processed documents (99_doc/, 04_output/)
  • Archived files (03_archive/)
  • Database files (*.db, *.sqlite)
  • Log files (logs/, *.log)
  • Virtual environments (.venv/, venv/)
  • NLTK data (nltk_data/)
  • Temporary and backup files

Contributing via Git

  1. Clone the repository: git clone https://github.com/racmac57/chunker_Web.git
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes and commit: git commit -m "Feature: description"
  4. Push to your fork and create a pull request

For detailed Git setup information, see GIT_SETUP_STATUS.md.

Directory Health

Last Cleanup: 2025-10-31 19:22:39
Items Scanned: 16595
Items Moved: 7
Items Deleted: 627
Snapshots Pruned: 0

Snapshot Policy: Keep only the latest legacy snapshot per project. Older snapshots are pruned during maintenance. Config backups follow the same policy.

Log Location: logs/archive/ (example historical run: logs/archive/maintenance/2025_10_31_19_16_35/)

Git Status: ✅ Repository initialized, connected to GitHub, and regularly backed up.
