For AI/Claude project context and workflow guidance, see Claude.md.
Four concerns, now mapped to named sub-packages so the scope boundaries are legible without relocating files:
| Purpose | Sub-package | Key modules |
|---|---|---|
| 1. Ingest AI chat logs + chunk for archival | `ingest/` | `watcher_splitter.py`, `file_processors.py`, `metadata_enrichment.py` |
| 2. Feed prior conversation context back to AI | `search/` | `ask.py`, `kb_ask_ollama.py`, `rag_integration.py` |
| 3. AI-interactive file search | `search/` | `gui_app.py` (Streamlit), `api_server.py` (FastAPI), `rag_search.py` |
| 4. Knowledge base management | `kb/` | `backfill_knowledge_base.py`, `chromadb_crud.py`, `deduplication.py`, `backup_manager.py` |
Shared plumbing (SQLite pool, monitoring, watchdog debounce, job integrity, query cache) lives in `infra/`. The optional RAG evaluation harness lives in `evaluation/` (install with `pip install -r requirements-eval.txt`).
The sub-package `__init__.py` files re-export the existing root modules; this is a soft boundary layer, not a rewrite. Existing scripts continue to work.
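For orientation, here is a minimal sketch of what such a re-export shim could look like; the module list follows the table above, but the actual `__init__.py` contents may differ:

```python
# ingest/__init__.py -- illustrative re-export shim (assumed layout, not the literal file).
# The sub-package simply forwards to the existing root modules, so both
# `import watcher_splitter` and `from ingest import watcher_splitter` keep working.
import watcher_splitter
import file_processors
import metadata_enrichment

__all__ = ["watcher_splitter", "file_processors", "metadata_enrichment"]
```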
```bash
pip install -r requirements.txt        # minimal core (watcher + tests)
pip install -r requirements-app.txt    # full RAG stack + search UIs
pip install -r requirements-eval.txt   # optional RAG evaluation harness
pip install -r requirements-dev.txt    # linters + test extras
```

```bash
streamlit run gui_app.py               # browser UI
uvicorn api_server:app --reload        # REST API (docs at /docs)
python ask.py "your query"             # CLI
```

Tag upkeep:
- `tools/audit_tags.py` – audit ChromaDB tag metadata against the canonical taxonomy.
- `tools/retag_chromadb.py` – re-run `metadata_enrichment.enrich_metadata` and upsert tags (dry-run by default; pass `--apply`).
KB cleanup – directory-level (primary pipeline; one conversation = one directory):
- `tools/kb_inventory.py --root "%OneDriveCommercial%\KB_Shared\04_output" --skip-hash` – walk the KB and emit a file-level CSV including a `conversation_dir` column.
- `tools/kb_directory_staleness.py` – group by base project slug across conversation_dirs and flag older duplicates. Marks `safe_archive` when every file hash in the older dir appears in the newer one; otherwise `review` (see the sketch after this list).
- `tools/kb_archive_dirs.py --root <root>` – dry-run first; add `--apply` to move whole directories to `<root>/_archive/YYYY-MM-DD-dir-cleanup/` and delete their chunk ids from ChromaDB. A rollback manifest is written next to the archive.
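The `safe_archive` vs. `review` decision reduces to a set-containment check on content hashes. A minimal sketch of the idea (function names are illustrative, not the script's actual API):

```python
import hashlib
from pathlib import Path

def file_hashes(directory: Path) -> set[str]:
    """SHA-256 of every file under a conversation directory."""
    return {
        hashlib.sha256(p.read_bytes()).hexdigest()
        for p in directory.rglob("*") if p.is_file()
    }

def staleness_verdict(older_dir: Path, newer_dir: Path) -> str:
    """'safe_archive' if everything in the older dir also exists (by hash)
    in the newer dir, otherwise 'review'."""
    return "safe_archive" if file_hashes(older_dir) <= file_hashes(newer_dir) else "review"
```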
KB cleanup – file-level (edge cases where dirs don't apply):
- `tools/kb_staleness_report.py` and `tools/kb_archive.py` – same pipeline but at file granularity. Use when conversation_dirs are flat.
Production KB lives at `%OneDriveCommercial%\KB_Shared\04_output` (see `config.json`). Sidecar JSON generation is now disabled by default (`"enable_json_sidecar": false`); ChromaDB metadata is the source of truth.
- Claude Code `chunk-chat` Skill: New skill at `.claude/skills/chunk-chat/` replicates the core chunker pipeline directly inside a Claude Code session, eliminating the manual workflow of exporting chat logs, staging them in `02_data/`, and copying output back.
  - `.claude/scripts/chat_chunker.py`: Standalone Python chunker (stdlib only, zero external dependencies) with sentence-based splitting, overlap, tag detection, key term extraction, and full output artifact generation (chunks, transcript, sidecar JSON, origin manifest).
  - `.claude/skills/chunk-chat/SKILL.md`: Skill definition that captures conversation context and runs the chunker inline. Invoke with `/chunk-chat` or say "chunk this chat".
  - Output lands in `./chunked_chat/` in the current working directory; no staging or copying required.
See CHANGELOG.md for full details.
- Logging layout: All watcher-related logs live under `logs\` (including Task Scheduler / silent start). Use `scripts\Archive-ChunkerLogs.ps1` to move legacy `05_logs\` or rotated `watcher_archive_*.log` into `logs\archive\`.
- `KB-PathSafetyReport.ps1`: Fixed watcher PID check (do not assign to `$pid`).
See CHANGELOG.md for full details.
- SQLite logging: `chunker_db.log_processing()` commits `processing_history` before updating `department_stats`, avoiding the nested-connection lock contention that could delay or hang `manual_process_files.py` after a successful job.
- Documentation: Version sync across core docs; clarified `manual_process_files.py --auto` (scans `watch_folder` from `config.json`); troubleshooting for "Department stats update locked" / apparent manual-run hangs.
See CHANGELOG.md for full details.
- Documentation: Watcher startup docs updated for `Start-Watcher.ps1` (KB_Shared detection from `config.json` + OneDrive env fallbacks, `python`/`py -3` launcher), deduplication/ChromaDB import resilience notes, corrected log paths in the quick reference, and a troubleshooting addendum in `07_docs/File_Processing_Investigation_Report.md`.
See CHANGELOG.md for full details.
- Documentation Updates: Aligned all project documentation with current code and configuration. Fixed broken doc references in Claude.md, updated supported extensions and query_cache config in README, corrected config settings in SUMMARY.
See CHANGELOG.md for full details.
- Incremental Updates Skip Logic: Unchanged files are now skipped when incremental updates are enabled. `VersionTracker.has_changed()` is consulted before reprocessing; a matching content hash returns early.
- Test Suite Fixes: All 76 tests pass. Fixed `test_incremental_updates_skip_reprocessing` (recreate file after archiving) and job integrity fixtures (artifact sizes ≥ 50 bytes).
See CHANGELOG.md for full details.
- Pydantic-Core Compatibility Fix: Resolved watcher startup crash caused by pydantic-core 2.42.0 incompatibility with pydantic 2.12.x. ChromaDB/deduplication requires `pydantic-core==2.41.5`. Fix: `pip install pydantic-core==2.41.5`
- Documentation Updates: Corrected log paths across project docs (`logs/` for watcher.log and watcher_start.log; `05_logs/` for silent start). Updated troubleshooting for stale watcher PID, pydantic crashes, and file processing issues.
- KB-Health Script: Health check reports Watcher status, OneDrive paths, ChromaDB presence, and directory accessibility. Output/archive paths are read from `config.json`.
See CHANGELOG.md for full details.
- Job Integrity Validation System: Comprehensive post-processing validation prevents incomplete outputs from being archived as successful jobs
- Validates all required artifacts (chunks, transcript, sidecar, manifest) with size thresholds
- File stability checks prevent OneDrive sync race conditions (2s window, 30s timeout)
- Intelligent retry logic for timing-related failures (15s delay, max 1 retry)
- Detailed failure reports with full diagnostics in `{job_id}.integrity_fail.log`
- Artifact quarantine moves failed job files to a `failed/` folder
- Multi-channel notifications (Windows toast, log markers, console)
- 19 unit tests (all passing), comprehensive documentation
- 100% backward compatible - can be disabled via config
- Enhanced Logging: Start-Watcher.ps1 now shows startup banner with all monitored paths
- ChromaDB Metadata Fix: Fixed validation error by converting lists to strings and handling None values
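ChromaDB metadata values must be scalars (str/int/float/bool), which is why lists and `None` have to be coerced before insertion. A minimal sketch of that kind of sanitizer (the project's actual helper may differ):

```python
def sanitize_metadata(metadata: dict) -> dict:
    """Coerce metadata into ChromaDB-safe scalar values:
    lists become comma-joined strings, None values are dropped."""
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue                      # ChromaDB rejects None values
        if isinstance(value, (list, tuple, set)):
            clean[key] = ", ".join(str(v) for v in value)
        elif isinstance(value, (str, int, float, bool)):
            clean[key] = value
        else:
            clean[key] = str(value)       # fall back to string for anything else
    return clean

# Example: {"tags": ["excel", "vlookup"], "page": None} -> {"tags": "excel, vlookup"}
```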
See CHANGELOG.md for full details.
- Orphaned Manifest File Fix: Fixed bug where `.origin.json` manifest files were left in `02_data` after processing. Both archive and quarantine functions now properly move manifest files.
- Cleanup Script: Added `Cleanup-Orphaned-Manifests.ps1` to detect and archive orphaned manifests, with dry-run and verbose modes.
See CHANGELOG.md for full details.
- KB Consolidation Complete: Backfilled 8,136 chunks to ChromaDB knowledge base (8,142 total). Full semantic search now operational with verified query results.
- Pydantic 2.x Compatibility Fix: Fixed `rag_integration.py` to work with pydantic 2.12+ by manually injecting private attributes into ChromaDB Collection objects.
- Output Consolidation Script: Added `Move-Unique-Local-Output-To-OneDrive.ps1` for consolidating local output folders to OneDrive KB_Shared.
- Documentation Consolidation: Archived duplicate docs from `07_docs/` to `07_docs/archived_20260130/`. The root directory now has single canonical versions of all key documentation files.
- Config.json Enhanced: Merged additional settings from the 07_docs config - now supports 15 file types (added pdf, docx, xlsx, yaml, etc.), auto KB insertion, Ollama embedding config, and performance tuning options.
- Completion Report: Added `07_docs/KB_CONSOLIDATION_COMPLETION_REPORT.md` and `07_docs/DOC_CONSOLIDATION_REPORT_20260130.md` documenting the consolidation processes.
See CHANGELOG.md for full details.
- ChromaDB backfill and verify: Pydantic 2.12+ and ChromaDB 0.3.x support. Dummy-embedding fallback when sentence_transformers unavailable. Backfill and verify scripts updated.
- Chunk ID sanitization: Backfill sanitizes chunk IDs for ChromaDB/DuckDB so quotes and special chars do not break SQL.
- Ollama RAG: Ollama integration tolerates missing deps with clear install message. Scripts to pull nomic-embed-text and run backfill with real embeddings (Python 3.11 venv).
- Department migration: 20+ departments and priority-based detection from laptop source. Department as fallback for archive (flat 03_archive when default).
- ChromaDB guide: Optional real embeddings, RAG search (Ollama + FAISS), chunk ID sanitization, Ollama on Windows.
See CHANGELOG.md for full details.
- Unicode Encoding Fix: Fixed Unicode encoding errors in `manual_process_files.py` when running on the Windows console
  - Replaced Unicode checkmark/cross characters with ASCII-safe alternatives
  - Added UTF-8 encoding handling for Windows console output
  - Fixed file existence check to prevent errors during batch processing
See CHANGELOG.md for full details.
- File Processing Investigation Report: Added comprehensive troubleshooting guide for unprocessed files
  - Root cause analysis for files not being processed by the chunker
  - Configuration verification procedures
  - Manual processing solutions using `manual_process_files.py`
  - Watcher status checking and restart procedures
  - Documentation at `07_docs/File_Processing_Investigation_Report.md`
See CHANGELOG.md for full details.
- OneDrive SYNC Folder Complete Removal: Successfully removed the SYNC folder from both local and cloud
  - Created `Move-SYNC-To-Temp-And-Delete.ps1` - moves SYNC contents to C:\TEMP and deletes the empty SYNC folder
  - Successfully moved 35 items (41.85 MB) to backup and deleted the SYNC folder locally
  - User deleted the SYNC folder from the web interface, triggering OneDrive deletion sync
  - OneDrive processing 178,824 file deletions (30-60 minutes estimated)
  - Additional troubleshooting scripts for stuck sync operations and folder blocking issues
- Created
See CHANGELOG.md for full details.
- OneDrive SYNC Directory Removal: Added automated scripts to remove problematic SYNC directories causing recurring "path is too long" sync errors
- Safely removes 4 directories (565,962 files, ~165 GB) from OneDrive sync
- Uses robocopy for better OneDrive cloud file handling, then deletes directories
- Files moved to backup location initially, then deleted after verification
- Complete documentation and troubleshooting guides included
- Connection issue resolution scripts and monitoring tools added
- Cloud sync status checking and laptop sync guidance provided
See CHANGELOG.md for full details.
- OneDrive Desktop Auto-Repair: Added comprehensive PowerShell scripts to detect and fix Desktop misalignment between Windows and OneDrive, resolving "dual desktop" sync issues
- OneDrive Desktop Post-Verification Monitor: Added continuous monitoring script (`OneDrive_Desktop_PostVerify_Monitor.ps1`) that checks every 30 seconds until all Desktop alignment checks pass, with a color-coded dashboard, sync status analysis, status histogram, and auto-exit on success
- Desktop Path Detection Fix: Fixed false negative in the monitor script by using the Windows Known Folder API to correctly detect OneDrive-redirected Desktop paths
- Sync Status Improvements: Enhanced sync status reporting to properly identify folders vs files and provide status breakdown histogram
See CHANGELOG.md for full details.
- Fixed 85+ recursive paths causing OneDrive sync failures, including SB_160116 hall of mirrors issue
Version 2.1.19 - OneDrive SYNC directory removal scripts to fix recurring path length sync errors, with robocopy-based file handling and comprehensive documentation.
Version 2.1.18 - OneDrive Desktop Auto-Repair scripts with continuous monitoring, Known Folder API path detection fix, and enhanced sync status reporting.
Version 2.1.17 - ChromaDB rebuilt with compatibility fixes, streamlined release automation, refreshed documentation, comprehensive regression coverage, plus watcher stability and SQLite hardening.
- Tiny File Archiving: Files under 100 bytes are automatically parked in `03_archive/skipped_files/` with their manifests to eliminate endless "too small" retries.
- Manifest & Hash Safety: Watcher now skips any file containing `.origin.json` in its name and recomputes content hashes when the manifest is missing a checksum, so incremental tracking remains intact.
- Chunk Writer Hardening: Consolidated `write_chunk_files()` helper creates the directory once and writes UTF-8 chunks with defensive logging, and `copy_manifest_sidecar()` guarantees parent folders exist before copying manifests.
- Parallel Queue Handling: Added optional `multiprocessing.Pool` batches for queues ≥ 32 files (config flag), plus automatic pruning of the `processed_files` set to prevent long-running watcher stalls.
- Tokenizer & Metrics Optimizations: Sentence tokenization is LRU-cached, system metrics run on a background executor, and notification bursts are throttled with a 60-second rate limiter per alert key.
- SQLite Resilience: Centralized `_conn()` helper sets 60 s timeouts, `log_error()` now understands both legacy signatures and retries lock errors, and `run_integrity_check()` validates the DB at startup.
- Test Coverage & Pytest Guardrails: Root `conftest.py` skips bulky `99_doc/legacy` suites and `tests/test_db.py` smoke-tests the new retry path to ensure future regressions fail fast.
- Database Lock Monitoring: `MONITOR_DB_LOCKS.md` documents command-line checks, baseline metrics (1.5 errors/min), and alert thresholds (3 errors/min = 2x baseline).
- Watcher Bridge Support: `watcher_splitter.py` understands `.part` staging files, waits for optional `.ready` signals, retries processing up to three times, and quarantines stubborn failures to `03_archive/failed/`.
- Batched Chroma Ingest: `ChromaRAG.add_chunks_bulk()` honours `batch.size`, skips null embeddings, and refreshes `hnsw:search_ef` from `config.json` so the vector store keeps pace with high-volume ingest.
- ChromaDB Rebuild: Upgraded to `chromadb` 1.3.4, recreated the collection, and re-ran the backfill so 2,907 enriched chunks are in sync with the latest pipeline.
- Dedup Reliability: `deduplication.py` now ships with `hnswlib` compatibility shims, letting `python deduplication.py --auto-remove` complete without legacy metadata errors.
- Release Helper: `scripts/release_commit_and_tag.bat` automates doc staging, backups, commit/tag creation, and pushes while rotating logs; the 2025-11-07 dry run and live validation are logged in `docs/RELEASE_WORKFLOW.md`.
- Regression Tests: Replaced placeholder suites with 52-case pytest coverage for query caching, incremental updates, backup management, and monitoring to mirror the production APIs.
- Watcher & DB Resilience (Nov 2025): Skips manifests/archives/output files, sanitises output folder names, replaces Unicode logging arrows, adds safe archive moves, and introduces exponential-backoff SQLite retries to squash recursion, path-length, and "database locked" errors.
What changed in v2.1.17? See the [changelog entry](./CHANGELOG.md#v2117---2025-11-07---chromadb-rebuild--release-automation).
- Multiprocessing: Parallel file processing and ChromaDB inserts with 4-8 workers
- Performance: 20x faster backfill (100-200 chunks/second vs 5 chunks/second)
- Batch Optimization: Optimized batch sizes (500-1000 chunks) for ChromaDB efficiency
- HNSW Tuning: Proper vector index configuration (M=32, ef_construction=512, ef_search=200)
- CPU Monitoring: Real-time CPU tracking with saturation alerts
- Duplicate Detection: Pre-insertion verification prevents duplicate chunks
- Verification Tools: Comprehensive scripts to verify backfill completeness
- Empty Folder Logging: Identifies folders without chunk files
- Count Discrepancy Alerts: Warns when expected vs actual counts differ
- Chunk Completeness Verification: Validates all chunks from all folders are in KB
- Performance Metrics: Detailed throughput, memory, and CPU statistics
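The HNSW values listed above (M=32, ef_construction=512, ef_search=200) map onto ChromaDB's per-collection `hnsw:*` metadata. A minimal sketch of how a collection could be created with that tuning (collection name and client path are illustrative; the project's backfill scripts own the real setup):

```python
import chromadb

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(
    name="knowledge_base",                 # hypothetical collection name
    metadata={
        "hnsw:M": 32,                      # graph connectivity
        "hnsw:construction_ef": 512,       # index build quality
        "hnsw:search_ef": 200,             # query-time recall vs. latency
    },
)
```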
- Storage Optimization: Reduced storage overhead by 50-60% via MOVE operations instead of COPY
- OneDrive Sync Elimination: 100% reduction in sync overhead by moving files out of OneDrive
- Manifest Tracking: Complete origin tracking with `.origin.json` files
- Enhanced Archive: MOVE with 3 retry attempts and graceful fallback to COPY
- Department as fallback: Archive uses department subfolders only when the department is explicit (path or enrichment); default-department files go to a flat `03_archive/`
- Smart Retry Logic: Handles Windows permission issues with automatic retries
- Processing Loop Resolution: Fixed infinite loops that caused system hangs
- Smart File Archiving: Failed files automatically moved to organized archive folders
- Database Stability: Eliminated "database is locked" errors with batch operations
- 8-12x Speed Improvement: Dynamic parallel workers and optimized processing
- Advanced RAG System: Ollama + FAISS for local embeddings and semantic search
- Comprehensive Evaluation: Precision@K, Recall@K, MRR, ROUGE, BLEU, Faithfulness scoring
- LangSmith Integration: Tracing, evaluation, and feedback collection
- Real-time Monitoring: Watchdog-based file system monitoring with debouncing
- Hybrid Search: Combines semantic similarity with keyword matching
- Automated Evaluation: Scheduled testing with regression detection
- Production Ready: Graceful degradation, error handling, and monitoring
- Source Folder Copying: Configurable copying of processed files back to source locations
- C:/_chunker - Main project directory with scripts
- .claude/ - Claude Code skill and scripts (`chunk-chat` skill, `chat_chunker.py`)
- 02_data/ - Input files to be processed (watch folder)
- 03_archive/ - Archived original files (in OneDrive KB_Shared when configured)
- 03_archive/skipped_files/ - Files too small to process (< 100 bytes) - automatically archived
- 04_output/ - Generated chunks and transcripts (in OneDrive KB_Shared when configured)
- logs/ - All runtime logs (`watcher.log`, `watcher_start.log`, `watcher_start_silent.log`, `manual_process.log`); older material under `logs/archive/` after `scripts\Archive-ChunkerLogs.ps1`
- 06_config/ - Configuration files
- 99_doc/legacy/ - Consolidated legacy docs (latest snapshot per project)
- 06_config/legacy/ - Consolidated legacy config (latest snapshot per project)
- logs/archive/ - Migrated or rotated logs (optional; created by archive script)
- 03_archive/legacy/ - Consolidated legacy db/backups (latest snapshot per project)
- chroma_db/ - ChromaDB vector database storage
- faiss_index/ - FAISS vector database storage
- evaluations/ - RAG evaluation results
- reports/ - Automated evaluation reports
- Place files to process in the `02_data/` folder
- Run the watcher: `python watcher_splitter.py`
- Check `04_output/` for processed chunks and transcripts
- Original files are moved to `03_archive/` after processing
- Install RAG dependencies: `python install_rag_dependencies.py`
- Install Ollama and pull the model: `ollama pull nomic-embed-text`
- Enable RAG in `config.json`: set `"rag_enabled": true`
- Run the watcher: `python watcher_splitter.py`
- Search the knowledge base: `python rag_search.py`
For high-volume processing and advanced task management:
- Install Celery Dependencies:

  ```bash
  pip install celery redis flower
  ```

- Start Redis Server:

  ```bash
  # Windows: download from https://github.com/microsoftarchive/redis/releases
  redis-server
  # Linux:
  sudo apt-get install redis-server
  # macOS:
  brew install redis
  ```

- Start Celery Services:

  ```bash
  # Option A: Use orchestrator (recommended)
  python orchestrator.py

  # Option B: Start manually
  celery -A celery_tasks worker --loglevel=info --concurrency=4
  celery -A celery_tasks beat --loglevel=info
  celery -A celery_tasks flower --port=5555
  python enhanced_watchdog.py
  ```

- Monitor Tasks:
  - Flower Dashboard: http://localhost:5555 (with authentication)
  - Celery CLI: `celery -A celery_tasks inspect active`
  - Logs: check `logs/watcher.log`

- Security & Priority Features:
  - Flower Authentication: Default credentials logged on startup
  - Priority Queues: High-priority processing for legal/police files
  - Redis Fallback: Automatic fallback to direct processing if Redis fails
  - Task Timeouts: 300s hard limit with graceful handling

- Configuration:

  ```json
  {
    "celery_enabled": true,
    "celery_broker": "redis://localhost:6379/0",
    "celery_task_time_limit": 300,
    "celery_worker_concurrency": 4,
    "priority_departments": ["legal", "police"]
  }
  ```

- Environment Variables (Optional):

  ```bash
  export FLOWER_USERNAME="your_username"
  export FLOWER_PASSWORD="your_secure_password"
  ```
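For orientation, a minimal sketch of what a Celery task module behind `celery -A celery_tasks worker` typically looks like; the project's actual `celery_tasks.py` defines its own tasks and options, so treat the names and parameters here as placeholders:

```python
# Minimal Celery app sketch (assumed structure, not the project's real celery_tasks.py).
from celery import Celery

app = Celery("celery_tasks", broker="redis://localhost:6379/0")
app.conf.task_time_limit = 300          # mirrors celery_task_time_limit in config.json

@app.task(bind=True, max_retries=1)
def process_file(self, path: str) -> str:
    """Hypothetical task: run the chunker pipeline on one file."""
    # ... call into the watcher/chunker code here ...
    return f"processed {path}"
```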
All new subsystems ship disabled by default so existing deployments behave exactly as before. Enable individual features by updating config.json in the project root.
- Adds semantic tags, key terms, summaries, and source metadata to chunk sidecars and manifests.
- Output schema documented in `docs/METADATA_SCHEMA.md`.
- Enable with: `"metadata_enrichment": { "enabled": true }`
- Background thread performs disk, throughput, and ChromaDB checks and escalates via the notification system.
- Configure thresholds in `config.json` under the `monitoring` section; default recipients come from `notification_system.py`.
- Start by setting `"monitoring": { "enabled": true }`.
- Current Performance: 1.5 database lock errors/minute baseline (68% reduction from previous)
- Monitoring Documentation: See `MONITOR_DB_LOCKS.md` for comprehensive monitoring commands and alert thresholds
- Real-time Monitoring:

  ```powershell
  # Watch for lock errors in real-time
  powershell -Command "Get-Content watcher_live.log -Wait | Select-String -Pattern 'Failed to log|database is locked'"

  # Check hourly error count
  powershell -Command "(Get-Content watcher_live.log | Select-String -Pattern 'Failed to log processing' | Select-Object -Last 100 | Measure-Object).Count"
  ```

- Alert Threshold: Flag if errors exceed 3/minute (2x current baseline)
- Review Schedule: Monitor every 8-12 hours using the commands in `MONITOR_DB_LOCKS.md`
- Key Findings:
  - 92% of lock errors occur in `log_processing()` (lacks retry wrapper)
  - 8% occur in `_update_department_stats()` (has 5-retry exponential backoff)
  - Current retry config: get_connection (3 retries), dept_stats (5 retries with 1.5x backoff) - see the retry sketch after this list
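The retry behaviour described above is a standard exponential-backoff loop around `sqlite3` writes. A minimal sketch of the pattern (retry counts mirror the figures above; the helper name is illustrative, not the project's actual API):

```python
import sqlite3
import time

def execute_with_retry(db_path: str, sql: str, params=(), retries: int = 5, base_delay: float = 0.5):
    """Retry a write when SQLite reports 'database is locked', backing off 1.5x each attempt."""
    delay = base_delay
    for attempt in range(retries):
        try:
            with sqlite3.connect(db_path, timeout=60) as conn:
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as exc:
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 1.5                 # exponential backoff between attempts
```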
- Automatic Archiving: Files under 100 bytes (default `min_file_size_bytes`) are automatically moved to `03_archive/skipped_files/`
- Examples: Empty files, "No measures found" messages, test placeholders
- Behavior: Files are preserved with their `.origin.json` manifests for review rather than deleted or left to trigger repeated warnings
- Configuration: Adjust the threshold in the department config via the `min_file_size_bytes` parameter (default: 100)
- Logs: Look for `[INFO] File too short (X chars), archiving: filename` messages
- Prevents duplicate chunks from entering ChromaDB in both watcher and backfill flows.
- Optionally run cleanup via `python deduplication.py --auto-remove`.
- Already present in `config.json`; flip `"enabled": true` to activate.
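Conceptually, the pre-insertion check boils down to hashing normalized chunk text and skipping anything already present. A rough sketch of the idea (helper names are illustrative; `deduplication.py` owns the real logic):

```python
import hashlib

def chunk_fingerprint(text: str) -> str:
    """Stable fingerprint for a chunk: hash of whitespace-normalized, lowercased text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_new_chunks(chunks: list[str], existing_fingerprints: set[str]) -> list[str]:
    """Drop chunks whose fingerprint is already in the knowledge base."""
    fresh = []
    for chunk in chunks:
        fp = chunk_fingerprint(chunk)
        if fp not in existing_fingerprints:
            existing_fingerprints.add(fp)
            fresh.append(chunk)
    return fresh
```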
- Enables an in-memory LRU + TTL cache in `rag_integration.py` so repeat queries avoid hitting ChromaDB.
- Configure `ttl_seconds` and `max_entries` under the `query_cache` section.
- API users can inspect runtime metrics via `GET /api/cache/stats` once the cache is enabled.
- Example from `config.json`: `"query_cache": { "enabled": true, "ttl_seconds": 600, "max_entries": 512 }`
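The cache described above is a plain LRU with per-entry expiry. A minimal sketch of the pattern (the real implementation lives in `rag_integration.py` and may differ):

```python
import time
from collections import OrderedDict

class TTLCache:
    """Tiny LRU + TTL cache: evicts the least-recently-used entry past max_entries
    and treats entries older than ttl_seconds as misses."""

    def __init__(self, ttl_seconds: int = 600, max_entries: int = 512):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._data = OrderedDict()        # key -> (stored_at, value)

    def get(self, key: str):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if time.time() - stored_at > self.ttl:
            del self._data[key]           # expired entry counts as a miss
            return None
        self._data.move_to_end(key)       # mark as recently used
        return value

    def put(self, key: str, value) -> None:
        self._data[key] = (time.time(), value)
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least-recently-used
```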
- Uses a shared `VersionTracker` to hash inputs, skip untouched files, remove old chunk IDs, and persist deterministic chunk identifiers.
- Tracker state defaults to `06_config/file_versions.json` (override with `version_file`).
- Typical configuration: `"incremental_updates": { "enabled": true, "version_file": "06_config/file_versions.json", "hash_algorithm": "sha256" }`
- After enabling, unchanged files are skipped by the watcher/backfill, while reprocessed sources clean up stale artifacts before writing new chunks.
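Under the hood this is content hashing: if the stored hash for a path matches the current file, processing is skipped. A condensed sketch of that check (the actual `VersionTracker` API may differ):

```python
import hashlib
import json
from pathlib import Path

VERSION_FILE = Path("06_config/file_versions.json")   # default tracker state

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def has_changed(path: Path) -> bool:
    """True if the file's content hash differs from the last recorded one."""
    versions = json.loads(VERSION_FILE.read_text()) if VERSION_FILE.exists() else {}
    current = file_hash(path)
    if versions.get(str(path)) == current:
        return False                       # unchanged -> watcher/backfill skips it
    versions[str(path)] = current
    VERSION_FILE.write_text(json.dumps(versions, indent=2))
    return True
```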
- Creates compressed archives of ChromaDB and critical directories on a schedule.
- Configure destination, retention, and schedule in the `backup` section.
- Manual run: `python backup_manager.py --config config.json create --label on-demand`.
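At its core this kind of backup compresses the ChromaDB directory (and other critical folders) into a labeled, timestamped archive. A minimal sketch under those assumptions (the real `backup_manager.py` adds retention and scheduling; paths here are illustrative):

```python
import shutil
from datetime import datetime
from pathlib import Path

def create_backup(source_dir: str = "chroma_db", dest_dir: str = "backups",
                  label: str = "on-demand") -> str:
    """Zip source_dir into dest_dir with a timestamped, labeled name."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    archive_base = Path(dest_dir) / f"{Path(source_dir).name}_{label}_{stamp}"
    return shutil.make_archive(str(archive_base), "zip", root_dir=source_dir)
```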
After toggling features, restart the watcher (python watcher_splitter.py) so runtime components reinitialize with the new configuration.
- Organized output by source file name with timestamp prefixes
- Multi-file type support - .txt, .md, .csv, .json, .yaml, .py, .m, .dax, .ps1, .sql, .pdf, .docx, .xlsx, .xls, .slx
- Unicode filename support - Handles files with emojis, special characters, and symbols
- Enhanced filename sanitization - Automatically cleans problematic characters
- Database tracking and logging - Comprehensive activity monitoring
- Automatic file organization - Moves processed files to archive
- Ollama Integration - Local embeddings with nomic-embed-text model
- FAISS Vector Database - High-performance similarity search
- Hybrid Search - Combines semantic similarity with keyword matching
- ChromaDB Support - Alternative vector database (optional)
- Real-time Monitoring - Watchdog-based file system monitoring
- Debounced Processing - Prevents race conditions and duplicate processing
- Dynamic Parallel Processing - Up to 12 workers for large batches (50+ files)
- Batch Processing - Configurable batch sizes with system overload protection
- Database Optimization - Batch logging eliminates locking issues
- Smart File Archiving - Failed files automatically moved to organized folders
- Real-time Performance Metrics - Files/minute, avg processing time, peak CPU/memory
- 500+ File Capability - Handles large volumes efficiently without loops or crashes
- Source Folder Copying - Configurable copying of processed files back to source locations
- Comprehensive Metrics - Precision@K, Recall@K, MRR, NDCG@K
- Generation Quality - ROUGE-1/2/L, BLEU, BERTScore
- Faithfulness Scoring - Evaluates answer grounding in source context
- Context Utilization - Measures how much context is used in answers
- Automated Evaluation - Scheduled testing with regression detection
- LangSmith Integration - Tracing, evaluation, and feedback collection
- `chunk-chat` Skill - Process conversations inline without the manual export/staging/copy workflow
- Standalone Chunker - Zero-dependency Python script replicates the core pipeline (sentence splitting, overlap, metadata enrichment)
- Direct Output - Chunks, transcript, sidecar, and origin manifest written to the working directory
- File or Context - Process a file path (`/chunk-chat ./file.txt`) or capture the current conversation automatically
- Graceful Degradation - Continues working even if RAG components fail
- Error Handling - Robust error recovery and logging
- Performance Monitoring - System metrics and performance tracking
- Security Redaction - PII masking in metadata
- Modular Architecture - Clean separation of concerns
- JSON Sidecar (optional) - Per-file sidecar with chunk list, metadata, and Python code blocks
To quickly drop files into 02_data via right-click:
- Press Win+R, type `shell:sendto`, press Enter
- Copy `Chunker_MoveOptimized.bat` to the SendTo folder
- Right-click any file, choose Send to, then `Chunker_MoveOptimized.bat`
PowerShell Script: Chunker_MoveOptimized.ps1 + Chunker_MoveOptimized.bat
- Moves files/folders from OneDrive or local folders to `02_data`, preserving relative paths
- Writes a `<filename>.origin.json` manifest (original_full_path, times, size, sha256, optional hmac)
- Automatically skips `.origin.json` manifest files to prevent processing loops
- Handles OneDrive cloud files and reparse points using the `-Force` parameter
- Uses multi-method file detection for robust OneDrive compatibility
- Watcher reads the manifest and populates the sidecar `origin` (falls back if missing)
Features:
- OneDrive Support: Detects and processes OneDrive online-only files and reparse points
- Manifest Filtering: Automatically skips `.origin.json` metadata files
- Error Handling: Retries file removal with exponential backoff for OneDrive sync issues
- Cleanup Utility: Use `cleanup_origin_files.ps1` to remove leftover manifest files from Desktop
Notes:
- Discovery is recursive under `02_data` and case-insensitive for extensions
- Optional sidecar copy-back to `source/` is enabled via `copy_sidecar_to_source`
- If files remain on Desktop after "Send to", OneDrive may have restored them (check the error summary)
Primary PC
- Start watcher: `powershell -File scripts/Start-Watcher.ps1`
- Stop watcher: `powershell -File scripts/Stop-Watcher.ps1`
- Smoke test: `powershell -File scripts/Smoke-Test.ps1`
- Health: `powershell -File scripts/KB-Health.ps1`
- Run report: `powershell -File tools/write_run_report.ps1`
- Config check: `npm run kb:cfg:check`
- Analytics snapshot: `npm run kb:analytics` (pass `-- --days 7` for a weekly view)
- Toggle dedupe: `npm run kb:cfg:dedupe:on` / `npm run kb:cfg:dedupe:off`
- Toggle incremental updates: `npm run kb:cfg:incr:on` / `npm run kb:cfg:incr:off`
- Consistency check: `npm run kb:consistency`
Secondary PC
- Do not start the watcher.
- Use `streamlit run gui_app.py` for search and answers.
Notes
- Only one watcher process should run.
- OneDrive folder must be set to Always keep on this device.
- Duplicate protection is active through incremental updates and de-dup logic.
- To auto-start on login, import `scripts/KB_Watcher_StartOnLogin.xml` in Task Scheduler (Action > Import Task) and confirm the action path points to `C:\_chunker`.
- After import or any restart, run `npm run kb:health` to verify a single `Running (PID=…)` instance.
- Weekly maintenance: import `scripts/KB_Weekly_Dedupe.xml` to schedule Monday 09:00 cleanups (dedupe + run report) or run manually with `npm run kb:report`.
- New sidecar flags (`config.json`): `enable_json_sidecar` (default: true), `enable_block_summary` (default: true), `enable_grok` (default: false)
Sidecar schema (high-level):
- `file`, `processed_at`, `department`, `type`, `output_folder`, `transcript`
- `chunks[]`: filename, path, size, index
- `code_blocks[]` (for .py): type, name, signature, start_line, end_line, docstring
- Older project iterations (e.g., ClaudeExportFixer, chat_log_chunker_v1, chat_watcher) were unified under `C:\_chunker`.
- Historical outputs migrated to `C:\_chunker\04_output\<ProjectName>_<timestamp>`.
- Legacy artifacts captured once per project (latest snapshot only):
  - Docs → `99_doc\legacy\<ProjectName>_<timestamp>`
  - Config → `06_config\legacy\<ProjectName>_<timestamp>`
  - Logs → `logs\archive\` (and legacy `05_logs\` on older trees; use `scripts\Archive-ChunkerLogs.ps1` to merge)
  - DB/Backups → `03_archive\legacy\<ProjectName>_<timestamp>`
- Script backups stored with timestamp prefixes at `C:\Users\carucci_r\OneDrive - City of Hackensack\00_dev\backup_scripts\<ProjectName>\`.
Policy: keep only the latest legacy snapshot per project (older snapshots pruned).
Edit config.json to customize:
- File filter modes: all, patterns, suffix
- Supported file extensions: .txt, .md, .csv, .json, .yaml, .py, .m, .dax, .ps1, .sql, .pdf, .docx, .xlsx, .xls, .slx
- Chunk sizes and processing options: sentence limits, overlap settings
- Notification settings: email alerts and summaries
- `rag_enabled`: Enable/disable RAG functionality
- `ollama_model`: Ollama embedding model (default: nomic-embed-text)
- `faiss_persist_dir`: FAISS index storage directory
- `chroma_persist_dir`: ChromaDB storage directory (optional)
- `langsmith_api_key`: Your LangSmith API key
- `langsmith_project`: Project name for tracing
- `tracing_enabled`: Enable/disable tracing
- `evaluation_enabled`: Enable/disable evaluation
- `debounce_window`: File event debouncing time (seconds)
- `use_ready_signal`: Wait for `<filename>.ready` markers before processing atomic pushes
- `failed_dir`: Directory for failed file processing quarantine
- `max_workers`: Maximum parallel processing workers
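The `debounce_window` setting exists because editors and OneDrive often fire several change events for a single save; processing only starts once a path has been quiet for the whole window. A stripped-down sketch of that idea (the real watcher uses the watchdog library and richer state; class and method names here are illustrative):

```python
import time

class Debouncer:
    """Remembers the last event time per path and only releases a path
    once no new events have arrived for `window` seconds."""

    def __init__(self, window: float = 1.0):
        self.window = window
        self._last_event: dict = {}

    def record_event(self, path: str) -> None:
        self._last_event[path] = time.time()

    def ready(self, path: str) -> bool:
        last = self._last_event.get(path)
        return last is not None and (time.time() - last) >= self.window
```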
- `batch.size`: Number of chunks to insert into ChromaDB per batch (default 500)
- `batch.flush_every`: Optional flush cadence for very large ingest jobs
- `batch.mem_soft_limit_mb`: Soft memory cap for the batching helper
- `search.ef_search`: Overrides HNSW search ef after each batch to rebalance recall vs. latency
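The `batch.size` knob simply bounds how many chunks go into each ChromaDB `add()` call. A minimal sketch of batching against the standard chromadb client (collection name and storage path are placeholders, not the project's actual values):

```python
import chromadb

def add_chunks_in_batches(chunks, embeddings, ids, batch_size: int = 500):
    """Insert chunks into ChromaDB in fixed-size batches (mirrors batch.size)."""
    client = chromadb.PersistentClient(path="chroma_db")
    collection = client.get_or_create_collection("knowledge_base")  # hypothetical name
    for start in range(0, len(chunks), batch_size):
        end = start + batch_size
        collection.add(
            ids=ids[start:end],
            documents=chunks[start:end],
            embeddings=embeddings[start:end],
        )
```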
- Install Dependencies: `python install_rag_dependencies.py`
- Install Ollama: Download from ollama.ai
- Pull Model: `ollama pull nomic-embed-text`
- Enable RAG: Set `"rag_enabled": true` in `config.json`
- Start Processing: `python watcher_splitter.py`
python rag_search.py

# Single query
python rag_search.py --query "How do I fix vlookup errors?"
# Batch search
python rag_search.py --batch queries.txt --output results.json
# Different search types
python rag_search.py --query "Excel formulas" --search-type semantic
python rag_search.py --query "vlookup excel" --search-type keyword

streamlit run gui_app.py

Opens a browser interface for entering queries, browsing results, and viewing knowledge-base statistics.
from ollama_integration import initialize_ollama_rag
# Initialize RAG system
rag = initialize_ollama_rag()
# Search
results = rag.hybrid_search("How do I fix vlookup errors?", top_k=5)
# Display results
for result in results:
print(f"Score: {result['score']:.3f}")
print(f"Content: {result['content'][:100]}...")
print(f"Source: {result['metadata']['source_file']}")Interactive Search Session:
RAG Search Interface
==================================================
Commands:
search <query> - Search the knowledge base
semantic <query> - Semantic similarity search
keyword <query> - Keyword-based search
stats - Show knowledge base statistics
quit - Exit the interface
RAG> search How do I fix vlookup errors?
Search Results for: 'How do I fix vlookup errors?'
==================================================
1. Score: 0.847 (semantic)
Source: excel_guide.md
Type: .md
Content: VLOOKUP is used to find values in a table. Syntax: VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup]). Use FALSE for exact matches...
Keywords: vlookup, excel, formula, table
2. Score: 0.723 (semantic)
Source: troubleshooting.xlsx
Type: .xlsx
Content: Common VLOOKUP errors include #N/A when lookup value not found, #REF when table array is invalid...
Keywords: vlookup, error, troubleshooting, excel
Search completed in 0.234 seconds
Found 2 results
# Run comprehensive evaluation
python automated_eval.py
# Run specific tests
python rag_test.py
# Generate evaluation report
python -c "from automated_eval import AutomatedEvaluator; evaluator = AutomatedEvaluator({}); evaluator.generate_csv_report()"

from rag_evaluation import RAGEvaluator
from rag_integration import FaithfulnessScorer
# Initialize evaluator
evaluator = RAGEvaluator()
# Evaluate retrieval quality
retrieval_metrics = evaluator.evaluate_retrieval(
retrieved_docs=["doc1.md", "doc2.xlsx"],
relevant_docs=["doc1.md", "doc2.xlsx", "doc3.pdf"],
k_values=[1, 3, 5]
)
# Evaluate generation quality
generation_metrics = evaluator.evaluate_generation(
reference="Check data types and table references",
generated="Verify data types and table references for vlookup errors"
)
# Evaluate faithfulness
scorer = FaithfulnessScorer()
faithfulness_score = scorer.calculate_faithfulness(
answer="VLOOKUP requires exact data types",
context="VLOOKUP syntax requires exact data type matching"
)
print(f"Precision@5: {retrieval_metrics['precision_at_5']:.3f}")
print(f"ROUGE-1: {generation_metrics['rouge1']:.3f}")
print(f"Faithfulness: {faithfulness_score:.3f}")from langsmith_integration import initialize_langsmith
# Initialize LangSmith
langsmith = initialize_langsmith(
api_key="your_api_key",
project="chunker-rag-eval"
)
# Create evaluation dataset
test_queries = [
{
"query": "How do I fix vlookup errors?",
"expected_answer": "Check data types and table references",
"expected_sources": ["excel_guide.md", "troubleshooting.xlsx"]
}
]
# Run evaluation
results = langsmith.run_evaluation(test_queries, rag_function)

| Type | Extensions | Processing Method | Metadata Extracted |
|---|---|---|---|
| Text | .txt, .md | Direct text processing | Word count, sentences, keywords |
| Structured | .json, .csv, .yaml | Parsed structure | Schema, data types, samples |
| Office | .xlsx, .xls, .docx | Library extraction | Sheets, formulas, formatting |
| Code | .py, .ps1, .sql, .m, .dax | AST/parsing | Functions, classes, imports, docstrings |
| Documents | .pdf | Text extraction | Pages, metadata, text content |
| Models | .slx | Specialized extraction | Model structure, parameters |
from watchdog_system import create_watchdog_monitor
# Initialize watchdog monitor
monitor = create_watchdog_monitor(config, process_callback)
# Start monitoring
monitor.start()
# Monitor stats
stats = monitor.get_stats()
print(f"Queue size: {stats['queue_size']}")
print(f"Processing files: {stats['processing_files']}")from file_processors import process_excel_file, process_pdf_file
# Process specific file types
excel_content = process_excel_file("", "data.xlsx")
pdf_content = process_pdf_file("", "document.pdf")

from embedding_helpers import EmbeddingManager
# Initialize embedding manager
manager = EmbeddingManager(chunk_size=1000, chunk_overlap=200)
# Process files for embedding
# batch_process_files and extract_keywords_func are provided elsewhere in the project's helpers
results = batch_process_files(file_paths, manager, extract_keywords_func)

- Parallel Processing: Multi-threaded file processing with configurable workers
- Streaming: Large file support with memory-efficient streaming
- Caching: FAISS index persistence for fast startup
- Debouncing: Prevents duplicate processing of rapidly changing files
- Graceful Degradation: Continues working even if optional components fail
- Pydantic-Core Version Incompatible (Watcher/Manual Processing Crashes)

  ```bash
  # ChromaDB/deduplication requires pydantic-core 2.41.5; 2.42.0 causes SystemError on import
  pip install pydantic-core==2.41.5
  ```

  If the watcher or `manual_process_files.py` crashes immediately with `SystemError: pydantic-core version incompatible`, run the above, then restart the watcher or re-run manual processing. The current `deduplication.py` also treats that import failure like "Chroma unavailable", so the process may start without dedup until the versions are fixed.

- Start-Watcher.ps1 exits before Python runs
  - KB_Shared must resolve: either `%OneDriveCommercial%` (or `OneDrive`/`OneDriveConsumer`) must expand to a path that contains `KB_Shared`, or the `config.json` `output_dir` must expand to a real folder under `KB_Shared` (the script derives `KB_Shared` as the parent of `04_output`).
  - Python must be on PATH as `python` or `py` (the launcher uses `py -3` as a fallback).

- ChromaDB Installation Fails (Windows)

  ```bash
  # Use FAISS instead
  pip install faiss-cpu
  # Or install build tools
  # Or use Docker deployment
  ```

- Ollama Not Available

  ```bash
  # Install Ollama from https://ollama.ai/
  # Pull the model
  ollama pull nomic-embed-text
  ```

- Memory Issues with Large Files

  ```
  # Enable streaming in config
  "enable_streaming": true,
  "stream_chunk_size": 1048576  # 1MB chunks
  ```

- UnicodeEncodeError in PowerShell Logs (Windows)

  ```powershell
  # Switch console to UTF-8 before starting the watcher
  chcp 65001
  Set-Item env:PYTHONIOENCODING utf-8
  python watcher_splitter.py
  ```

  This prevents logging failures when filenames contain emoji or other non-ASCII characters.
- Chunk Size: Adjust based on content type (75 for police, 150 for admin)
- Parallel Workers: Set based on CPU cores (default: 4)
- Debounce Window: Increase for slow file systems (default: 1s)
- Index Persistence: Enable for faster startup after restart
- Database Tracking: SQLite database with processing statistics
- Session Metrics: Files processed, chunks created, performance metrics
- Error Logging: Comprehensive error tracking and notification
- System Metrics: CPU, memory, disk usage monitoring
- RAG Metrics: Search performance, evaluation scores, user feedback
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Ollama for local embedding models
- FAISS for vector similarity search
- LangChain for RAG framework
- LangSmith for evaluation and tracing
- Watchdog for file system monitoring
This project is version-controlled using Git and backed up to GitHub.
Remote Repository: https://github.com/racmac57/chunker_Web.git
# Check status
git status
# Stage and commit changes
git add -A
git commit -m "Description of changes"
# Push to GitHub
git push origin main
# View commit history
git log --oneline -10

The following are automatically excluded via .gitignore:
- Processed documents (`99_doc/`, `04_output/`)
- Archived files (`03_archive/`)
- Database files (`*.db`, `*.sqlite`)
- Log files (`logs/`, `*.log`)
- Virtual environments (`.venv/`, `venv/`)
- NLTK data (`nltk_data/`)
- Temporary and backup files
- Clone the repository: `git clone https://github.com/racmac57/chunker_Web.git`
- Create a feature branch: `git checkout -b feature-name`
- Make changes and commit: `git commit -m "Feature: description"`
- Push to your fork and create a pull request
For detailed Git setup information, see GIT_SETUP_STATUS.md.
Last Cleanup: 2025-10-31 19:22:39
Items Scanned: 16595
Items Moved: 7
Items Deleted: 627
Snapshots Pruned: 0
Snapshot Policy: Keep only the latest legacy snapshot per project. Older snapshots are pruned during maintenance. Config backups follow the same policy.
Log Location: logs/archive/ (example historical run: logs/archive/maintenance/2025_10_31_19_16_35/)
Git Status: Repository initialized, connected to GitHub, and regularly backed up