Description
Implement the orchestration layer that integrates the Repository Scanner (#3), Embedding Provider, and Vector Storage (#4) into a cohesive indexing pipeline. The Repository Indexer is responsible for coordinating the full workflow: scanning code, generating embeddings, storing vectors, and managing incremental updates.
Acceptance Criteria
- Full Pipeline: Implement end-to-end indexing from repository path to searchable vector store
- Batch Processing: Efficiently process large repositories with batched embedding generation
- Progress Tracking: Provide clear progress indicators during indexing operations
- Incremental Updates: Support incremental re-indexing based on file change detection
- Error Handling: Gracefully handle scanner/embedder/storage errors with meaningful messages
- Statistics: Return indexing statistics (files scanned, documents indexed, time taken)
- Configuration: Support configurable batch sizes, parallel processing, exclusion patterns
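The Error Handling and Statistics criteria could be met by recording failures per stage rather than aborting the whole run. The sketch below is one possible shape, not the project's actual implementation. `StatsCollector`, `StageError`, and `Stage` are hypothetical names, since the issue does not define the fields of `IndexError`. The work is synchronous for brevity, whereas the real pipeline stages would be async.

```typescript
// Hypothetical sketch: none of these names come from the issue itself.
type Stage = "scan" | "embed" | "store";

// Assumed error shape; the issue leaves IndexError undefined.
interface StageError {
  stage: Stage;
  file: string | undefined;
  message: string;
}

class StatsCollector {
  filesScanned = 0;
  documentsIndexed = 0;
  vectorsStored = 0;
  readonly errors: StageError[] = [];
  private readonly started = Date.now();

  // Run one unit of work. On failure, record a descriptive error and
  // return undefined so the caller can continue with the next file or batch.
  attempt<T>(stage: Stage, file: string | undefined, work: () => T): T | undefined {
    try {
      return work();
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      const where = file === undefined ? "" : ` for ${file}`;
      this.errors.push({ stage, file, message: `${stage} failed${where}: ${reason}` });
      return undefined;
    }
  }

  // Snapshot in the shape of IndexStats (duration in milliseconds).
  finish() {
    return {
      filesScanned: this.filesScanned,
      documentsIndexed: this.documentsIndexed,
      vectorsStored: this.vectorsStored,
      duration: Date.now() - this.started,
      errors: this.errors.length > 0 ? [...this.errors] : undefined,
    };
  }
}
```

A caller would wrap each scanner, embedder, or storage call in `attempt(...)`, update the counters on success, and return `finish()` as the result of `index()` or `update()`.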
Core Interface
interface RepositoryIndexer {
  // Full repository indexing
  index(repoPath: string, options?: IndexOptions): Promise<IndexStats>;
  // Incremental update (only changed files)
  update(repoPath: string, options?: UpdateOptions): Promise<IndexStats>;
  // Search indexed content
  search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
  // Get indexing status and statistics
  getStats(): Promise<IndexStats>;
}

interface IndexOptions {
  batchSize?: number;         // Documents per embedding batch
  excludePatterns?: string[]; // Glob patterns to exclude
  languages?: string[];       // Limit to specific languages
  force?: boolean;            // Force re-index even if unchanged
}

interface IndexStats {
  filesScanned: number;
  documentsIndexed: number;
  vectorsStored: number;
  duration: number;           // milliseconds
  errors?: IndexError[];
}
Technical Requirements
- Wire together ScannerRegistry, EmbeddingProvider, and VectorStore
- Implement file change detection (content hash-based)
- Support batched embedding generation for efficiency
- Track indexing metadata (timestamps, file hashes, embedder version)
- Implement progress callbacks for CLI integration
- Add comprehensive integration tests using all three components
- Handle edge cases: empty repos, binary files, very large files
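Content hash-based change detection, as listed above, might look like the following minimal sketch. It assumes the indexer persists a map of file path to content hash from the previous run. `hashContent`, `diffHashes`, and `ChangeSet` are hypothetical names, and SHA-256 is an assumed choice of hash, not something the issue specifies. `createHash` comes from Node's built-in `node:crypto` module.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of a file's content (assumed hash choice).
function hashContent(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical result type: which paths need (re)indexing or removal.
interface ChangeSet {
  added: string[];
  changed: string[];
  removed: string[];
}

// Compare the hashes stored by the previous run against the current scan.
function diffHashes(
  previous: ReadonlyMap<string, string>,
  current: ReadonlyMap<string, string>,
): ChangeSet {
  const changes: ChangeSet = { added: [], changed: [], removed: [] };
  current.forEach((hash, path) => {
    const old = previous.get(path);
    if (old === undefined) {
      changes.added.push(path); // new file since the last run
    } else if (old !== hash) {
      changes.changed.push(path); // content differs from the last run
    }
  });
  previous.forEach((_hash, path) => {
    if (!current.has(path)) {
      changes.removed.push(path); // file deleted since the last run
    }
  });
  return changes;
}
```

Under this approach, `update()` would re-embed only the `added` and `changed` paths and delete vectors for the `removed` ones, while `force: true` would treat every file as changed.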
Integration Flow
1. Walk repository file tree
2. For each file:
   a. Detect language
   b. Select appropriate scanner
   c. Extract Documents
3. Batch Documents for embedding
4. Generate embeddings via EmbeddingProvider
5. Store vectors + metadata in VectorStore
6. Track indexing state for incremental updates
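Steps 3 to 5 of the flow above, together with a progress callback, can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual API: `Doc`, `Embedder`, `Store`, `ProgressFn`, `chunk`, and `embedAndStore` are hypothetical stand-ins for the real Document, EmbeddingProvider, and VectorStore types, whose signatures the issue does not give. The default batch size of 32 is likewise an arbitrary placeholder.

```typescript
// Hypothetical stand-ins for the real Document, EmbeddingProvider and VectorStore.
interface Doc {
  id: string;
  text: string;
}
interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}
interface Store {
  add(docs: Doc[], vectors: number[][]): Promise<void>;
}
type ProgressFn = (done: number, total: number) => void;

// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed and store documents batch by batch, reporting progress after each batch.
async function embedAndStore(
  docs: Doc[],
  embedder: Embedder,
  store: Store,
  batchSize = 32, // assumed default, not from the issue
  onProgress?: ProgressFn,
): Promise<{ documentsIndexed: number; vectorsStored: number; duration: number }> {
  const started = Date.now();
  let documentsIndexed = 0;
  let vectorsStored = 0;
  for (const batch of chunk(docs, batchSize)) {
    const vectors = await embedder.embed(batch.map((d) => d.text));
    await store.add(batch, vectors);
    documentsIndexed += batch.length;
    vectorsStored += vectors.length;
    onProgress?.(documentsIndexed, docs.length);
  }
  return { documentsIndexed, vectorsStored, duration: Date.now() - started };
}
```

Processing batches sequentially keeps memory bounded and makes progress reporting simple. Parallel batch processing, which the acceptance criteria also mention, would need a concurrency limit layered on top of this loop.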
Dependencies
- Requires: Issue #3, Implement Repository Scanner with ts-morph and Remark (Scanner) - MUST be completed first
- Requires: Issue #4, Implement Vector Storage with LanceDB and Transformers.js (Vector Storage) - MUST be completed first
- Blocks: Issue #6, Implement CLI and Integration Examples (CLI Integration)
Branch: feat/repository-indexer
Priority: High
Estimate: 3 days
Parent Epic: #1