Implement Repository Indexer - Integration Layer #12

@prosdev

Description

Implement the orchestration layer that integrates the Repository Scanner (#3), Embedding Provider, and Vector Storage (#4) into a cohesive indexing pipeline. The Repository Indexer is responsible for coordinating the full workflow: scanning code, generating embeddings, storing vectors, and managing incremental updates.

Acceptance Criteria

  • Full Pipeline: Implement end-to-end indexing from repository path to searchable vector store
  • Batch Processing: Efficiently process large repositories with batched embedding generation
  • Progress Tracking: Provide clear progress indicators during indexing operations
  • Incremental Updates: Support incremental re-indexing based on file change detection
  • Error Handling: Gracefully handle scanner/embedder/storage errors with meaningful messages
  • Statistics: Return indexing statistics (files scanned, documents indexed, time taken)
  • Configuration: Support configurable batch sizes, parallel processing, and exclusion patterns
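
The batch processing, progress tracking, and error handling criteria could fit together roughly as in the sketch below. `toBatches`, `processInBatches`, and `BatchProgress` are hypothetical names introduced for illustration and are not part of the proposed interface. The sketch assumes that a failed batch should be recorded and skipped rather than aborting the whole run.

```typescript
// Hypothetical helper: split items into fixed-size batches.
function toBatches<T>(items: readonly T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical progress payload for a CLI progress bar.
interface BatchProgress {
  completed: number; // items attempted so far (including failed batches)
  total: number;     // total items to process
}

// Run `handler` over each batch in order. A failing batch is recorded in
// `errors` and the loop continues (assumption: partial results are useful).
async function processInBatches<T, R>(
  items: readonly T[],
  batchSize: number,
  handler: (batch: T[]) => Promise<R[]>,
  onProgress?: (progress: BatchProgress) => void,
): Promise<{ results: R[]; errors: Error[] }> {
  const results: R[] = [];
  const errors: Error[] = [];
  let completed = 0;
  for (const batch of toBatches(items, batchSize)) {
    try {
      results.push(...(await handler(batch)));
    } catch (err) {
      errors.push(err instanceof Error ? err : new Error(String(err)));
    }
    completed += batch.length;
    onProgress?.({ completed, total: items.length });
  }
  return { results, errors };
}
```

Processing batches sequentially keeps the sketch simple. The parallel processing option from the Configuration criterion would need a concurrency limit on top of this, which is left out here.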

Core Interface

interface RepositoryIndexer {
  // Full repository indexing
  index(repoPath: string, options?: IndexOptions): Promise<IndexStats>;
  
  // Incremental update (only changed files)
  update(repoPath: string, options?: UpdateOptions): Promise<IndexStats>;
  
  // Search indexed content
  search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
  
  // Get indexing status and statistics
  getStats(): Promise<IndexStats>;
}

interface IndexOptions {
  batchSize?: number;         // Documents per embedding batch
  excludePatterns?: string[]; // Glob patterns to exclude
  languages?: string[];       // Limit to specific languages
  force?: boolean;            // Force re-index even if unchanged
}

interface IndexStats {
  filesScanned: number;
  documentsIndexed: number;
  vectorsStored: number;
  duration: number;        // milliseconds
  errors?: IndexError[];
}
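
`IndexStats` references an `IndexError` type that this issue does not define. The sketch below shows one possible shape for it, together with a helper that combines statistics from several partial runs (for example, one per batch). The `IndexError` fields and the `mergeStats` name are assumptions made for illustration only. `IndexStats` is repeated from above so the block is self-contained.

```typescript
// Assumed shape; the issue leaves IndexError undefined.
interface IndexError {
  file: string;                              // path of the file that failed
  stage: "scan" | "embed" | "store";         // pipeline stage that raised the error
  message: string;                           // human-readable description
}

// Copied from the Core Interface above.
interface IndexStats {
  filesScanned: number;
  documentsIndexed: number;
  vectorsStored: number;
  duration: number; // milliseconds
  errors?: IndexError[];
}

// Hypothetical helper: sum the counters and concatenate the error lists.
// `errors` is omitted from the result when neither input reported any.
function mergeStats(a: IndexStats, b: IndexStats): IndexStats {
  const errors = [...(a.errors ?? []), ...(b.errors ?? [])];
  return {
    filesScanned: a.filesScanned + b.filesScanned,
    documentsIndexed: a.documentsIndexed + b.documentsIndexed,
    vectorsStored: a.vectorsStored + b.vectorsStored,
    duration: a.duration + b.duration,
    ...(errors.length > 0 ? { errors } : {}),
  };
}
```

Summing `duration` suits sequential batches. If batches run in parallel, wall-clock time measured around the whole run would be the more honest figure.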

Technical Requirements

  • Wire together ScannerRegistry, EmbeddingProvider, and VectorStore
  • Implement file change detection (content hash-based)
  • Support batched embedding generation for efficiency
  • Track indexing metadata (timestamps, file hashes, embedder version)
  • Implement progress callbacks for CLI integration
  • Add comprehensive integration tests using all three components
  • Handle edge cases: empty repos, binary files, very large files
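
Content hash-based change detection and the binary-file edge case might look something like the sketch below. It uses SHA-256 from Node's built-in `node:crypto` module. The names `hashContent`, `detectChanges`, `looksBinary`, and `ChangeSet` are hypothetical, and the choice of SHA-256 and of a NUL-byte heuristic for binary detection are assumptions, not requirements stated in this issue.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of a file's content.
function hashContent(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical result of comparing two indexing runs.
interface ChangeSet {
  added: string[];    // present now, absent in the previous run
  modified: string[]; // present in both, hash differs
  removed: string[];  // present previously, absent now
  unchanged: string[];
}

// Compare path -> hash maps from the previous and current runs.
function detectChanges(
  previous: ReadonlyMap<string, string>,
  current: ReadonlyMap<string, string>,
): ChangeSet {
  const changes: ChangeSet = { added: [], modified: [], removed: [], unchanged: [] };
  for (const [path, hash] of current) {
    const oldHash = previous.get(path);
    if (oldHash === undefined) changes.added.push(path);
    else if (oldHash !== hash) changes.modified.push(path);
    else changes.unchanged.push(path);
  }
  for (const path of previous.keys()) {
    if (!current.has(path)) changes.removed.push(path);
  }
  return changes;
}

// Heuristic (assumption): treat a file as binary if a NUL byte appears
// in its first `sampleSize` bytes. Empty files count as text.
function looksBinary(bytes: Uint8Array, sampleSize = 8000): boolean {
  return bytes.subarray(0, sampleSize).includes(0);
}
```

An `update` call would then re-embed only the `added` and `modified` paths and delete vectors for the `removed` ones. Storing the embedder version alongside the hashes would allow a full re-index to be forced when the embedding model changes.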

Integration Flow

1. Walk repository file tree
2. For each file:
   a. Detect language
   b. Select appropriate scanner
   c. Extract Documents
3. Batch Documents for embedding
4. Generate embeddings via EmbeddingProvider
5. Store vectors + metadata in VectorStore
6. Track indexing state for incremental updates
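
The flow above can be sketched as a single function. Every interface here (`Doc`, `Scanner`, `Embedder`, `Store`) is a hypothetical stand-in, since this issue does not specify the real ScannerRegistry, EmbeddingProvider, or VectorStore APIs. An in-memory map of path to source text stands in for the file-tree walk of step 1, and step 6 (persisting indexing state) is omitted.

```typescript
// Hypothetical minimal shapes for the three components.
interface Doc { id: string; path: string; text: string }
interface Scanner {
  supports(path: string): boolean;
  extract(path: string, source: string): Doc[];
}
interface Embedder { embed(texts: string[]): Promise<number[][]> }
interface Store { upsert(items: { doc: Doc; vector: number[] }[]): Promise<void> }

interface PipelineStats {
  filesScanned: number;
  documentsIndexed: number;
  vectorsStored: number;
  duration: number; // milliseconds
}

async function runPipeline(
  files: ReadonlyMap<string, string>, // path -> source text (stands in for step 1)
  scanners: Scanner[],
  embedder: Embedder,
  store: Store,
  batchSize = 32,
): Promise<PipelineStats> {
  const started = Date.now();
  const docs: Doc[] = [];
  let filesScanned = 0;

  // Step 2: pick the first scanner that supports each file and extract documents.
  for (const [path, source] of files) {
    const scanner = scanners.find((s) => s.supports(path));
    if (!scanner) continue; // no scanner for this file type; skip it
    filesScanned += 1;
    docs.push(...scanner.extract(path, source));
  }

  // Steps 3-5: batch documents, embed each batch, store vectors with metadata.
  let vectorsStored = 0;
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    const vectors = await embedder.embed(batch.map((d) => d.text));
    if (vectors.length !== batch.length) {
      throw new Error(`expected ${batch.length} vectors, got ${vectors.length}`);
    }
    // Lengths were checked above, so vectors[j] is defined for every j.
    await store.upsert(batch.map((doc, j) => ({ doc, vector: vectors[j]! })));
    vectorsStored += batch.length;
  }

  return {
    filesScanned,
    documentsIndexed: docs.length,
    vectorsStored,
    duration: Date.now() - started,
  };
}
```

In this sketch `filesScanned` counts only files that some scanner accepted. Whether skipped files should also be counted is a decision the issue leaves open.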

Dependencies

Branch: feat/repository-indexer
Priority: High
Estimate: 3 days
Parent Epic: #1
