7 changes: 7 additions & 0 deletions .clinerules
@@ -46,6 +46,13 @@ CodeCritique is an AI-powered code review tool that:
- Avoid raw `console.log(...)` in normal code paths
- Document logging-related options such as `verbose` in JSDoc and pass them through to downstream helpers

### Incremental Embeddings

- Treat `content_hash` as the source of truth for embedding freshness. Do not rely on `mtime` for correctness, especially in CI.
- Full `embeddings:generate` scans may prune stale file and document embeddings. Partial `--files` runs must not prune unrelated embeddings.
- Project-structure embeddings and retrieval are scoped by full `project_path`, not directory basename.
- Retrieval should only use project-scoped rows whose backing files still exist.

### Key Commands

```bash
7 changes: 7 additions & 0 deletions .cursorrules
@@ -53,6 +53,13 @@ CodeCritique is an AI-powered code review tool that:
- Avoid raw `console.log(...)` in normal code paths
- Document logging-related options such as `verbose` in JSDoc and pass them through to downstream helpers

### Incremental Embeddings

- Treat `content_hash` as the source of truth for embedding freshness. Do not rely on `mtime` for correctness, especially in CI.
- Full `embeddings:generate` scans may prune stale file and document embeddings. Partial `--files` runs must not prune unrelated embeddings.
- Project-structure embeddings and retrieval are scoped by full `project_path`, not directory basename.
- Retrieval should only use project-scoped rows whose backing files still exist.

### Key Commands

```bash
7 changes: 7 additions & 0 deletions .roo/rules/01-project-rules.md
@@ -43,6 +43,13 @@ For comprehensive guidelines, see **AGENTS.md** in the project root.
- Avoid raw `console.log(...)` in normal code paths
- Document logging-related options such as `verbose` in JSDoc and pass them through to downstream helpers

### Incremental Embeddings

- Treat `content_hash` as the source of truth for embedding freshness. Do not rely on `mtime` for correctness, especially in CI.
- Full `embeddings:generate` scans may prune stale file and document embeddings. Partial `--files` runs must not prune unrelated embeddings.
- Project-structure embeddings and retrieval are scoped by full `project_path`, not directory basename.
- Retrieval should only use project-scoped rows whose backing files still exist.

### Key Commands

```bash
11 changes: 11 additions & 0 deletions AGENTS.md
@@ -68,6 +68,16 @@ This document provides comprehensive guidance for AI coding agents working on th
2. **Analyze PR History** (optional): Fetch PR comments from GitHub → extract patterns → store for context
3. **Code Review**: Retrieve similar code examples + documentation → provide context to LLM → generate review

### Incremental Embeddings Rules

- Treat `content_hash` as the source of truth for embedding freshness. Do not rely on `mtime` to decide whether embeddings are current, especially in CI.
- `embeddings:generate` has two modes:
- Full directory scans update changed files and prune stale embeddings for deleted files and documents.
- Partial `--files` runs update only the requested files and must not prune unrelated embeddings.
- Project-structure embeddings are scoped by full `project_path`, not by directory basename.
- Retrieval should only return project-scoped context and should validate that referenced files still exist before using stored content.
- Project summaries should be refreshed when the embedding inventory changes, not only when previously tracked key files change.
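
The `content_hash` freshness rule can be sketched as follows. This is an illustrative sketch only: a plain `Map` stands in for the LanceDB table, and `isEmbeddingFresh` is a hypothetical name, not part of the CodeCritique API.

```javascript
// Hypothetical sketch: decide embedding freshness from content_hash,
// never from mtime. A plain Map stands in for the LanceDB store.
import crypto from 'node:crypto';

function contentHash(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// An embedding is fresh only when the stored hash matches the current bytes.
function isEmbeddingFresh(store, filePath, content) {
  const stored = store.get(filePath);
  return stored !== undefined && stored.content_hash === contentHash(content);
}

const store = new Map();
store.set('src/a.js', { content_hash: contentHash('export const a = 1;\n') });

// Same bytes stay fresh even if a CI checkout rewrote the file's mtime.
console.log(isEmbeddingFresh(store, 'src/a.js', 'export const a = 1;\n')); // true
// Changed bytes are stale and need regeneration.
console.log(isEmbeddingFresh(store, 'src/a.js', 'export const a = 2;\n')); // false
```

Because the decision depends only on file contents, it behaves identically on a developer machine and in CI, where checkouts routinely rewrite `mtime`.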

---

## Directory Structure
@@ -114,6 +124,7 @@ src/
│ ├── logging.js # Logging utilities
│ ├── markdown.js # Markdown processing
│ ├── mobilebert-tokenizer.js # Tokenization utilities
│ ├── path-utils.js # Project path boundary helpers
│ ├── pr-chunking.js # PR chunking for large diffs
│ └── string-utils.js # String manipulation utilities
7 changes: 7 additions & 0 deletions CLAUDE.md
@@ -54,6 +54,13 @@ CodeCritique is an AI-powered code review tool that:
- Avoid raw `console.log(...)` in normal code paths
- Document logging-related options such as `verbose` in JSDoc and pass them through to downstream helpers

### Incremental Embeddings

- Treat `content_hash` as the source of truth for embedding freshness. Do not rely on `mtime` for correctness, especially in CI.
- Full `embeddings:generate` scans may prune stale file and document embeddings. Partial `--files` runs must not prune unrelated embeddings.
- Project-structure embeddings and retrieval are scoped by full `project_path`, not directory basename.
- Retrieval should only use project-scoped rows whose backing files still exist.

### Key Commands

```bash
1 change: 1 addition & 0 deletions CONTRIBUTING.md
@@ -103,6 +103,7 @@ src/
├── utils/ # Utility functions
│ ├── git.js # Git operations
│ ├── file-validation.js # File validation
│ ├── path-utils.js # Project path boundary helpers
│ └── ...
└── test-utils/ # Test utilities and fixtures
.github/
14 changes: 14 additions & 0 deletions docs/ARCHITECTURE.md
@@ -33,6 +33,18 @@ Additionally, [Hugging Face Transformers.js](https://github.com/huggingface/tran

[LanceDB](https://lancedb.com/) stores embeddings for fast similarity search. The vector database maintains an index of all embedded code examples, documentation, and review patterns, enabling efficient retrieval of relevant context. LanceDB's columnar storage format provides excellent performance for similarity searches across large codebases. It supports hybrid search combining vector similarity and full-text search with Reciprocal Rank Fusion (RRF) for optimal results.
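
As a rough, library-independent illustration of RRF (this sketch is not LanceDB's implementation): each document's fused score sums `1 / (k + rank)` over the vector and full-text result lists, so items ranked well in both lists rise to the top.

```javascript
// Illustrative Reciprocal Rank Fusion over two ranked id lists.
// k dampens the influence of top ranks; 60 is a common default.
function rrfFuse(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // rank is 1-based, so the i-th entry contributes 1 / (k + i + 1)
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

const vectorHits = ['a', 'b', 'c']; // ranked by vector similarity
const textHits = ['b', 'c', 'd'];   // ranked by full-text match
console.log(rrfFuse([vectorHits, textHits])); // 'b' ranks first: strong in both lists
```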

### Incremental Embedding Generation

Embedding generation is incremental and project-scoped. The system stores file embeddings, documentation chunk embeddings, and project-structure embeddings with a `project_path` so multiple repositories can share the same local LanceDB instance without mixing context.

Freshness is determined by `content_hash`, not by file modification time. This is important in CI environments where checkout operations can rewrite `mtime` values even when file contents are unchanged. During generation:

- Full `embeddings:generate` directory scans update new or changed files and prune stale embeddings for deleted files and removed documentation.
- Partial `embeddings:generate --files ...` runs update only the requested files and do not prune unrelated embeddings.
- Project-structure embeddings are regenerated only when the structure content changes and are keyed by full project identity rather than directory basename.

This design keeps review context current without forcing full regeneration on every run.
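
The two modes above can be sketched as follows. A plain `Map` stands in for the embedding store, and `generateEmbeddings` with its options object is an illustrative shape, not the real CLI surface.

```javascript
// Sketch: full scans update and prune; partial --files runs only update.
function generateEmbeddings(store, currentFiles, { files = null } = {}) {
  const targets = files ?? [...currentFiles.keys()];
  for (const filePath of targets) {
    if (currentFiles.has(filePath)) {
      store.set(filePath, { embedded: true });
    }
  }
  // Prune only on full scans: a partial --files run must leave
  // unrelated embeddings untouched.
  if (files === null) {
    for (const storedPath of [...store.keys()]) {
      if (!currentFiles.has(storedPath)) store.delete(storedPath);
    }
  }
}

const store = new Map([['deleted.js', { embedded: true }]]);
const current = new Map([['kept.js', 'code']]);

generateEmbeddings(store, current, { files: ['kept.js'] }); // partial: no pruning
console.log(store.has('deleted.js')); // true

generateEmbeddings(store, current); // full scan: stale entry pruned
console.log(store.has('deleted.js')); // false
```

Gating pruning on the full-scan mode is what makes partial runs safe to use in pre-commit hooks without erasing embeddings for files the run never looked at.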

### Context Retrieval

Finds relevant code examples, documentation, and historical patterns. When reviewing code, the system queries the vector database to retrieve:
@@ -42,6 +54,8 @@ Finds relevant code examples, documentation, and historical patterns. When revie
- Historical review patterns from past pull requests that show how similar issues were addressed
- Custom guidelines and rules specific to your project

Retrieval is also defensive about freshness and project isolation. Results are filtered by `project_path` when available, validated against project path boundaries, and checked for file existence before stored content is used in review prompts. This prevents stale embeddings from deleted files and cross-project collisions from leaking into code review context.
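
A simplified sketch of that guard (the boundary check mirrors the idea behind the `isPathWithinProject` helper; result shapes here are illustrative, and the real code also batch-checks that files exist on disk):

```javascript
// Sketch: keep only results that belong to this project and resolve
// to a path inside the project root.
import path from 'node:path';

function withinProject(absolutePath, projectPath) {
  const relative = path.relative(projectPath, absolutePath);
  return relative !== '' && !relative.startsWith('..') && !path.isAbsolute(relative);
}

function scopeResults(results, projectPath) {
  return results.filter((r) => {
    if (r.project_path && r.project_path !== projectPath) return false;
    const abs = path.isAbsolute(r.path) ? r.path : path.resolve(projectPath, r.path);
    return withinProject(abs, projectPath);
  });
}

const results = [
  { path: 'src/a.js', project_path: '/repo' },
  { path: '../other/b.js', project_path: '/repo' }, // escapes the project root
  { path: 'src/c.js', project_path: '/elsewhere' }, // different project
];
console.log(scopeResults(results, '/repo').map((r) => r.path)); // ['src/a.js']
```

Using `path.relative` instead of a raw string-prefix check avoids false positives for sibling directories such as `/repo-archive` when the project root is `/repo`.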

### LLM Integration

[Anthropic Claude](https://www.anthropic.com/) (claude-sonnet-4-5 by default) analyzes code with rich contextual information. The retrieved context is combined with the code being reviewed and sent to Claude, which performs a comprehensive analysis considering:
210 changes: 73 additions & 137 deletions src/content-retrieval.js
@@ -24,6 +24,8 @@ import { inferContextFromDocumentContent } from './utils/context-inference.js';
import { isGenericDocument, getGenericDocumentContext } from './utils/document-detection.js';
import { isDocumentationFile } from './utils/file-validation.js';
import { debug, verboseLog } from './utils/logging.js';
import { isPathWithinProject } from './utils/path-utils.js';
import { escapeSqlString } from './utils/string-utils.js';

const FILE_EMBEDDINGS_TABLE = TABLE_NAMES.FILE_EMBEDDINGS;
const DOCUMENT_CHUNK_TABLE = TABLE_NAMES.DOCUMENT_CHUNK;
@@ -54,6 +56,58 @@ export class ContentRetriever {
this.cleaningUp = false;
}

resolveProjectResultPath(filePath, resolvedProjectPath) {
if (!filePath) {
return null;
}

const absolutePath = path.isAbsolute(filePath) ? path.resolve(filePath) : path.resolve(resolvedProjectPath, filePath);
return isPathWithinProject(absolutePath, resolvedProjectPath) ? absolutePath : null;
}

async filterResultsForProject(results, resolvedProjectPath, getPath) {
const resultsToCheck = [];
const projectMatchMap = new Map();

for (let i = 0; i < results.length; i++) {
const result = results[i];
const resultPath = getPath(result);

if (result.project_path && result.project_path !== resolvedProjectPath) {
projectMatchMap.set(i, false);
continue;
}

const absolutePath = this.resolveProjectResultPath(resultPath, resolvedProjectPath);
if (!absolutePath) {
projectMatchMap.set(i, false);
continue;
}

resultsToCheck.push({ index: i, absolutePath, resultPath });
}

if (resultsToCheck.length > 0) {
const existenceResults = await Promise.all(
resultsToCheck.map(async ({ index, absolutePath, resultPath }) => {
try {
await fs.promises.access(absolutePath, fs.constants.F_OK);
return { index, exists: true };
} catch {
debug(`Filtering out non-existent project file: ${resultPath}`);
return { index, exists: false };
}
})
);

for (const { index, exists } of existenceResults) {
projectMatchMap.set(index, exists);
}
}

return results.filter((result, index) => projectMatchMap.get(index) === true);
}

/**
* Find relevant documentation with sophisticated reranking
* @param {string} queryText - The search query
@@ -99,7 +153,7 @@ export class ContentRetriever {
try {
const tableSchema = await table.schema;
if (tableSchema?.fields?.some((field) => field.name === 'project_path')) {
query = query.where(`project_path = '${resolvedProjectPath.replace(/'/g, "''")}'`);
query = query.where(`project_path = '${escapeSqlString(resolvedProjectPath)}'`);
debug(`Filtering documentation by project_path: ${resolvedProjectPath}`);
}
} catch (schemaError) {
Expand All @@ -109,65 +163,11 @@ export class ContentRetriever {
const results = await query.limit(Math.max(limit * 3, 20)).toArray();
verboseLog(options, chalk.green(`Native hybrid search returned ${results.length} documentation results`));

// OPTIMIZATION: Enhanced batch file existence checks with parallel processing
const docsToCheck = [];
const docProjectMatchMap = new Map();

// First pass: collect files that need existence checking
for (let i = 0; i < results.length; i++) {
const result = results[i];

if (result.project_path) {
docProjectMatchMap.set(i, result.project_path === resolvedProjectPath);
continue;
}

if (!result.original_document_path) {
docProjectMatchMap.set(i, false);
continue;
}

const filePath = result.original_document_path;
try {
if (path.isAbsolute(filePath)) {
docProjectMatchMap.set(i, filePath.startsWith(resolvedProjectPath));
continue;
}

const absolutePath = path.resolve(resolvedProjectPath, filePath);
if (absolutePath.startsWith(resolvedProjectPath)) {
// Mark for batch existence check
docsToCheck.push({ result, index: i, absolutePath, filePath });
} else {
docProjectMatchMap.set(i, false);
}
} catch (error) {
debug(`Error filtering result for project: ${error.message}`);
docProjectMatchMap.set(i, false);
}
}

// Enhanced batch check file existence with improved error handling
if (docsToCheck.length > 0) {
debug(`[OPTIMIZATION] Batch checking existence of ${docsToCheck.length} documentation files`);
const existencePromises = docsToCheck.map(async ({ index, absolutePath, filePath }) => {
try {
await fs.promises.access(absolutePath, fs.constants.F_OK);
return { index, exists: true };
} catch {
debug(`Filtering out non-existent documentation file: ${filePath}`);
return { index, exists: false };
}
});

const existenceResults = await Promise.all(existencePromises);
for (const { index, exists } of existenceResults) {
docProjectMatchMap.set(index, exists);
}
}

// Filter results based on project match using the map
const projectFilteredResults = results.filter((result, index) => docProjectMatchMap.get(index) === true);
const projectFilteredResults = await this.filterResultsForProject(
results,
resolvedProjectPath,
(result) => result.original_document_path
);

verboseLog(options, chalk.blue(`Filtered to ${projectFilteredResults.length} documentation results from current project`));
let finalResults = projectFilteredResults.map((result) => {
@@ -493,13 +493,13 @@ export class ContentRetriever {
if (queryFilePath) {
const normalizedQueryPath = path.resolve(resolvedProjectPath, queryFilePath);
// Add condition to exclude the file being reviewed
const escapedPath = normalizedQueryPath.replace(/'/g, "''");
const escapedPath = escapeSqlString(normalizedQueryPath);
conditions.push(`path != '${escapedPath}'`);

// Also check for relative path variants to be thorough
const relativePath = path.relative(resolvedProjectPath, normalizedQueryPath);
if (relativePath && !relativePath.startsWith('..')) {
const escapedRelativePath = relativePath.replace(/'/g, "''");
const escapedRelativePath = escapeSqlString(relativePath);
conditions.push(`path != '${escapedRelativePath}'`);
}

Expand All @@ -515,7 +515,7 @@ export class ContentRetriever {

if (hasProjectPathField) {
// Use exact match for project path
conditions.push(`project_path = '${resolvedProjectPath.replace(/'/g, "''")}'`);
conditions.push(`project_path = '${escapeSqlString(resolvedProjectPath)}'`);
debug(`Filtering by project_path: ${resolvedProjectPath}`);
}
}
Expand All @@ -532,72 +532,11 @@ export class ContentRetriever {

verboseLog(options, chalk.green(`Native hybrid search returned ${results.length} results`));

// OPTIMIZATION: Batch file existence checks for better performance
const resultsToCheck = [];
const projectMatchMap = new Map();

// First pass: collect files that need existence checking
for (let i = 0; i < results.length; i++) {
const result = results[i];

// Use project_path field if available (new schema)
if (result.project_path) {
projectMatchMap.set(i, result.project_path === resolvedProjectPath);
continue;
}

// Fallback for old embeddings without project_path field
if (!result.path && !result.original_document_path) {
projectMatchMap.set(i, false);
continue;
}

const filePath = result.original_document_path || result.path;
try {
// Check if this result belongs to the current project
// First try as absolute path
if (path.isAbsolute(filePath)) {
projectMatchMap.set(i, filePath.startsWith(resolvedProjectPath));
continue;
}

// For relative paths, check if the file actually exists in the project
const absolutePath = path.resolve(resolvedProjectPath, filePath);

// Verify the path is within project bounds
if (absolutePath.startsWith(resolvedProjectPath)) {
// Mark for batch existence check
resultsToCheck.push({ result, index: i, absolutePath });
} else {
projectMatchMap.set(i, false);
}
} catch (error) {
debug(`Error filtering result for project: ${error.message}`);
projectMatchMap.set(i, false);
}
}

// Batch check file existence for better performance
if (resultsToCheck.length > 0) {
debug(`[OPTIMIZATION] Batch checking existence of ${resultsToCheck.length} files`);
const existencePromises = resultsToCheck.map(async ({ result, index, absolutePath }) => {
try {
await fs.promises.access(absolutePath, fs.constants.F_OK);
return { index, exists: true };
} catch {
debug(`Filtering out non-existent file: ${result.original_document_path || result.path}`);
return { index, exists: false };
}
});

const existenceResults = await Promise.all(existencePromises);
for (const { index, exists } of existenceResults) {
projectMatchMap.set(index, exists);
}
}

// Filter results based on project match using the map
const projectFilteredResults = results.filter((result, index) => projectMatchMap.get(index) === true);
const projectFilteredResults = await this.filterResultsForProject(
results,
resolvedProjectPath,
(result) => result.original_document_path || result.path
);

verboseLog(options, chalk.blue(`Filtered to ${projectFilteredResults.length} results from current project`));

@@ -643,14 +582,11 @@ export class ContentRetriever {
try {
const fileTable = await this.database.getTable(FILE_EMBEDDINGS_TABLE);
if (fileTable) {
// Look for project-specific structure ID
const projectStructureId = `__project_structure__${path.basename(resolvedProjectPath)}`;
let structureResults = await fileTable.query().where(`id = '${projectStructureId}'`).limit(1).toArray();

// Fall back to generic project structure if project-specific one doesn't exist
if (structureResults.length === 0) {
structureResults = await fileTable.query().where("id = '__project_structure__'").limit(1).toArray();
}
const structureResults = await fileTable
.query()
.where(`project_path = '${escapeSqlString(resolvedProjectPath)}' AND type = 'directory-structure'`)
.limit(1)
.toArray();

if (structureResults.length > 0) {
const structureRecord = structureResults[0];