Codebase Indexer

A high-performance, local-first semantic code search engine written in Rust. It indexes your codebase into vector embeddings for RAG and semantic code search, all without sending code to the cloud.

Why

IDE-integrated code search (Cursor, Copilot) typically relies on cloud infrastructure. This project provides a fully offline alternative: parse, embed, and search your codebase locally with near-zero latency, ensuring complete code privacy.

Features

  • Local-First — All indexing and embedding run on your machine.
  • AST-Aware Chunking — tree-sitter parses code structure (functions, classes, modules).
  • Hybrid Search — Vector similarity + BM25 full-text search with Reciprocal Rank Fusion.
  • Multi-Stage Pipeline — Scanner → Batcher → Writer with multi-threaded scanning.
  • Pluggable Embedders — FastEmbed (default, local ONNX) or TEI (HTTP server).
  • Two-Stage Retrieval — Optional cross-encoder reranker for improved precision.
  • Content Deduplication — Hash-based dedup to skip re-indexing unchanged code.
  • Incremental Updates — File watcher mode for continuous indexing on code changes.
  • IDE Integration — LSP server (VS Code / Cursor extension) and MCP server.
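
The Hybrid Search feature above combines a vector ranking and a BM25 ranking with Reciprocal Rank Fusion (RRF). As an illustration of the general technique only, here is a minimal sketch: each result scores the sum of 1 / (k + rank) over the lists it appears in. The function name, the string chunk IDs, and the constant k = 60 (a commonly used value) are assumptions for this example and are not taken from the project's code.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.
/// Hypothetical helper: each ID scores the sum of 1 / (k + rank),
/// where rank starts at 1 within each list.
fn reciprocal_rank_fusion(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, id) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by ID so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling it with a vector ranking of a, b, c and a BM25 ranking of b, d, a puts b first, because b sits near the top of both lists, while c and d each appear in only one list and fall to the bottom.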

Quick Start

Prerequisites

  • Rust 1.88+
  • protoc — brew install protobuf (macOS) / apt install protobuf-compiler (Linux), or make setup ARGS="--all"

Build & Run

git clone https://github.com/MirageLyu/codebase-indexer.git
cd codebase-indexer

make setup          # download ONNX models & runtime (~50MB)
make build          # compile workspace
make index ROOT=./your-project
make search QUERY="parse date string"

CLI Reference

indexer-cli [OPTIONS] <COMMAND>

Commands:
  index     Index a codebase
  search    Semantic search over indexed code

Global Options:
  -r, --root <PATH>            Codebase root [default: .]
      --db-path <PATH>         LanceDB directory [default: ~/.codebase-index/lancedb]
      --metadata-path <PATH>   SQLite metadata [default: ~/.codebase-index/metadata.db]
      --force                  Force full re-index
      --embedder <BACKEND>     fastembed | tei [default: fastembed]
      --tei-url <URL>          TEI server URL [default: http://localhost:8080]

Index Options:
      --watch                  Watch for file changes after initial scan

Search Options:
  <QUERY>                      Search query string
  -l, --limit <N>              Max results [default: 5]
      --json                   Output as JSON
      --reranker <BACKEND>     Enable reranker (e.g. "local")
      --profile <PROFILE>      user (precision, default) | agent (broad recall)
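
The --force flag relates to the hash-based content deduplication listed under Features: unchanged code is normally skipped, and --force re-indexes it anyway. The sketch below shows one way such a check could work. It is an assumption-laden illustration, not the project's implementation: HashIndex is a hypothetical in-memory stand-in for the metadata store, and std's DefaultHasher is used only to keep the example dependency-free (its output is not guaranteed stable across Rust releases, so a persistent index would need a fixed hash function).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the metadata store: file path -> content hash.
struct HashIndex {
    seen: HashMap<String, u64>,
}

impl HashIndex {
    fn new() -> Self {
        Self { seen: HashMap::new() }
    }

    /// Returns true if the file should be (re-)indexed, and records its hash.
    /// With `force` set, the stored hash is ignored, as with a full re-index.
    fn needs_index(&mut self, path: &str, content: &str, force: bool) -> bool {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        let digest = hasher.finish();
        // Unchanged means the same path was seen before with the same hash.
        let unchanged = self.seen.get(path) == Some(&digest);
        self.seen.insert(path.to_string(), digest);
        force || !unchanged
    }
}
```

A first call for a path returns true, a second call with identical content returns false, and any content change or force = true returns true again.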

Supported Languages

Language      Extensions
Rust          .rs
Python        .py
JavaScript    .js, .jsx
TypeScript    .ts, .tsx
Go            .go

Configuration

All configuration is via environment variables or CLI flags. See .env.example for a template.

Variable              Description                         Default
RUST_LOG              Log level (error/warn/info/debug)
CODEBASE_INDEX_DIR    Base directory for index storage    ~/.codebase-index
CODEBASE_MODELS_DIR   Path to model files                 ./models (auto-resolved)
ORT_DYLIB_PATH        ONNX Runtime shared library path    ./libs/ (auto-resolved)
CODEBASE_ONNX_MODEL   Override ONNX model file            auto (qint8 per arch)
TEI_URL               TEI server URL                      http://localhost:8080
TEI_MAX_BATCH_SIZE    TEI batch size per request          32
PROTOC                Path to protoc binary               auto-detected
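
To illustrate how environment variables with documented defaults can be resolved, here is a small sketch for the two TEI settings in the table above. The variable names and default values come from the table; the TeiSettings struct, the function names, and the choice to fall back to the default when TEI_MAX_BATCH_SIZE is not a valid number are assumptions made for this example, and the project's own config code may behave differently.

```rust
use std::env;

/// Hypothetical settings struct for the TEI embedder backend.
#[derive(Debug, PartialEq)]
struct TeiSettings {
    url: String,
    max_batch_size: usize,
}

/// Pure helper: applies the documented defaults when a value is missing.
/// Kept separate from the environment lookup so it is easy to test.
fn resolve_tei(url: Option<String>, batch: Option<String>) -> TeiSettings {
    TeiSettings {
        url: url.unwrap_or_else(|| "http://localhost:8080".to_string()),
        // Assumption: an unparsable batch size falls back to the default of 32.
        max_batch_size: batch
            .and_then(|s| s.trim().parse::<usize>().ok())
            .unwrap_or(32),
    }
}

/// Reads TEI_URL and TEI_MAX_BATCH_SIZE from the process environment.
fn tei_from_env() -> TeiSettings {
    resolve_tei(env::var("TEI_URL").ok(), env::var("TEI_MAX_BATCH_SIZE").ok())
}
```

With neither variable set, tei_from_env() yields http://localhost:8080 and a batch size of 32, matching the defaults in the table.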

VS Code / Cursor Extension & LSP

The extension communicates with indexer-lsp over stdio. Prerequisites: make setup (models & ONNX Runtime) and Node.js / npm.

Quick Start (Recommended)

make ext-dev

This compiles indexer-lsp in release mode, bundles the extension with webpack, and opens an Extension Development Host window.

Step-by-Step

make ext-build    # build LSP binary + bundle extension

# Launch dev host manually:
export PATH="$(pwd)/target/release:$PATH"
cursor --extensionDevelopmentPath="$(pwd)/vscode-codebase-indexer" "$(pwd)"

Debugging with F5

Open vscode-codebase-indexer/ as the workspace root in VS Code / Cursor, run npm install, then use Run and Debug → Run Extension → F5. The launch config will build indexer-lsp, bundle the extension, and open an Extension Development Host.

Standalone LSP

RUST_LOG=debug cargo run --release -p indexer-lsp

Custom Binary Path

Set codebaseIndexer.serverPath in VS Code settings to an absolute path (e.g. .../target/release/indexer-lsp or .../target/debug/indexer-lsp).

Project Structure

codebase-indexer/
├── src/
│   ├── bins/
│   │   ├── indexer-cli/         # CLI binary (clap)
│   │   ├── indexer-lsp/         # LSP server (tower-lsp)
│   │   └── indexer-mcp/         # MCP server (rmcp)
│   └── crates/
│       ├── core/                # Domain types & port traits
│       ├── ingestion/           # Scanner, Parser, Embedders, Reranker, Watcher
│       ├── storage/             # LanceDB (vectors) + SQLite (metadata) + Tantivy (BM25)
│       ├── search/              # Hybrid search engine (vector + BM25 + rerank)
│       └── common/              # Shared config, errors, telemetry
├── vscode-codebase-indexer/     # VS Code / Cursor extension (TypeScript)
├── evaluation/                  # Retrieval quality evaluation scripts & datasets
├── docs/                        # Architecture docs & guides
├── scripts/                     # Setup & dev scripts
├── Cargo.toml                   # Workspace manifest
└── Makefile                     # Build, index, search, eval shortcuts

Development

make build       # compile workspace
make test        # run tests
make lint        # clippy (strict)
make tei-mock    # start mock TEI server

Evaluation

pip install huggingface_hub
python3 evaluation/prepare_cosqa_plus.py          # download CoSQA+ dataset (~35MB)
make eval DATASET=cosqa_plus_small EMBEDDER=fastembed

See evaluation/EVAL_FLOW.md (English) for the full evaluation pipeline.

Documentation

Document                    Chinese   English
Architecture Design V1      zh        en
Indexing Architecture V2    zh        en
Model Selection             zh        en
TEI Installation            zh        en
Performance Benchmark       zh        en
Code Search Benchmarks      zh        en
IDE Indexing Research       zh        en
BM25 Hybrid Search          zh        en
Index Size Analysis         zh        en
Chunk Strategy              zh        en
Evaluation Flow             zh        en

Tech Stack

Component          Technology
Language           Rust
CLI                clap
AST Parsing        tree-sitter
Embedding          FastEmbed / ONNX Runtime
Vector Store       LanceDB
Full-Text Search   Tantivy (BM25)
Metadata Store     SQLite (sqlx)
Async Runtime      Tokio
File Watching      notify

License

MIT
