# codebase-indexer

A high-performance, local-first semantic code search engine built in Rust. It indexes your codebase into vector embeddings for RAG and semantic code search, all without sending code to the cloud.

IDE-integrated code search (Cursor, Copilot) typically relies on cloud infrastructure. This project provides a fully offline alternative: parse, embed, and search your codebase locally with near-zero latency, ensuring complete code privacy.
## Features

- Local-First — All indexing and embedding run on your machine.
- AST-Aware Chunking — tree-sitter parses code structure (functions, classes, modules).
- Hybrid Search — Vector similarity + BM25 full-text search with Reciprocal Rank Fusion.
- Multi-Stage Pipeline — Scanner → Batcher → Writer with multi-threaded scanning.
- Pluggable Embedders — FastEmbed (default, local ONNX) or TEI (HTTP server).
- Two-Stage Retrieval — Optional cross-encoder reranker for improved precision.
- Content Deduplication — Hash-based dedup to skip re-indexing unchanged code.
- Incremental Updates — File watcher mode for continuous indexing on code changes.
- IDE Integration — LSP server (VS Code / Cursor extension) and MCP server.
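The Scanner → Batcher → Writer pipeline above can be sketched with standard channels. This is an illustrative stand-in (the real stages parse, embed, and persist chunks to storage), not the project's actual code:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-ins for the real pipeline stages: the scanner emits
// file paths, the batcher groups them, and the writer persists each batch.
fn run_pipeline(files: Vec<String>, batch_size: usize) -> Vec<Vec<String>> {
    let (scan_tx, scan_rx) = mpsc::channel::<String>();
    let (batch_tx, batch_rx) = mpsc::channel::<Vec<String>>();

    // Scanner stage: walks the file list on its own thread.
    let scanner = thread::spawn(move || {
        for f in files {
            scan_tx.send(f).unwrap();
        } // dropping scan_tx closes the channel
    });

    // Batcher stage: groups incoming paths into fixed-size batches.
    let batcher = thread::spawn(move || {
        let mut batch = Vec::new();
        for f in scan_rx {
            batch.push(f);
            if batch.len() == batch_size {
                batch_tx.send(std::mem::take(&mut batch)).unwrap();
            }
        }
        if !batch.is_empty() {
            batch_tx.send(batch).unwrap();
        }
    });

    // Writer stage: here we just collect batches instead of writing to storage.
    let written: Vec<Vec<String>> = batch_rx.iter().collect();
    scanner.join().unwrap();
    batcher.join().unwrap();
    written
}

fn main() {
    let files: Vec<String> = (0..5).map(|i| format!("src/file{i}.rs")).collect();
    let batches = run_pipeline(files, 2);
    println!("{batches:?}"); // 3 batches of sizes 2, 2, 1
}
```

Each stage owns its channel sender, so dropping it signals end-of-stream to the next stage without any extra coordination.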
## Prerequisites

- Rust 1.88+
- protoc — `brew install protobuf` (macOS) / `apt install protobuf-compiler` (Linux), or `make setup ARGS="--all"`
## Quick Start

```bash
git clone https://github.com/MirageLyu/codebase-indexer.git
cd codebase-indexer
make setup    # download ONNX models & runtime (~50MB)
make build    # compile workspace
make index ROOT=./your-project
make search QUERY="parse date string"
```

## CLI Reference

```text
indexer-cli [OPTIONS] <COMMAND>

Commands:
  index     Index a codebase
  search    Semantic search over indexed code

Global Options:
  -r, --root <PATH>         Codebase root [default: .]
  --db-path <PATH>          LanceDB directory [default: ~/.codebase-index/lancedb]
  --metadata-path <PATH>    SQLite metadata [default: ~/.codebase-index/metadata.db]
  --force                   Force full re-index
  --embedder <BACKEND>      fastembed | tei [default: fastembed]
  --tei-url <URL>           TEI server URL [default: http://localhost:8080]

Index Options:
  --watch                   Watch for file changes after initial scan

Search Options:
  <QUERY>                   Search query string
  -l, --limit <N>           Max results [default: 5]
  --json                    Output as JSON
  --reranker <BACKEND>      Enable reranker (e.g. "local")
  --profile <PROFILE>       user (precision, default) | agent (broad recall)
```
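The hybrid search behind these commands fuses a vector ranking and a BM25 ranking with Reciprocal Rank Fusion: each list contributes `1 / (k + rank)` to a document's fused score. A minimal sketch (the constant `k = 60` is the conventional choice, assumed here rather than taken from this project):

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion over several ranked result lists.
// Ranks are 1-based, so the top hit in a list scores 1 / (k + 1).
fn rrf_fuse(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in rankings {
        for (rank, doc) in list.iter().enumerate() {
            *scores.entry(doc.to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    // Sort by fused score, best first.
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let vector_hits = vec!["parse_date", "format_date", "parse_time"];
    let bm25_hits = vec!["parse_time", "parse_date", "to_rfc3339"];
    let fused = rrf_fuse(&[vector_hits, bm25_hits], 60.0);
    // "parse_date" wins: it appears near the top of both lists.
    println!("{fused:?}");
}
```

RRF needs no score normalization across the two retrievers, which is why it is a popular fusion choice when mixing cosine similarities with BM25 scores.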
## Supported Languages

| Language | Extensions |
|---|---|
| Rust | .rs |
| Python | .py |
| JavaScript | .js, .jsx |
| TypeScript | .ts, .tsx |
| Go | .go |
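Dispatch from file extension to language might look like the following sketch; `language_for` is a hypothetical helper, and the real routing lives in the ingestion crate:

```rust
// Map a file path to its language via the extension table above.
// Files with no supported extension are skipped by the indexer.
fn language_for(path: &str) -> Option<&'static str> {
    let ext = path.rsplit('.').next()?;
    match ext {
        "rs" => Some("Rust"),
        "py" => Some("Python"),
        "js" | "jsx" => Some("JavaScript"),
        "ts" | "tsx" => Some("TypeScript"),
        "go" => Some("Go"),
        _ => None,
    }
}

fn main() {
    println!("{:?}", language_for("src/main.rs")); // Some("Rust")
    println!("{:?}", language_for("README.md"));   // None
}
```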
## Configuration

All configuration is via environment variables or CLI flags. See `.env.example` for a template.
| Variable | Description | Default |
|---|---|---|
| `RUST_LOG` | Log level (error/warn/info/debug) | — |
| `CODEBASE_INDEX_DIR` | Base directory for index storage | `~/.codebase-index` |
| `CODEBASE_MODELS_DIR` | Path to model files | `./models` (auto-resolved) |
| `ORT_DYLIB_PATH` | ONNX Runtime shared library path | `./libs/` (auto-resolved) |
| `CODEBASE_ONNX_MODEL` | Override ONNX model file | auto (qint8 per arch) |
| `TEI_URL` | TEI server URL | `http://localhost:8080` |
| `TEI_MAX_BATCH_SIZE` | TEI batch size per request | 32 |
| `PROTOC` | Path to `protoc` binary | auto-detected |
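Resolving a variable such as `CODEBASE_INDEX_DIR` against its documented default can be sketched like this; `resolve_index_dir` is a hypothetical helper, not the project's API:

```rust
use std::env;
use std::path::PathBuf;

// Pure resolution logic, kept separate from env access for testability:
// an explicit override wins, otherwise fall back to ~/.codebase-index.
fn resolve_index_dir(override_val: Option<&str>, home: &str) -> PathBuf {
    match override_val {
        Some(v) => PathBuf::from(v),
        None => PathBuf::from(home).join(".codebase-index"),
    }
}

// Thin wrapper reading the actual environment.
fn index_dir() -> PathBuf {
    let home = env::var("HOME").unwrap_or_else(|_| ".".into());
    resolve_index_dir(env::var("CODEBASE_INDEX_DIR").ok().as_deref(), &home)
}

fn main() {
    println!("index dir: {}", index_dir().display());
}
```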
## IDE Extension (VS Code / Cursor)

The extension communicates with `indexer-lsp` over stdio. Prerequisites: `make setup` (models & ONNX Runtime) and Node.js / npm.
```bash
make ext-dev
```

This compiles `indexer-lsp` in release mode, bundles the extension with webpack, and opens an Extension Development Host window.

```bash
make ext-build   # build LSP binary + bundle extension

# Launch dev host manually:
export PATH="$(pwd)/target/release:$PATH"
cursor --extensionDevelopmentPath="$(pwd)/vscode-codebase-indexer" "$(pwd)"
```

Alternatively, open `vscode-codebase-indexer/` as the workspace root in VS Code / Cursor, run `npm install`, then use Run and Debug → Run Extension → F5. The launch config builds `indexer-lsp`, bundles the extension, and opens an Extension Development Host.

To run the LSP server standalone:

```bash
RUST_LOG=debug cargo run --release -p indexer-lsp
```

Set `codebaseIndexer.serverPath` in VS Code settings to an absolute path (e.g. `.../target/release/indexer-lsp` or `.../target/debug/indexer-lsp`).
## Project Structure

```text
codebase-indexer/
├── src/
│   ├── bins/
│   │   ├── indexer-cli/    # CLI binary (clap)
│   │   ├── indexer-lsp/    # LSP server (tower-lsp)
│   │   └── indexer-mcp/    # MCP server (rmcp)
│   └── crates/
│       ├── core/           # Domain types & port traits
│       ├── ingestion/      # Scanner, Parser, Embedders, Reranker, Watcher
│       ├── storage/        # LanceDB (vectors) + SQLite (metadata) + Tantivy (BM25)
│       ├── search/         # Hybrid search engine (vector + BM25 + rerank)
│       └── common/         # Shared config, errors, telemetry
├── vscode-codebase-indexer/ # VS Code / Cursor extension (TypeScript)
├── evaluation/             # Retrieval quality evaluation scripts & datasets
├── docs/                   # Architecture docs & guides
├── scripts/                # Setup & dev scripts
├── Cargo.toml              # Workspace manifest
└── Makefile                # Build, index, search, eval shortcuts
```
## Development

```bash
make build      # compile workspace
make test       # run tests
make lint       # clippy (strict)
make tei-mock   # start mock TEI server
```

## Evaluation

```bash
pip install huggingface_hub
python3 evaluation/prepare_cosqa_plus.py   # download CoSQA+ dataset (~35MB)
make eval DATASET=cosqa_plus_small EMBEDDER=fastembed
```

See `evaluation/EVAL_FLOW.md` (English) for the full evaluation pipeline.
## Documentation

| Document | Chinese | English |
|---|---|---|
| Architecture Design V1 | zh | en |
| Indexing Architecture V2 | zh | en |
| Model Selection | zh | en |
| TEI Installation | zh | en |
| Performance Benchmark | zh | en |
| Code Search Benchmarks | zh | en |
| IDE Indexing Research | zh | en |
| BM25 Hybrid Search | zh | en |
| Index Size Analysis | zh | en |
| Chunk Strategy | zh | en |
| Evaluation Flow | zh | en |
## Tech Stack

| Component | Technology |
|---|---|
| Language | Rust |
| CLI | clap |
| AST Parsing | tree-sitter |
| Embedding | FastEmbed / ONNX Runtime |
| Vector Store | LanceDB |
| Full-Text Search | Tantivy (BM25) |
| Metadata Store | SQLite (sqlx) |
| Async Runtime | Tokio |
| File Watching | notify |
## License

MIT