AI Dev Assistant is a repository-aware, retrieval-augmented assistant for understanding, navigating, and reasoning about large Python codebases.
Instead of treating your code as plain text, it indexes structure, builds semantic representations, and answers questions grounded in the actual source tree, with optional persistent conversational memory.
At the moment, the project is CLI-first and optimized for local developer workflows. Future iterations are expected to add a Python API and a browser-based / ChatGPT-like UI.
Large codebases are hard to reason about because:
- Structure is implicit (inheritance, factories, adapters, conventions)
- Relevant context is spread across many files
- Grep and IDE search are insufficient for “why” questions
- LLMs without retrieval hallucinate or miss important details
This project solves that by:
- Parsing code into structural chunks (modules, classes, functions, methods)
- Embedding only meaningful units (not noise)
- Performing semantic search over your repository
- Expanding context using inheritance and overviews
- Passing grounded context into an LLM
- Optionally remembering past conversation turns
Key features:
- ✅ Repository-scoped indexing (multiple repos supported)
- ✅ Deterministic, reproducible data pipeline
- ✅ Semantic search with FAISS
- ✅ Context expansion modes (DEBUGGING, DOCUMENTATION, FULL, …)
- ✅ Persistent conversation memory (SQLite)
- ✅ Dry-run mode for testing (no OpenAI required)
- ✅ Strong separation of concerns (tools / services / rag / infra)
- ✅ Fully testable pipeline with fixtures and golden data
Current limitations:
- CLI-only interface
- OpenAI is the only AI backend today
- Python-centric (no JS/TS yet)
- No live code editing or refactoring
- No browser UI (yet)
The system is built as a pipeline + query layer.
The indexing pipeline runs once per repository (or when the code changes):
Repository
↓
AST-based chunking
↓
chunks.json
↓
Embeddings
↓
embeddings.json
↓
FAISS index
Artifacts are stored under:
data/<repo_name>/
The last indexed repo is tracked via:
data/LAST_ACTIVE_REPO
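As a rough illustration, resolving the active workspace amounts to reading that marker file and joining it onto the data directory. This is a sketch only; the real resolution logic lives in the tools layer (defaults.py), and `resolve_active_workspace` is a hypothetical name.

```python
from pathlib import Path

# Illustrative sketch; not the project's actual code.
DATA_DIR = Path("data")

def resolve_active_workspace() -> Path:
    """Return the artifact directory for the last indexed repository."""
    marker = DATA_DIR / "LAST_ACTIVE_REPO"
    if not marker.exists():
        raise RuntimeError("No repository has been indexed yet")
    repo_name = marker.read_text().strip()  # assumed to hold the repo name
    return DATA_DIR / repo_name
```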
When you ask a question:
Question
↓
Semantic search (FAISS)
↓
Context expansion (mode-dependent)
↓
LLM reasoning
↓
Answer
If using conversational mode, memory is appended and optionally summarized.
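Conceptually, the query layer composes these stages roughly like this. This is a sketch, not the project's actual API; the function names are illustrative.

```python
# Conceptual sketch of the query flow; names are illustrative.
def answer_question(question: str, mode: str, state=None) -> str:
    hits = semantic_search(question, top_k=10)         # FAISS lookup over the index
    context = expand_context(hits, mode=mode)          # mode-dependent expansion
    memory = build_memory_context(state) if state else ""
    prompt = "\n\n".join(part for part in (memory, context, question) if part)
    answer = llm_complete(prompt)                       # grounded LLM reasoning
    if state is not None:                               # conversational mode only
        append_turn(state, "user", question)
        append_turn(state, "assistant", answer)
    return answer
```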
AI Dev Assistant does not have a single “personality”. Instead, it operates in explicit conversation modes.
A mode is a policy, not a prompt hack.
It defines:
- what kind of context is retrieved
- how much structure is expanded (inheritance, overviews)
- whether an LLM is used at all
- how answers should be framed
- what the user’s intent is assumed to be
Modes are:
- explicitly selected by the user
- stored per conversation
- stable identifiers (safe for DB, JSON, CLI flags)
This avoids hidden heuristics and makes behavior predictable.
Depending on the selected mode, the system changes:
- retrieval depth (how much related code is pulled in)
- context expansion (inheritance, project overview)
- prompt directives (what the LLM is told to focus on)
- answer style (locations vs explanations vs guidance)
- whether an LLM is used at all
Under the hood, each mode maps to a declarative ModePolicy.
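As a rough picture, a ModePolicy can be thought of as a small frozen dataclass. The fields below are assumptions for illustration, not the project's actual schema.

```python
from dataclasses import dataclass

# Illustrative sketch of what a declarative ModePolicy might carry.
# Field names are assumptions, not the real definition.
@dataclass(frozen=True)
class ModePolicy:
    name: str                 # stable identifier, e.g. "DEBUGGING"
    use_llm: bool             # search-style modes may skip the LLM entirely
    retrieval_depth: int      # how much related code is pulled in
    expand_inheritance: bool  # walk base classes / subclasses
    include_overview: bool    # inject the project overview
    prompt_directive: str     # what the LLM is told to focus on
```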
Purpose: Fast code location and discovery.
Behavior:
- Semantic search only
- No LLM calls
- Minimal context
- No explanations unless explicitly requested
Best for:
- “Where is X defined?”
- “What files mention Y?”
- Quick navigation in large repos
Purpose: Explain what the code does and how it is intended to be used.
Behavior:
- Uses retrieval + LLM
- Injects project overview
- Limited inheritance expansion
- Avoids deep implementation details
Best for:
- Understanding APIs
- Onboarding to unfamiliar code
- Generating human-style documentation
Purpose: Explain runtime behavior and failure modes.
Behavior:
- Uses retrieval + LLM
- Prefers full code over summaries
- Deeper inheritance expansion
- Focuses on edge cases and control flow
Best for:
- “Why does this crash?”
- “What happens if X is None?”
- Tracing unexpected behavior
Purpose: Help write or modify code.
Behavior:
- Uses retrieval + LLM
- Prefers full code context
- Concrete, implementation-oriented answers
- Avoids vague advice
Best for:
- “How should I implement X here?”
- “What pattern does this codebase use for Y?”
- Guided refactoring
Purpose: Explain system-level structure and design intent.
Behavior:
- Uses retrieval + LLM
- Strong inheritance expansion
- Injects project overview
- Focuses on relationships and data flow
Best for:
- Understanding how components fit together
- Explaining design decisions
- High-level system reasoning
Purpose: Balanced, general-purpose exploration.
Behavior:
- Uses retrieval + LLM
- Moderate context expansion
- Mix of overview and detail
- Less opinionated than other modes
Best for:
- Open-ended questions
- Initial exploration of a new area
- “Tell me about this part of the code”
Purpose: Maximum context, maximum detail.
Behavior:
- Uses retrieval + LLM
- Full code preferred
- Deep inheritance expansion
- Project overview injected
- Minimal filtering
Best for:
- Deep dives
- Complex reasoning
- When you explicitly want “everything”
Most AI tools try to infer intent.
This project intentionally does not.
Reasons:
- Inference is brittle
- Hidden behavior is hard to debug
- Developers prefer control
- Different tasks require different reasoning strategies
By making modes explicit:
- Behavior is predictable
- Answers are more consistent
- The system is easier to extend
- UIs can present clear choices
In the future, UIs may suggest modes — but the mode will always remain visible and overridable.
At a high level, the memory system is:
A bounded, summarizing conversation state that persists across runs and stays small enough to remain useful.
The assistant does not replay the full chat history to the LLM.
Instead, it maintains two layers of memory:
- A compact summary of older conversation
- A short window of recent turns
This gives you:
- continuity
- relevance
- bounded token usage
- long-running conversations without degradation
A conversation state consists of:
ConversationState
├── summary # Optional[str]
└── recent_turns # List[ConversationTurn]
Each turn is:
ConversationTurn
├── role: "user" | "assistant"
└── content: str
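In Python terms this maps onto two small dataclasses. The sketch below mirrors the structure above; the real definitions live in rag/memory.py and may differ in detail.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Sketch matching the structure above; not the actual rag/memory.py code.
@dataclass
class ConversationTurn:
    role: str       # "user" | "assistant"
    content: str

@dataclass
class ConversationState:
    summary: Optional[str] = None
    recent_turns: List[ConversationTurn] = field(default_factory=list)
```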
When a new conversation is created:
summary = None
recent_turns = []
No memory exists yet.
Each user query and assistant answer is appended via:
append_turn(state, role, content)

This is pure domain logic:
- no persistence
- no LLM calls
- no side effects
After a configurable number of turns (default: max_turns = 6):
needs_summarization(state) == True

This triggers summarization outside the memory module.
Important design decision:
rag.memory only decides when summarization is needed — it does not perform it.
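A simplified sketch of these two pure functions, assuming the default threshold of six turns (the real implementations live in rag/memory.py and may differ):

```python
MAX_TURNS = 6  # default threshold before summarization is requested

def append_turn(state: ConversationState, role: str, content: str) -> None:
    # Pure domain logic: no persistence, no LLM calls, no side effects
    # beyond mutating the in-memory state.
    state.recent_turns.append(ConversationTurn(role=role, content=content))

def needs_summarization(state: ConversationState, max_turns: int = MAX_TURNS) -> bool:
    # Only decides *when* summarization is needed; it never performs it.
    return len(state.recent_turns) > max_turns
```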
Another layer (service / app):
- Builds a summarization prompt using:
  - the existing summary (if any)
  - recent dialogue
- Calls the LLM
- Produces a new, compact summary
The prompt is constructed by:
build_summarization_prompt(summary, turns)

This ensures:
- technical tone
- factual preservation
- removal of redundancy
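A simplified sketch of what the prompt construction might look like; the actual wording used by build_summarization_prompt differs.

```python
def build_summarization_prompt(summary, turns):
    # Illustrative only: the real prompt wording is different.
    parts = [
        "Summarize the conversation below into a compact, technical summary.",
        "Preserve facts and decisions; remove redundancy.",
    ]
    if summary:
        parts.append(f"Existing summary:\n{summary}")
    dialogue = "\n".join(f"{t.role.capitalize()}: {t.content}" for t in turns)
    parts.append(f"Recent dialogue:\n{dialogue}")
    return "\n\n".join(parts)
```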
After summarization:
apply_summary(state, new_summary)

This:
- replaces the old summary
- truncates recent_turns to the last N turns (default: 2)
Result:
- long-term context is preserved
- short-term context stays sharp
- memory size stays bounded
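A sketch of apply_summary consistent with this behavior; the keep-last count of 2 comes from the default above, and the extra parameter is an assumption.

```python
def apply_summary(state: ConversationState, new_summary: str, keep_last: int = 2) -> None:
    # Replace the old summary and keep only the most recent turns,
    # so memory size stays bounded.
    state.summary = new_summary
    state.recent_turns = state.recent_turns[-keep_last:]
```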
Before answering a new question, memory is rendered as text via:
build_memory_context(state)

Output looks like:
Conversation summary:
<compact technical summary>
Recent conversation:
User: ...
Assistant: ...
This text is prepended to the LLM prompt.
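A sketch of the rendering step; the output format mirrors the example above, though the exact formatting in rag/memory.py may differ.

```python
def build_memory_context(state: ConversationState) -> str:
    # Render memory as plain text that is prepended to the LLM prompt.
    sections = []
    if state.summary:
        sections.append(f"Conversation summary:\n{state.summary}")
    if state.recent_turns:
        lines = [
            f"{'User' if t.role == 'user' else 'Assistant'}: {t.content}"
            for t in state.recent_turns
        ]
        sections.append("Recent conversation:\n" + "\n".join(lines))
    return "\n\n".join(sections)
```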
Persistence is handled outside rag.memory, in:
infra/memory_sqlite.py
Key properties:
- One row per conversation_id
- State stored as JSON
- Fully replace-on-update (simple & robust)
- No schema coupling to domain logic
This cleanly separates:
| Concern | Module |
|---|---|
| Memory semantics | rag.memory |
| Storage | infra.memory_sqlite |
| LLM summarization | services.memory_summary |
| Orchestration | app.ask_with_memory |
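For illustration, the replace-on-update pattern boils down to a single upsert per conversation. The sketch below is simplified and is not the actual infra/memory_sqlite.py code; the table and function names are assumptions.

```python
import json
import sqlite3

def save_state(db_path: str, conversation_id: str, state: dict) -> None:
    """Replace-on-update persistence: one row per conversation_id."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS conversations "
            "(conversation_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO conversations (conversation_id, state) VALUES (?, ?)",
            (conversation_id, json.dumps(state)),
        )
        conn.commit()
    finally:
        conn.close()
```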
✅ Bounded memory (no runaway prompts)
✅ Explicit summarization policy
✅ Testable without OpenAI
✅ Persistent across CLI runs
✅ Easy to swap storage backend
✅ Easy to add alternative summarizers
The data/ directory is a generated workspace (not source code):
data/
├── <repo_name>/
│ ├── chunks.json # structural code chunks
│ ├── embeddings.json # vector embeddings
│ ├── faiss.index # FAISS index
│ ├── faiss_meta.json
│ ├── memory.sqlite.db # conversation memory
│ └── chunks.preview.yaml # human-readable preview
└── LAST_ACTIVE_REPO
Safe to delete and regenerate via the pipeline.
The CLI scripts are thin wrappers around the core logic:
- index_repo.py — chunk a repository
- rebuild_embeddings.py — compute embeddings
- build_vector_store.py — build FAISS index
- export_yaml_preview.py — readable chunk preview
- init_data.py — run the full pipeline
- ask.py — one-shot question answering
- inspect_repo.py — search / inspect without LLM
- chat.py — interactive conversational CLI with memory
These are building blocks, not UI.
- index_repo.py
- rebuild_embeddings.py
- build_vector_store.py
- export_yaml_preview.py
- init_data.py
- defaults.py — workspace + repo resolution
- utils.py
Core “intelligence” layer.
- chunking.py — AST → structural chunks
- schema.py — CodeChunk definition
- semantic_search.py — FAISS querying
- embedding_pipeline.py — filtering + embedding logic
- embedding_policy.py — what gets embedded (and why)
- context.py — context expansion
- modes.py — DEBUGGING / DOCUMENTATION / FULL
- memory.py — in-memory conversation state
- overviews.py — high-level repo summaries
- cost.py — token & cost estimation
High-level, testable services.
- search.py — semantic search service
- context.py — context assembly
- explain.py — LLM explanation
- memory_summary.py — memory summarization
Application orchestration layer.
- ask.py — stateless Q&A
- ask_with_memory.py — conversational Q&A
Infrastructure layer (external integrations and configuration).
- openai_client.py — OpenAI client creation
- ai_client.py — abstraction for future providers
- embeddings.py — low-level embedding calls
- llm_reasoning.py — chat completions
- memory_sqlite.py — persistent memory
- config.py — env flags (DRY_RUN, models, etc.)
Install the package:

pip install -e .

Index a repository:

python -m ai_dev_assistant.cli.init_data --repo /path/to/repo

This runs:
- Chunking
- Embeddings
- FAISS index
- YAML preview
Ask a one-shot question:

python -m ai_dev_assistant.cli.ask "How does FmriprepAdapterFactory work?"

Start an interactive chat with memory:

python -m ai_dev_assistant.cli.chat

Features:
- Persistent memory per conversation
- Context-aware follow-up questions
- Automatic summarization
python -m ai_dev_assistant.cli.inspect_repo "adapter factory"

Useful when you want zero AI calls.
This project uses OpenAI models for:
- Embeddings (semantic search over your repository)
- LLM reasoning (explanations, debugging, documentation, summaries)
You must provide your own OpenAI API key to enable these features.
Export your key as an environment variable:
export OPENAI_API_KEY=sk-...

To make this permanent, add it to your shell config (~/.bashrc, ~/.zshrc, etc.).
The assistant will fail fast with a clear error message if the key is missing and an AI call is required.
Costs depend on:
- repository size
- number of chunks embedded
- how often you query
- selected conversation mode
Typical ballpark numbers:
- Small repo (few thousand lines): fractions of a cent
- Medium repo (tens of thousands of lines): a few cents
- Large repo: still typically well under $1
Embeddings are cached on disk and only need to be regenerated when the code changes.
- Most questions cost a few cents or less
- DEBUGGING / FULL modes are more expensive than SEARCH
- Conversational memory adds minimal overhead due to summarization
The project estimates token usage before embedding and prints the expected cost.
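As a rough illustration, such an estimate boils down to counting tokens and multiplying by a per-token price. The numbers and names below are placeholders, not the project's actual rates or API.

```python
def estimate_embedding_cost(chunks, price_per_1k_tokens: float = 0.0001) -> float:
    # Placeholder pricing: check the current rates for the embedding model you use.
    # A crude token estimate (~4 characters per token) is enough for a ballpark.
    total_tokens = sum(len(chunk.text) // 4 for chunk in chunks)
    return total_tokens / 1000 * price_per_1k_tokens
```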
For testing, development, or CI, you can disable all OpenAI calls:
export AI_DEV_ASSISTANT_DRY_RUN=1

In dry-run mode:
- No embeddings are generated
- No LLM calls are made
- Pipelines still run end-to-end
- Tests use precomputed golden data
This makes the project safe and cheap to develop and test locally.
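One way to picture the gating (a sketch only; the real flag handling lives in infra/config.py, and call_openai_embeddings is an illustrative name):

```python
import os

# Sketch of how AI calls might be gated; not the project's actual code.
DRY_RUN = os.getenv("AI_DEV_ASSISTANT_DRY_RUN") == "1"

def embed_texts(texts):
    if DRY_RUN:
        # No API call: return placeholder vectors (dimension chosen arbitrarily)
        # so the pipeline still runs end-to-end.
        return [[0.0] * 8 for _ in texts]
    return call_openai_embeddings(texts)  # illustrative name
```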
This project:
- does not proxy or store your API key
- does not send code unless explicitly embedding or querying
- never makes silent API calls
All AI usage is explicit, inspectable, and opt-in.
- No OpenAI required for tests
- AI_DEV_ASSISTANT_DRY_RUN=1 disables all AI calls
- Tests use:
  - isolated temp data workspaces
  - mini repositories
  - precomputed golden artifacts
This makes CI reliable and cheap.
- 🔌 Multiple AI backends (local / hosted)
- 🌐 Browser-based UI (ChatGPT-like)
- 🧠 Better cross-file reasoning
- 🧪 Smarter chunking for non-Python languages
- ✍️ Code navigation + refactoring suggestions
This project is intentionally:
- Explicit over magical
- Composable over monolithic
- Testable over clever
- Grounded over hallucinated
It treats LLMs as reasoning engines, not oracles.