Conversation
Co-authored-by: royisme <350731+royisme@users.noreply.github.com>
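@copilot Why is there a v0.2 thing? That version is only something we defined ourselves to organize the roadmap; it shouldn't have anything to do with the code.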
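Understood. v0.2 is indeed only a version label for roadmap planning and should not appear in code file names. I will remove all files with a v02/v0.2 suffix and integrate the functionality into the main codebase. Specific changes:
I'll make the changes now.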
- Remove all v02-suffixed files and backend/ directory structure
- Integrate repository ingestion, related search, and context pack APIs into api/routes.py
- Add new service modules to services/: code_ingestor, git_utils, ranker, pack_builder
- Extend graph_service with repo/file creation and fulltext search methods
- Keep Python 3.12 compatibility fix in pyproject.toml
- Three new API endpoints: POST /ingest/repo, GET /graph/related, GET /context/pack
Co-authored-by: royisme <350731+royisme@users.noreply.github.com>
@copilot Go ahead and make the changes, then. Also, there is no need to change the Python version; that is unnecessary. Just keep the previous 3.13.
Per feedback, keeping original Python 3.13 requirement Co-authored-by: royisme <350731+royisme@users.noreply.github.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
MATCH (f:File)
WHERE ($repo_id IS NULL OR f.repoId = $repo_id)
  AND (toLower(f.path) CONTAINS toLower($query_text)
       OR toLower(f.lang) CONTAINS toLower($query_text)
       OR ($query_text IN f.content AND f.content IS NOT NULL))
Replace illegal IN check in fulltext query
The full‑text search clause uses ($query_text IN f.content AND f.content IS NOT NULL) to match file contents. In Neo4j, IN expects the right operand to be a list, so when f.content is a string (as inserted by the ingestor) the query fails with a type mismatch rather than returning results. Both /graph/related and /context/pack depend on this method, so any repository with stored content will cause these endpoints to return 500 errors. Use toLower(f.content) CONTAINS toLower($query_text) or another string predicate instead.
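The suggested fix can be sketched as follows. This is a minimal illustration, assuming the query is kept as a Python string next to the service method; the constant name, the RETURN/LIMIT clause, and the helper `build_search_params` are hypothetical and not taken from the PR.

```python
# Sketch of the corrected predicate. The MATCH/WHERE shape mirrors the snippet
# under review; RETURN, LIMIT and all Python names here are assumptions.
FULLTEXT_QUERY = """
MATCH (f:File)
WHERE ($repo_id IS NULL OR f.repoId = $repo_id)
  AND (toLower(f.path) CONTAINS toLower($query_text)
       OR toLower(f.lang) CONTAINS toLower($query_text)
       OR (f.content IS NOT NULL
           AND toLower(f.content) CONTAINS toLower($query_text)))
RETURN f.path AS path, f.lang AS lang
LIMIT $limit
"""


def build_search_params(query_text, repo_id=None, limit=30):
    """Return the parameter map that FULLTEXT_QUERY expects."""
    return {"query_text": query_text, "repo_id": repo_id, "limit": limit}
```

The string predicate replaces the list-membership test, so a string-valued f.content no longer triggers a type mismatch.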
if target_dir is None:
    target_dir = tempfile.mkdtemp(prefix="repo_")

cmd = ["git", "clone", "--depth", "1", "-b", branch, repo_url, target_dir]
result = subprocess.run(
Avoid cloning into a pre‑created temp directory
The remote ingest path calls git_utils.clone_repo, which generates a target directory with tempfile.mkdtemp and then passes that directory to git clone. git clone expects the destination path not to exist, so cloning into an already-created directory fails with fatal: destination path … already exists and is not an empty directory, causing every remote repository ingestion to error out. Create the parent temp dir and let git clone create the repo subdirectory (e.g. using tempfile.mkdtemp for parent and os.path.join(parent, repo_name) or tempfile.mkdtemp with dir and removing before cloning).
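One way to follow this suggestion is sketched below: create a fresh parent directory and hand git a child path that does not exist yet. The function names `plan_clone` and `clone_repo` and the way the directory name is derived from the URL are assumptions for illustration. Git's documentation states that cloning into an existing directory is allowed only when it is empty, so this variant simply avoids depending on that case.

```python
import os
import subprocess
import tempfile


def plan_clone(repo_url, branch="main"):
    """Return (cmd, dest); dest is a not-yet-existing path inside a new temp dir."""
    parent = tempfile.mkdtemp(prefix="repo_")
    # Derive a directory name from the URL; fall back to "repo" if nothing is left.
    name = repo_url.rstrip("/").rsplit("/", 1)[-1].removesuffix(".git") or "repo"
    dest = os.path.join(parent, name)
    cmd = ["git", "clone", "--depth", "1", "-b", branch, repo_url, dest]
    return cmd, dest


def clone_repo(repo_url, branch="main"):
    """Run the planned clone; raises CalledProcessError if git fails."""
    cmd, dest = plan_clone(repo_url, branch)
    subprocess.run(cmd, check=True, capture_output=True, text=True)
    return dest
```

Splitting planning from execution keeps the path logic checkable without network access; the caller remains responsible for deleting the parent directory afterwards.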
Adds three core APIs for code knowledge management without LLM dependency. Provides repository ingestion into Neo4j, keyword-based file search, and budget-aware context pack generation with MCP-compatible ref:// handles.
Implementation
Integrated into Existing Structure
- api/routes.py
- services/: code_ingestor, git_utils, ranker, pack_builder
- services/graph_service.py with repo/file operations and fulltext search
API Endpoints
- POST /api/v1/ingest/repo - Scans repositories (local/git), detects languages, creates File/Repo nodes
- GET /api/v1/graph/related - Fulltext search with keyword ranking, returns ref:// handles
- GET /api/v1/context/pack - Builds context within token budget, prioritizes focus paths
Neo4j Schema
ref:// Handle Format
Compact MCP-compatible references for on-demand code fetching.
Usage
New Service Modules
Notes
Original prompt
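codebase-rag Upgrade and Iteration Roadmap (Implementation-Oriented)
Version Goals Overview
v0.2 (minimum viable): provide 3 minimal APIs (repository ingestion, related search, context pack) that need no LLM and always return "summary + reference handle", ready to inject into CoPal.
v0.3 (code graph): implement basic AST parsing for Python/TS and write to Neo4j: Repo/File/Symbol/IMPORTS/CALLS, supporting impact analysis.
v0.4 (hybrid retrieval and incremental ingestion): Related queries that combine vector search with graph traversal, incremental ingestion from Git diffs, Context Pack budget limits and deduplication.
v0.5 (MCP packaging / observability): expose MCP tools, one-command local startup (docker-compose), improved metrics and logging.
Terminology: a handle uniformly takes the form ref://file/#Lx-Ly or ref://symbol/. Prompts carry only "summary + handle"; details are pulled on demand via MCP (active-file/context7).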
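A file handle can be built and parsed with a few lines of Python. The sketch below assumes the concrete shape shown in the roadmap's own example, "ref://file/src/a/b.py#L1-L200"; the helper names `make_file_ref` and `parse_file_ref` are hypothetical, and symbol handles are left out because their full format is not spelled out here.

```python
import re

# Pattern for file handles such as "ref://file/src/a/b.py#L1-L200".
_FILE_REF = re.compile(r"^ref://file/(?P<path>.+)#L(?P<start>\d+)-L(?P<end>\d+)$")


def make_file_ref(path, start=1, end=200):
    """Build a file handle for a repo-relative path and a line range."""
    return f"ref://file/{path.lstrip('/')}#L{start}-L{end}"


def parse_file_ref(ref):
    """Split a file handle into (path, start_line, end_line)."""
    m = _FILE_REF.match(ref)
    if m is None:
        raise ValueError(f"not a file handle: {ref!r}")
    return m["path"], int(m["start"]), int(m["end"])
```

Keeping the line range in the fragment lets a consumer fetch only the referenced slice of the file.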
Directory and Module Layout (Target Structure)
backend/
  app/
    main.py
    config.py
    dependencies.py
    routers/
      ingest.py # POST /ingest/repo
      graph.py # GET /graph/related, GET /graph/impact
      context.py # GET /context/pack
    services/
      ingest/
        code_ingestor.py # code scanning & scheduling
        git_utils.py # clone/checkout, diff
      graph/
        neo4j_service.py # connection pool, read/write/index/query
        schema.cypher # constraints & indexes
        queries.py # predefined queries
      analysis/
        ast_python.py # v0.3
        ast_typescript.py # v0.3
      ranking/
        ranker.py # BM25/keywords; v0.4 adds hybrid vector ranking
      context/
        pack_builder.py # budget and deduplication strategy
    models/
      ingest_models.py
      graph_models.py
      context_models.py
  tests/
    test_ingest.py
    test_related.py
    test_context_pack.py
  scripts/
    neo4j_bootstrap.sh
    demo_curl.sh
v0.2 (Minimum Viable Version): Wire Up the Three Endpoints (to be finished this week)
File: services/graph/schema.cypher
// Repo
CREATE CONSTRAINT repo_key IF NOT EXISTS
FOR (r:Repo) REQUIRE (r.id) IS UNIQUE;
// File
CREATE CONSTRAINT file_key IF NOT EXISTS
FOR (f:File) REQUIRE (f.repoId, f.path) IS NODE KEY;
CREATE FULLTEXT INDEX file_text IF NOT EXISTS
FOR (f:File) ON EACH [f.path, f.lang];
// Symbol (used in v0.3; placeholder for now)
CREATE CONSTRAINT sym_key IF NOT EXISTS
FOR (s:Symbol) REQUIRE (s.id) IS UNIQUE;
Startup script: scripts/neo4j_bootstrap.sh applies the schema.
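The roadmap applies the schema from a shell script; if it were applied from Python instead, the file would first need to be split into individual statements. The sketch below shows one naive way to do that. The function name is hypothetical, and the splitter assumes no semicolons appear inside string literals, which holds for the schema above.

```python
def split_cypher_statements(text):
    """Split a schema script into statements, dropping // comment lines."""
    lines = [ln for ln in text.splitlines() if not ln.strip().startswith("//")]
    joined = "\n".join(lines)
    # Naive split: assumes ';' only ever terminates a statement.
    return [s.strip() for s in joined.split(";") if s.strip()]
```

Each returned statement could then be passed to a Neo4j session's run method one at a time, since the constraints and indexes use IF NOT EXISTS and are safe to re-apply.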
Route: POST /ingest/repo (routers/ingest.py)
Request (models/ingest_models.py):
class IngestRepoRequest(BaseModel):
    repo_url: Optional[str] = None # remote
    local_path: Optional[str] = None # local
    branch: Optional[str] = "main"
    include_globs: list[str] = ["**/*.py", "**/*.ts", "**/*.tsx"]
    exclude_globs: list[str] = ["**/node_modules/**", "**/.git/**"]
Response:
class IngestRepoResponse(BaseModel):
    task_id: str
    status: Literal["queued","running","done","error"]
Behavior:
v0.2: synchronously scan the matched files, then write (:Repo {id}) and (:File {repoId, path, lang, size, sha?})-[:IN_REPO]->(:Repo) to Neo4j.
Keep the task_id field to ease the switch to asynchronous processing in v0.4.
Acceptance:
curl -X POST /ingest/repo -d '{...}' returns queued/done, and the node count in Neo4j visibly grows.
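The synchronous scan step might look like the sketch below. It is a simplification under stated assumptions: glob patterns are reduced to file-name patterns plus a set of excluded directory names, and the extension-to-language table and the function name `scan_repo` are hypothetical. It only collects the File properties; writing them to Neo4j is left out.

```python
import fnmatch
import os

# Assumed extension-to-language mapping for the default include patterns.
LANG_BY_EXT = {".py": "python", ".ts": "typescript", ".tsx": "typescript"}


def scan_repo(root, include=("*.py", "*.ts", "*.tsx"),
              exclude_dirs=("node_modules", ".git")):
    """Walk root and return one record per matching file, sorted by path."""
    records = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in exclude_dirs]
        for name in filenames:
            if not any(fnmatch.fnmatch(name, pat) for pat in include):
                continue
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            ext = os.path.splitext(name)[1]
            records.append({
                "path": rel,
                "lang": LANG_BY_EXT.get(ext, "unknown"),
                "size": os.path.getsize(full),
            })
    return sorted(records, key=lambda r: r["path"])
```

Each record corresponds to the properties of one (:File) node; the repoId would be attached by the caller.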
Route: GET /graph/related?query=&repoId=&limit=30 (routers/graph.py)
Returns (models/graph_models.py):
class NodeSummary(BaseModel):
    type: Literal["file","symbol"] # v0.2: "file" only
    ref: str # e.g. "ref://file/src/a/b.py#L1-L200"
    path: Optional[str] = None
    lang: Optional[str] = None
    score: float
    summary: str # 1-2 lines: the file's role/purpose (v0.2 shorthand rule: directory name + file name + lang)
class RelatedResponse(BaseModel):
    nodes: list[NodeSummary]
Implementation:
v0.2: Neo4j FULLTEXT file_text plus naive keyword matching; score by how well the path matches; generate the summary by rule (for example "{lang} file {path}").
From v0.3 on, add Symbol, AST, and graph traversal; v0.4 adds hybrid vector retrieval.
Acceptance:
curl "/graph/related?query=auth token&repoId=xxx" returns the top 30 related files; each ref can be handed directly to CoPal/MCP.
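The naive keyword part of the v0.2 ranking could be sketched like this. The scoring rule (fraction of query keywords found in the path) is one plausible reading of "score by path match", not a confirmed design, and the function names are hypothetical. Plain dicts stand in for NodeSummary so the block has no dependencies.

```python
def score_path(path, query):
    """Fraction of whitespace-separated query keywords that occur in the path."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    lowered = path.lower()
    hits = sum(1 for t in terms if t in lowered)
    return hits / len(terms)


def rank_files(files, query, limit=30):
    """Return NodeSummary-shaped dicts for matching files, best score first."""
    nodes = []
    for f in files:
        score = score_path(f["path"], query)
        if score > 0:
            nodes.append({
                "type": "file",
                "ref": f"ref://file/{f['path']}#L1-L200",
                "path": f["path"],
                "lang": f["lang"],
                "score": score,
                # Rule-based summary from the roadmap: "{lang} file {path}".
                "summary": f"{f['lang']} file {f['path']}",
            })
    # Sort by descending score, then path, so the output is deterministic.
    nodes.sort(key=lambda n: (-n["score"], n["path"]))
    return nodes[:limit]
```

In the real service the candidate list would come from the file_text fulltext index rather than an in-memory list.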
Route: GET /context/pack?repoId=&stage=plan|review&budget=1500&keywords=...&focus=path1,path2 (routers/context.py)
Returns (models/context_models.py):
class ContextItem(BaseModel):
    kind: Literal["file","symbol","guideline"]
    title: str
    summary: str
    ref: str
    extra: dict | None = None
class ContextPack(BaseModel):
    items: list[ContextItem]
    budget_used: int
Implementation (v0.2):
Call graph/related first, using keywords and focus;
Select only the top N files and, for each file, emit:
title = last 2 segments of path
summary = "Likely {lang} module for {parent_dir}" (rule-based)
ref = ref://file/#L1-L200
budget_used is a simple estimate (by character count).
Acceptance:
curl "/context/pack?repoId=xxx&stage=plan&budget=1000&keywords=auth,token" returns JSON that stays within the budget and is suitable for pasting straight into a prompt.
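The v0.2 pack-building rules can be sketched as follows. This is an illustration under assumptions: the input is a list of node dicts with path, lang, score, and ref keys, focus paths are treated as prefixes that sort first, and the cost of an item is the character count of its title, summary, and ref. The function name `build_pack` is hypothetical.

```python
def build_pack(nodes, budget=1500, focus=()):
    """Build a ContextPack-shaped dict whose estimated size stays within budget."""
    # Focus paths first, then higher scores; sorted() is stable for ties.
    ordered = sorted(
        nodes,
        key=lambda n: (not any(n["path"].startswith(p) for p in focus), -n["score"]),
    )
    items, used = [], 0
    for n in ordered:
        parts = n["path"].split("/")
        parent = parts[-2] if len(parts) > 1 else "root"
        item = {
            "kind": "file",
            "title": "/".join(parts[-2:]),  # last 2 segments of the path
            "summary": f"Likely {n['lang']} module for {parent}",
            "ref": n["ref"],
        }
        # Simple budget estimate by character count.
        cost = len(item["title"]) + len(item["summary"]) + len(item["ref"])
        if used + cost > budget:
            break  # stop at the first item that would exceed the budget
        items.append(item)
        used += cost
    return {"items": items, "budget_used": used}
```

Stopping at the first overflow keeps the highest-priority items; skipping the oversized item and continuing would be an equally reasonable choice.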
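v0.3 (Code Graph): AST and Relationships (next week)
File: services/analysis/ast_python.py
Extract: Symbol{name, kind(function|class), signature, startLine, endLine}
Relationships: (:Symbol)-[:DEFINED_IN]->(:File), (:Symbol)-[:BELONGS_TO]->(:Symbol) (class members)
Lightweight CALLS: match func.id / attr on ast.Call nodes to form (:Symbol)-[:CALLS]->(:Symbol) (link the calls that can be resolved first; leave unresolved ones dangling)
TS/TSX: ast_typescript.py (ts-morph or tree-sitter would work; start with functions and exported names)
Scan Python import / from-import and TS import statements to form (:File)-[:IMPORTS]->(:File) (map each import to its target file or module name; keep a module property for imports that fail to resolve).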
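For the Python side, the symbol and lightweight CALLS extraction might be sketched with the standard ast module as below. The function name `extract_symbols` and the dotted naming of class members are assumptions; signatures, import scanning, and callee resolution are omitted, so the call list holds raw callee names only.

```python
import ast


def extract_symbols(source):
    """Return (symbols, calls) for module-level defs and class members."""
    tree = ast.parse(source)
    symbols, calls = [], []

    def visit(node, owner=None):
        for child in ast.iter_child_nodes(node):
            if not isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                continue
            kind = "class" if isinstance(child, ast.ClassDef) else "function"
            name = f"{owner}.{child.name}" if owner else child.name
            symbols.append({"name": name, "kind": kind,
                            "startLine": child.lineno, "endLine": child.end_lineno})
            if kind == "class":
                visit(child, owner=name)  # members belong to the class
                continue
            # Collect callee names: func.id for plain calls, attr for method calls.
            for sub in ast.walk(child):
                if isinstance(sub, ast.Call):
                    func = sub.func
                    callee = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
                    if callee:
                        calls.append((name, callee))

    visit(tree)
    return symbols, calls
```

Callee names that match no known symbol would be the "dangling" calls the roadmap mentions; definitions nested inside functions or conditionals are not recorded by this sketch.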
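Route: GET /graph/impact?repoId=&file=...&depth=2&limit=50
Returns: the list of files/symbols that reverse-depend on the given file/symbol (sorted by degree and fan-out, with ref and a short summary).
Acceptance:
Given the path of a modified core file, impact returns its upstream callers/importers.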
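The reverse-dependency lookup behind impact can be illustrated in memory with a breadth-first walk over inverted IMPORTS edges. This sketch is an assumption-laden stand-in for the eventual Cypher query: the `imports` mapping and function name are hypothetical, and results are ordered by distance from the target rather than by degree and fan-out.

```python
from collections import deque


def impact(imports, target, depth=2, limit=50):
    """Return [(file, distance)] for files that transitively import target.

    imports maps each importing file to the set of files it imports.
    """
    # Invert the edges: imported file -> set of importers.
    reverse = {}
    for src, dsts in imports.items():
        for dst in dsts:
            reverse.setdefault(dst, set()).add(src)

    seen, order = {target}, []
    queue = deque([(target, 0)])
    while queue and len(order) < limit:
        node, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand beyond the requested depth
        for dep in sorted(reverse.get(node, ())):
            if dep not in seen:
                seen.add(dep)
                order.append((dep, dist + 1))
                queue.append((dep, dist + 1))
    return order[:limit]
```

The seen set guards against import cycles, and the depth parameter mirrors the depth=2 default in the route.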
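v0.4 (Hybrid Retrieval and Incremental Ingestion)
Build a vector index in Neo4j (OpenAI/text-embedding-3-large or a local embedding model) and create embeddings for File.content/Symbol.signature.
ranking/ranker.py merges vector similarity (topK) + fulltext + simple keywords and learns a unified score.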
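A first cut of the score merge could normalize each signal and combine them linearly. Note the swap: the roadmap calls for a learned unified score, whereas this sketch uses fixed, hand-picked weights as a placeholder. The signal names, the weights, and the function name are all assumptions.

```python
def merge_rankings(candidates, weights=(0.6, 0.3, 0.1), top_k=10):
    """Combine per-file signals into one score and return the top_k (path, score).

    candidates maps path -> {"vector": ..., "fulltext": ..., "keyword": ...};
    missing signals count as 0.
    """
    signals = ("vector", "fulltext", "keyword")
    # Max-normalize each signal so differently scaled scores are comparable.
    maxima = {
        s: max((c.get(s, 0.0) for c in candidates.values()), default=0.0)
        for s in signals
    }
    scored = []
    for path, c in candidates.items():
        total = 0.0
        for s, w in zip(signals, weights):
            if maxima[s] > 0:
                total += w * c.get(s, 0.0) / maxima[s]
        scored.append((path, round(total, 6)))
    scored.sort(key=lambda x: (-x[1], x[0]))
    return scored[:top_k]
```

Replacing the fixed weights with coefficients fitted on labeled relevance data would be the step toward the learned score described above.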
services/ingest/git_utils.py: obtain diff --name-sta...