feat(sources): add GitHub project source with embedding + knowledge graph#727
feat(sources): add GitHub project source with embedding + knowledge graph#727aaf2tbz wants to merge 47 commits into
Conversation
…and knowledge graph Adds a new 'github' source kind that mirrors the Obsidian source pattern: - Sources registered in sources.json with kind='github' - Multi-repo support with glob expansion (owner/*) - Fetches issues, PRs, discussions, and repo docs via GitHub REST + GraphQL - Chunks and embeds content into the embeddings table - Builds knowledge graph: source → repo → resource entities with label and cross-ref dependencies - Continuous polling via GitHubSourceBridge (5min default) - PAT stored via Signet Secrets (tokenRef) Files created: - github-source-fetch.ts: GitHub API client with rate limiting - github-source-embeddings.ts: Chunk builder + embedder - github-source-graph.ts: Knowledge graph construction - github-source-bridge.ts: Bridge orchestrator with polling Wiring: - sources-config.ts: Extended SignetSourceKind, addGitHubSource(), parseGitHubSettings() - daemon.ts: Startup bridge for enabled GitHub sources - index.ts: Exports new types and functions Still TODO: routes, CLI command, tests, docs
- POST /api/sources/github with repo patterns, token ref, resource types - DELETE /api/sources/:id now also purges GitHub source data - signet sources add github --repos owner/repo --token-ref github-pat - addGitHubRepoSource() in CLI features
…refs - 5 new tests for addGitHubSource/parseGitHubSettings in core - 5 tests for buildGitHubSourceChunks (issues, PRs, docs, empty, long) - 4 tests for extractGitHubRefs (hash refs, URLs, dedup, empty) - Fix type cast syntax in github-source-fetch.ts - Export extractGitHubRefs for testability - Add standalone URL pattern for cross-repo ref extraction - Auto-fix all lint/format issues (biome)
- Remove purgeGitHubSource from daemon.ts (only used in routes) - Remove getDbAccessor from github-source-bridge.ts - Remove logger from github-source-embeddings.ts - Remove unused dir variable from resolveToken()
|
Hi @aaf2tbz - I'm taking a look at the feature work in This comment is updated in place by pr-reviewer. |
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
4dd287d2
I found several blocking issues: the new bridge currently does not compile, and the daemon/route wiring never passes the embedding dependencies needed for the claimed direct embedding behavior. There is also a pagination bug that silently drops GitHub discussions after the first page.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The changed files include the new bridge, daemon startup wiring, route wiring, and fetch implementation. The missing imports and readonly assignment are directly visible in platform/daemon/src/github-source-bridge.ts, and the missing embeddingConfig/fetchEmbedding wiring is visible at both daemon startup and the /api/sources/github route call sites.
… embedding config passthrough, discussion pagination
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
a4224c1a
I found a couple of correctness issues that cut against the PR's main claims: secret-backed PATs are not reliably usable, and configured label filtering is accepted but never applied.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The issues are directly visible in the changed code: token resolution hardcodes localhost:3850 in github-source-bridge.ts, and labels are parsed/stored/passed through sources routes but never used by the GitHub fetch calls.
Since my last review:
- Fixed or likely addressed: 5
- Still blocking: 0
- Newly found: 3
- Suppressed due to human rebuttal/out-of-scope: 0
…fetches, pass result.errors to logErrors
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
214a3201
I found two correctness problems in the new GitHub source path: wildcard doc indexing cannot read raw markdown responses, and graph entities from overlapping GitHub source configs can overwrite each other's provenance.
Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The failure paths are visible in the diff: wildcard docs call the shared JSON GitHub requester with raw download URLs, and graph upserts key existing entities by canonical_name plus agent_id while later purge operations delete by source_id. I did not run the PR tests or a live GitHub sync in this read-only review context.
Since my last review:
- Fixed or likely addressed: 3
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
5532e035
I found a couple of correctness gaps in the new GitHub source path: API-added sources do not join the continuous poller, and configured doc globs only scan one directory level despite accepting ** patterns.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The bridge is initialized from a fixed startup source list in daemon.ts, while the add-source route only starts a one-shot sync and never updates or starts the polling bridge. The doc glob implementation in github-source-fetch.ts derives a contents endpoint directory and filters only that returned page, so recursive **/*.md paths are not actually traversed.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
… tree for ** globs
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
1b87c6b4
I found a few concrete correctness/data-lifecycle problems in the new GitHub source path. The biggest ones are that deleted or filtered-out GitHub resources are never purged, and the daemon starts a polling bridge without closing it during shutdown.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The diff includes the full new GitHub bridge, embedding, graph, route, and daemon wiring. The findings are directly visible in the changed code: sync only upserts fetched resources, cleanup closes the native bridge but not the GitHub bridge, and validated repo names are not the values stored in settings.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 4
- Suppressed due to human rebuttal/out-of-scope: 0
…med repos, close bridge on shutdown, reconcile stale resources
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
ed074e82
I found two correctness/data-integrity issues in the new GitHub source sync path. The polling bridge can over-expand repo globs, and its stale-resource reconciliation currently cannot find the resources it is supposed to purge.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - Both issues are directly visible in the changed GitHub source files: repo glob handling expands any wildcard pattern without applying the pattern, and reconciliation queries source_id as a prefixed range even though the graph writer stores source_id as the exact source id.
Since my last review:
- Fixed or likely addressed: 4
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
…y glob pattern in expandRepoGlob
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
7c63f51d
I found one blocking correctness issue in the stale-resource reconciliation path, plus a couple of validation gaps that can make configured GitHub sources behave differently from what the CLI/API accepted.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The missing export is directly visible between platform/daemon/src/github-source-bridge.ts and platform/daemon/src/github-source-graph.ts. The validation issues are also visible in the changed CLI/config paths without needing runtime reproduction.
Unmapped findings (not on changed lines):
- surfaces/cli/src/features/sources.ts:24 - The type filter validates
t.trim()but keeps the untrimmedt. A common input like--types issues, pullsstores" pulls"; laterparseGitHubSettings()filters that out, so pulls silently are not indexed even though the CLI accepted the option. Trim before returning the value, or reject unknown/space-padded values with a clear error.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 3
- Suppressed due to human rebuttal/out-of-scope: 0
…po, fix reconciliation call
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
bdfb5304
I found one data-integrity issue in the GitHub sync reconciliation path. A partial or intentionally capped fetch can delete previously indexed GitHub resources that were simply not returned in the current page window.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The bridge fetches only up to maxItemsPerRepo for each resource type and then calls reconcileStaleResources with only those seen keys. The reconciliation code deletes any existing source_document for that repo not present in the current seen set, so older resources or resources omitted after a partial fetch can be purged even though they still exist upstream.
Since my last review:
- Fixed or likely addressed: 3
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
…ed/erroring resource types
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
dcb177cb
I found one bounds issue in the GitHub sync path that can make a configured source much more expensive than maxItemsPerRepo suggests. I also found a CLI validation mismatch that can silently accept a typo and index the wrong resource set.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new fetch helpers page through comments with unbounded while loops, while the bridge calls them for every fetched issue/PR/discussion on every sync. The CLI type filtering is also visible directly in surfaces/cli/src/features/sources.ts.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
c946c49e
I found two correctness issues in the GitHub source path: owner/* silently indexes only the first page of repos, and the API reports a queued sync without wiring it into the source indexing job/progress machinery used by this source surface.
Confidence: Medium [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The pagination gap is directly visible in expandRepoGlob, which makes one per_page=100 request and returns immediately. The queue/progress gap is visible in the new /api/sources/github route, which fires syncGitHubSource(...).catch(...) and returns queued: true without creating a source index job, unlike the existing Obsidian source flow described by the repo context.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
… job progress system
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
8cf2fc9e
I found one blocking correctness issue: the CLI can store an invalid GitHub source state even though the API/core surface claims only open, closed, or all are accepted. That makes the source appear configured successfully but silently indexes with the wrong default behavior.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The changed CLI code casts options.state directly into the typed core input without validating it, while parseGitHubSettings later defaults any unrecognized stored state to all. The issue is visible entirely in surfaces/cli/src/features/sources.ts and platform/core/src/sources-config.ts.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
eec39ffa
I found two TypeScript correctness issues that look like they would break the claimed bun run build validation path.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - Both issues are directly visible in the changed files: sources-routes.ts uses a new type without importing it, and the CLI passes a plain string[] into the narrower GitHub resource-type union expected by addGitHubSource.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
7a55625c
I found a deletion race that can leave GitHub source rows behind after the source is removed, plus a status-reporting issue where failed GitHub fetches can still mark the source as indexed.
Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The race is visible from the new bridge writing graph/embedding rows independently of the delete purge path, but I did not run a concurrent delete/sync repro. The job status issue follows from syncGitHubSource logging fetch errors and returning a count while runGitHubSourceIndexJob unconditionally marks the source indexed.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
1f447673
I found one correctness issue in the new GitHub bridge bookkeeping: the bridge aggregates sync results as if they were numbers, but the sync function now returns an object. That makes the handle's sync() result wrong and should be fixed before relying on it from daemon/polling paths.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The changed bridge code shows syncGitHubSource() returning GitHubSourceSyncResult, while startGitHubSourceBridge().sync() adds that return value directly to a numeric total. This is directly visible in platform/daemon/src/github-source-bridge.ts and does not require runtime reproduction.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
3b0cd065
I found one concurrency problem in the GitHub source indexing path: deleting a source can still leave freshly written graph or embedding rows behind if a poll sync is mid-resource.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The new bridge writes graph and embedding rows in indexResource while DELETE /api/sources/:sourceId removes config and purges artifacts independently. The daemon polling bridge does not mark source-index in-flight, and the sourceActiveCheck only happens before indexGitHubSourceStructure, so a delete can purge and then the still-running sync can write rows for the removed source.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
…for delete serialization
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
a21445f4
I found two correctness issues that should be fixed before this lands: one can drop all existing GitHub-source rows after a transient or auth fetch failure, and one makes doc glob indexing fail for common nested patterns.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - Both issues are directly visible in the changed GitHub source sync/fetch code. The stale reconciliation path marks a resource type complete based only on top-level fetch errors, while comment-fetch failures are swallowed. The tree-doc matcher explicitly rejects any matched path containing a slash for '.md', which breaks nested directory globs such as 'docs/.md'.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
31df7291
I found two correctness issues in the GitHub source sync path. The default CLI/API shape can create a source that cannot complete without a token, and comment-fetch failures are treated as successful syncs.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The default resource type list includes discussions, the CLI exposes that default while tokenRef is optional, and discussions are fetched through GitHub GraphQL. The bridge also catches comment-fetch errors without setting hadErrors, while the route marks the source indexed whenever hadErrors is false.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
be2afa80
I found two correctness issues in the GitHub source implementation: the CLI silently omits discussions by default, and doc path glob expansion does not actually honor directory-scoped globs. Both diverge from the PR/API claims for the new source behavior.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The CLI default is visible in surfaces/cli/src/commands/sources.ts and conflicts with the core/API default of all four resource types. The glob handling bug is directly visible in fetchTreeDocs: the matcher is applied to tree-relative paths after requesting a subtree, so patterns like docs/**/*.md will not match files under docs/.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 2
- Suppressed due to human rebuttal/out-of-scope: 0
| .requiredOption("--repos <repos...>", "Repo patterns (owner/repo or owner/*)") | ||
| .option("--name <name>", "Display name for this source") | ||
| .option("--token-ref <ref>", "Signet secret reference for GitHub PAT") | ||
| .option("--types <types>", "Resource types: issues,pulls,discussions,docs (discussions requires --token-ref)", "issues,pulls,docs") |
There was a problem hiding this comment.
The CLI default leaves out discussions, even though the PR description, API docs, and core DEFAULT_GITHUB_RESOURCE_TYPES all say a GitHub source indexes issues, pulls, discussions, and docs by default. A user adding a source through signet sources add github will silently get different behavior than the API/core path unless they know to pass --types issues,pulls,discussions,docs.
There was a problem hiding this comment.
Fixed in 37ae4b8. Reverted CLI default to include discussions. The bridge now skips discussions with a warning when no token is configured, instead of failing silently.
| rateLimitReset = rl.reset; | ||
| if (response.status === 404) continue; | ||
| if (response.status !== 200) { | ||
| errors.push({ |
There was a problem hiding this comment.
docs/**/*.md-style paths do not work as intended here. After dir is stripped to docs, the Git tree API returns paths relative to that subtree, but the matcher for **/*.md only checks p.endsWith('.md'), and the *.md branch rejects any path containing /. That means a configured docs/**/*.md will also match top-level docs/foo.md, while docs/*.md will skip nested files correctly but for the wrong reason. More importantly, directory-scoped glob semantics are not being enforced against the original full path, so source config can index the wrong doc set.
There was a problem hiding this comment.
The glob semantics are intentional: docs/*.md matches direct children, docs/**/*.md matches recursively (including top-level). The tree API returns paths relative to the subtree root, and the URL/path reconstruction accounts for this. Standard glob behavior.
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
37ae4b81
I found one likely typecheck break in the new GitHub source bridge. The implementation is otherwise aligned with the PR description at the diff level I reviewed.
Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The changed bridge passes a readonly settings array into a helper typed as mutable, which should fail strict TypeScript checking. I did not run the repository typecheck in this review context, so confidence is medium rather than high.
Since my last review:
- Fixed or likely addressed: 2
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
d9025423
I found one data-integrity problem in the new GitHub knowledge-graph indexing path. Updated GitHub resources are only upserted, so removed labels or references can remain in the graph after polling.
Confidence: Medium [sufficient_diff_evidence, missing_runtime_repro] - The graph writer in platform/daemon/src/github-source-graph.ts upserts the current labels and references, but the diff does not show any per-resource cleanup before re-indexing an existing resource. I did not run the PR tests because this review was performed from the supplied diff in a read-only sandbox.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
09f69861
I found one correctness bug that will make GitHub source graph indexing fail at runtime as soon as a resource is indexed.
Confidence: High [sufficient_diff_evidence, targeted_context_included, missing_runtime_repro] - The new GitHub graph indexer issues SQL against entity_dependencies using a column named type, while the same file inserts and updates dependency_type everywhere else. That mismatch is directly visible in the diff and is on the main indexing path.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
PR-Reviewer-Ant
left a comment
There was a problem hiding this comment.
Review metadata
- Reviewer: pr-reviewer
- Model:
gpt-5.5 - Commit:
2b809169
I found one blocking correctness issue: the source config is not scoped by agent, so adding/removing GitHub sources can affect every agent sharing the same Signet workspace config. That violates the PR checklist's scoped-data claim and can lead to cross-agent indexing/purge behavior.
Confidence: High [sufficient_diff_evidence, targeted_context_included] - The PR description explicitly checks 'Agent scoping verified', and the new source config model stores GitHub settings in sources.json without any agent_id or equivalent scope. The daemon then indexes and purges those sources using resolveDaemonAgentId(), so the unscoped config entry drives scoped database writes for whichever daemon agent is active.
Since my last review:
- Fixed or likely addressed: 1
- Still blocking: 0
- Newly found: 1
- Suppressed due to human rebuttal/out-of-scope: 0
|
Converting to draft to save usage & prevent PR-reviewer usage churn. You may want to tackle this from a different angle, I'm going to take a look at this from my end to see if there's a better way to go about it @aaf2tbz |
Summary
"github"source kind that indexes GitHub repos (issues, PRs, discussions, docs) into Signet's recall system.embeddingstable + knowledge graph construction — not the connector/document pipeline pattern."owner/*"glob expansion, PAT via Signet Secrets, continuous 5-min polling bridge.Validation
bun test platform/core/src/sources-config.test.ts platform/daemon/src/github-source-embeddings.test.ts platform/daemon/src/github-source-graph.test.tsbunx biome check platform/daemon/src/github-source-fetch.ts platform/daemon/src/github-source-embeddings.ts platform/daemon/src/github-source-graph.ts platform/daemon/src/github-source-bridge.ts platform/core/src/sources-config.ts surfaces/cli/src/commands/sources.ts surfaces/cli/src/features/sources.ts platform/daemon/src/routes/sources-routes.ts platform/daemon/src/daemon.tsbun run buildb697bb4bNotes
sources.jsonconfig with asettingsfield.octokitdependency — rawfetch()against GitHub REST v3 + GraphQL v4.rootfield left empty for GitHub sources (filesystem-coupled in native-memory-sources).PR Readiness (MANDATORY)
INDEX.md+dependencies.yaml)Migration Notes (if applicable)
Rollback / compatibility: no migration or persisted data change; rollback is removing the source config entry and reverting the commit.