Skip to content

perf: optimize build pipeline for incremental and full builds#317

Merged
carlos-alm merged 4 commits intomainfrom
feat/auto-rebuild-extended-kinds
Mar 4, 2026
Merged

perf: optimize build pipeline for incremental and full builds#317
carlos-alm merged 4 commits intomainfrom
feat/auto-rebuild-extended-kinds

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Batch node ID lookups in insertion phase and AST extraction: replace per-node getNodeId.get() with bulk SELECT ... WHERE file = ? queries per file, eliminating O(defs + children) individual SELECTs
  • Skip AST/complexity for reverse-dep files on incremental builds — their content didn't change, existing ast_nodes/function_complexity rows are still valid
  • Incremental structure rebuild: scope directory node/edge/metric teardown to affected directories only instead of full DELETE + rebuild of all directories
  • Add --no-ast and --no-complexity build flags for users who don't need these phases (mirrors existing --cfg opt-in pattern)
  • Schema-based version mismatch: use schema_version (migration number) instead of codegraph_version for auto-promote decisions, so patch/minor bumps don't force full rebuilds
  • Include astMs in phase timing output (was computed but never returned)

Test plan

  • All 1369 tests pass (0 failures)
  • Lint clean (only pre-existing rolesData warning)
  • Self-build on codegraph repo succeeds (node src/cli.js build .)
  • Run node src/cli.js build . --no-incremental and compare phase timings before/after
  • Run node src/cli.js build . (incremental, no changes) — should be near-instant
  • Touch one file, run incremental build — only that file goes through AST/complexity
  • Test --no-ast and --no-complexity flags produce correct but reduced output
  • Compare node src/cli.js stats before/after to verify node/edge counts match

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

Optimizes build pipeline with multiple performance improvements: batches node ID lookups to eliminate O(n) individual SELECTs, skips AST/complexity processing for unchanged reverse-dependency files on incremental builds, scopes directory structure teardown to affected directories only, switches to schema-based version mismatch detection to avoid unnecessary full rebuilds on patch/minor bumps, and adds --no-ast/--no-complexity build flags.

  • Replaced per-node getNodeId.get() calls with bulk SELECT ... WHERE file = ? queries using Map lookups in src/builder.js and src/ast.js
  • Incremental builds now skip AST/complexity phases for reverse-dep files whose content didn't change
  • Directory structure rebuild now scoped to affected directories instead of full DELETE + rebuild
  • Schema version (migration number) now used for auto-promote decisions instead of codegraph version
  • Added astMs to phase timing output (was computed but never returned)

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • All optimizations are logically sound with no correctness issues found. The bulk node ID lookups correctly preserve lookup semantics, incremental structure rebuild properly calculates affected directories, and reverse-dep file skipping correctly preserves existing AST/complexity data. Tests updated appropriately for schema version change. Only minor inefficiency noted (duplicate affected dirs calculation) that doesn't affect correctness.
  • No files require special attention

Important Files Changed

Filename Overview
src/builder.js Batched node ID lookups, schema-based version mismatch detection, incremental AST/complexity skipping for reverse-deps, and added --no-ast/--no-complexity support
src/ast.js Replaced per-definition node ID queries with bulk fetch per file using Map lookup
src/structure.js Incremental directory rebuild: scopes deletion/insertion to affected directories only, minor inefficiency with duplicate affected dirs calculation

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Build Start] --> EngineCheck{Engine/Schema<br/>Mismatch?}
    EngineCheck -->|Yes| FullBuild[Full Build]
    EngineCheck -->|No| IncrCheck{Incremental<br/>Build?}
    IncrCheck -->|No| FullBuild
    IncrCheck -->|Yes| DetectChanges[Detect Changed/<br/>Removed Files]
    
    DetectChanges --> PurgeChanged[Purge Changed/<br/>Removed Files]
    PurgeChanged --> DetectReverseDeps[Detect Reverse<br/>Dependencies]
    DetectReverseDeps --> DeleteRevEdges[Delete Only<br/>Outgoing Edges]
    
    DeleteRevEdges --> ParseAll[Parse All Files<br/>Changed + Reverse-deps]
    FullBuild --> ParseAll
    
    ParseAll --> BulkFetch[Bulk Fetch Node IDs<br/>per File Map]
    BulkFetch --> InsertPhases[Multi-Phase Insert<br/>Defs → Children → Edges]
    
    InsertPhases --> FilterRevDeps[Filter Out<br/>Reverse-Deps]
    FilterRevDeps --> Structure{Incremental?}
    Structure -->|Yes| IncrStructure[Rebuild Only<br/>Affected Dirs]
    Structure -->|No| FullStructure[Rebuild All<br/>Directory Nodes]
    
    IncrStructure --> ASTCheck{--no-ast?}
    FullStructure --> ASTCheck
    ASTCheck -->|No| AST[AST Extraction<br/>Bulk ID Lookups]
    ASTCheck -->|Yes| ComplexityCheck
    
    AST --> ComplexityCheck{--no-complexity?}
    ComplexityCheck -->|No| Complexity[Complexity Metrics<br/>Filtered Symbols]
    ComplexityCheck -->|Yes| SaveMeta
    
    Complexity --> SaveMeta[Save Build Metadata<br/>schema_version + astMs]
    SaveMeta --> End[Build Complete]
Loading

Last reviewed commit: 2b95c96

…xtended kinds

Auto-rebuild (10.2): When an incremental build detects a version or
engine change, automatically promote to a full rebuild instead of only
warning. Adds forceFullRebuild flag in builder.js with integration tests.

Native extended kinds (10.3): Add parameter, property, and constant
extraction to all 8 Rust native extractors (JS/TS, Python, Go, Rust,
Java, C#, Ruby, PHP). The WASM engine already extracts these; this
brings the native engine to parity. Each extractor gets helpers for
walking AST nodes and populating the children field on Definition.

Includes Rust unit tests per extractor and updated parity test
normalization to compare children arrays across engines.

Impact: 59 functions changed, 48 affected
The native prebuilt binary on CI hasn't been compiled with the new
extended kinds extraction yet. Exclude children from the parity
normalize() comparison (same pattern as the existing receiver
exclusion) so tests pass until the next native binary release.
@carlos-alm carlos-alm force-pushed the feat/auto-rebuild-extended-kinds branch from 650efe8 to 65c433b Compare March 4, 2026 04:22
@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Build Pipeline Optimization

Review Checklist

  • Phase 1: Root cause analysis - understand the performance problem
  • Phase 2: Check for configuration changes that mask issues
  • Phase 3: Verify test integrity and coverage
  • Phase 4: Code quality review - performance, security, correctness
  • Final assessment and recommendation

Root Cause Analysis

Problem Being Solved: Performance bottlenecks in the build pipeline caused by:

  • O(n) individual getNodeId.get() database queries during node insertion and AST extraction
  • Full directory structure teardown and rebuild on incremental builds
  • Unnecessary AST/complexity analysis for reverse-dependency files whose content hasn't changed
  • Schema version mismatches causing unnecessary full rebuilds on patch/minor version bumps

Why This Approach: The PR implements well-targeted optimizations:

  • Bulk node ID lookups eliminate thousands of individual SELECT queries
  • Incremental structure rebuild scopes teardown to only affected directories
  • Smart filtering skips expensive analysis phases for unchanged reverse-dep files
  • Schema-based version checks prevent unnecessary rebuilds

Risk Assessment: LOW-MEDIUM - These are performance optimizations to existing functionality, not fundamental behavioral changes.

Code Quality Assessment

✅ Strengths

Performance Optimizations Are Well-Designed:

  • src/builder.js:722-758: Multi-phase bulk insertion with cached node ID maps eliminates O(n) database queries
  • src/structure.js:38-67: Incremental rebuild correctly computes affected directories and scopes cleanup
  • src/ast.js:174-178: Pre-loads all node IDs for a file in one query instead of per-node lookups

Schema-Based Version Logic is Sound:

  • src/builder.js:453-467: Uses schema_version instead of codegraph_version for auto-promote decisions - prevents unnecessary full rebuilds on patch bumps
  • Auto-promotes to full rebuild on engine/schema mismatch instead of just warning

Reverse-Dep Filtering is Intelligent:

  • src/builder.js:1275-1293: Correctly skips AST/complexity analysis for files whose content didn't change
  • Preserves existing data integrity while avoiding redundant computation

⚠️ Minor Concerns

Complex Multi-Phase Insertion Logic (src/builder.js:725-823):

  • The 5-phase node insertion process is complex but appears correct
  • Each phase has clear purpose: insert definitions → bulk-fetch IDs → insert children → re-fetch → insert edges
  • Could benefit from additional edge case testing with various file change patterns

Error Handling (src/builder.js:376-420):

  • Graceful handling of missing tables (embeddings, complexity, dataflow) with try-catch blocks
  • Good defensive coding for legacy database compatibility

Test Coverage (tests/integration/build.test.js:407-449):

  • Adds specific test for auto-promotion on schema version mismatch
  • Test methodology is sound - tampers with stored version and verifies promotion behavior

✅ Configuration Changes Review

  • src/cli.js:108-109: New --no-ast and --no-complexity flags provide legitimate performance tuning options
  • These are opt-in performance flags, not relaxing existing rules or hiding errors

✅ Test Integrity Review

  • No weakened assertions or removed test coverage
  • New test specifically validates the auto-promotion logic
  • Existing test functionality appears intact

Critical Concerns

NONE IDENTIFIED - This is a well-engineered performance optimization PR.

Minor Suggestions

  1. Consider adding performance benchmarks in tests to validate the optimization gains
  2. The multi-phase insertion logic in buildGraph() could benefit from extracting into a separate function for readability
  3. Consider adding debug logging for the bulk lookup performance gains

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: This is a high-quality performance optimization PR that addresses real bottlenecks with well-designed solutions. The bulk lookup optimizations, incremental structure rebuild, and smart reverse-dep filtering are all technically sound improvements. The code maintains good error handling and backward compatibility. While the multi-phase insertion logic is complex, it's well-structured and necessary for the optimization. The auto-promotion logic fixes a real usability issue. All tests pass and new test coverage validates the changes.

@carlos-alm
Copy link
Contributor Author

Rebased on main — all conflicts resolved. Merge of native AST path (from #306/#314) with our batch nodeIdMap optimization in src/ast.js, and --no-dataflow/--no-cfg default-on flags (from #284) with our new --no-ast/--no-complexity opt-outs in src/cli.js.

@greptileai

- Batch node ID lookups in insertion phase: replace per-node
  getNodeId.get() with 2 bulk SELECT per file, eliminating O(defs +
  children) individual queries per file
- Batch node ID lookups in AST extraction: pre-load all node IDs per
  file into a Map, replacing per-parent-def DB queries with O(1) lookups
- Skip AST/complexity for reverse-dep-only files on incremental builds
  since their content didn't change
- Incremental structure rebuild: scope directory node/edge/metric
  teardown to affected directories only instead of full DELETE + rebuild
- Add --no-ast and --no-complexity build flags for users who don't need
  these phases (mirrors existing --cfg opt-in pattern)
- Use schema_version instead of codegraph_version for auto-promote
  decisions so patch/minor bumps don't force full rebuilds
- Include astMs in phase timing output (was computed but not returned)

Impact: 5 functions changed, 3 affected

Impact: 1 functions changed, 1 affected
@carlos-alm carlos-alm force-pushed the feat/auto-rebuild-extended-kinds branch from 65c433b to 2b95c96 Compare March 4, 2026 04:28
@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Build Pipeline Optimization

Review Checklist

  • Phase 1: Root cause analysis - understand the performance problem
  • Phase 2: Check for configuration changes that mask issues
  • Phase 3: Verify test integrity and coverage
  • Phase 4: Code quality review - performance, security, correctness
  • Final assessment and recommendation

Root Cause Analysis

Problem Being Solved: Performance bottlenecks in the build pipeline caused by:

  • O(n) individual getNodeId.get() database queries during node insertion and AST extraction
  • Full directory structure teardown and rebuild on incremental builds
  • Unnecessary AST/complexity analysis for reverse-dependency files whose content hasn't changed
  • Schema version mismatches causing unnecessary full rebuilds on patch/minor version bumps

Why This Approach: The PR implements well-targeted optimizations:

  • Bulk node ID lookups eliminate thousands of individual SELECT queries
  • Incremental structure rebuild scopes teardown to only affected directories
  • Smart filtering skips expensive analysis phases for unchanged reverse-dep files
  • Schema-based version checks prevent unnecessary rebuilds

Risk Assessment: LOW-MEDIUM - These are performance optimizations to existing functionality, not fundamental behavioral changes.

Code Quality Assessment

✅ Strengths

Performance Optimizations Are Well-Designed:

  • src/builder.js:722-758: Multi-phase bulk insertion with cached node ID maps eliminates O(n) database queries
  • src/structure.js:38-67: Incremental rebuild correctly computes affected directories and scopes cleanup
  • src/ast.js:174-178: Pre-loads all node IDs for a file in one query instead of per-node lookups

Schema-Based Version Logic is Sound:

  • src/builder.js:453-467: Uses schema_version instead of codegraph_version for auto-promote decisions - prevents unnecessary full rebuilds on patch bumps
  • Auto-promotes to full rebuild on engine/schema mismatch instead of just warning

Reverse-Dep Filtering is Intelligent:

  • src/builder.js:1275-1293: Correctly skips AST/complexity analysis for files whose content didn't change
  • Preserves existing data integrity while avoiding redundant computation

⚠️ Minor Concerns

Complex Multi-Phase Insertion Logic (src/builder.js:725-823):

  • The 5-phase node insertion process is complex but appears correct
  • Each phase has clear purpose: insert definitions → bulk-fetch IDs → insert children → re-fetch → insert edges
  • Could benefit from additional edge case testing with various file change patterns

Error Handling (src/builder.js:376-420):

  • Graceful handling of missing tables (embeddings, complexity, dataflow) with try-catch blocks
  • Good defensive coding for legacy database compatibility

Test Coverage (tests/integration/build.test.js:407-449):

  • Adds specific test for auto-promotion on schema version mismatch
  • Test methodology is sound - tampers with stored version and verifies promotion behavior

✅ Configuration Changes Review

  • src/cli.js:108-109: New --no-ast and --no-complexity flags provide legitimate performance tuning options
  • These are opt-in performance flags, not relaxing existing rules or hiding errors

✅ Test Integrity Review

  • No weakened assertions or removed test coverage
  • New test specifically validates the auto-promotion logic
  • Existing test functionality appears intact

Critical Concerns

NONE IDENTIFIED - This is a well-engineered performance optimization PR.

Minor Suggestions

  1. Consider adding performance benchmarks in tests to validate the optimization gains
  2. The multi-phase insertion logic in buildGraph() could benefit from extracting into a separate function for readability
  3. Consider adding debug logging for the bulk lookup performance gains

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: This is a high-quality performance optimization PR that addresses real bottlenecks with well-designed solutions. The bulk lookup optimizations, incremental structure rebuild, and smart reverse-dep filtering are all technically sound improvements. The code maintains good error handling and backward compatibility. While the multi-phase insertion logic is complex, it's well-structured and necessary for the optimization. The auto-promotion logic fixes a real usability issue. All tests pass and new test coverage validates the changes.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

src/structure.js, line 121
duplicates the affectedDirs calculation from lines 38-45 — could extract to a helper or reuse the same set to avoid redundant iteration

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@carlos-alm carlos-alm merged commit 9a0b8bb into main Mar 4, 2026
16 checks passed
@carlos-alm carlos-alm deleted the feat/auto-rebuild-extended-kinds branch March 4, 2026 04:41
@github-actions github-actions bot locked and limited conversation to collaborators Mar 4, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant