Skip to content

feat: extend dataflow analysis to all supported languages#318

Merged
carlos-alm merged 5 commits intomainfrom
feat/dataflow-multi-language
Mar 4, 2026
Merged

feat: extend dataflow analysis to all supported languages#318
carlos-alm merged 5 commits intomainfrom
feat/dataflow-multi-language

Conversation

@carlos-alm
Copy link
Contributor

Summary

  • Extends dataflow analysis from JS/TS-only to all supported languages: Python, Go, Rust, Java, C#, PHP, Ruby
  • Follows the established CFG/Complexity pattern: DATAFLOW_DEFAULTS + makeDataflowRules() factory + per-language rule objects + DATAFLOW_RULES Map + DATAFLOW_EXTENSIONS Set
  • Refactors extractDataflow() to accept langId and use rules throughout; all AST helpers (functionName, extractParams, resolveCalleeName, memberReceiver) parameterized by rules
  • Refactors buildDataflowEdges() to use DATAFLOW_EXTENSIONS instead of hardcoded extension checks
  • Language-specific handling: Go expression_list unwrapping, C# direct-child initializer scanning, Java combined call+member mutations, PHP extra identifier types and argument wrappers
  • 7 new per-language test files (57 total tests across 8 languages), zero regressions

Test plan

  • npx vitest run tests/parsers/dataflow- — all 57 tests pass across 8 language files
  • npx vitest run tests/integration/dataflow.test.js — integration tests pass
  • npm test — full suite green (1404 passed)
  • node src/cli.js build . --dataflow — builds without errors, 259 edges inserted
  • node src/cli.js dataflow buildGraph -T — shows dataflow edges correctly

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

Extended dataflow analysis from JavaScript/TypeScript to all 8 supported languages (Python, Go, Rust, Java, C#, PHP, Ruby). Implemented a comprehensive rule-based system following the CFG/Complexity pattern with DATAFLOW_DEFAULTS, makeDataflowRules() factory, and per-language rule objects. Refactored core functions (extractDataflow, functionName, extractParams, resolveCalleeName, memberReceiver) to accept langId and use language-specific rules throughout.

Key changes:

  • Added 7 new language-specific dataflow rule objects with custom AST node mappings
  • Handled language-specific edge cases: Go expression_list unwrapping, C# equals_value_clause initializers, Java combined call+member mutations, PHP variable_name identifiers
  • Built DATAFLOW_EXTENSIONS set from LANGUAGE_REGISTRY to replace hardcoded extension checks
  • Added 57 tests across 8 language files with consistent coverage of parameters, returns, assignments, argFlows, and mutations

Issue found:

  • collectIdentifiers function hardcodes 'identifier' type check, missing PHP's 'variable_name' and other language-specific identifier types, causing incomplete referencedNames in return statements

Confidence Score: 4/5

  • Safe to merge with one non-critical logic bug that affects completeness but not correctness
  • The implementation is well-structured, follows established patterns, and includes comprehensive test coverage. The identified bug in collectIdentifiers affects data completeness for PHP (and potentially other languages) but doesn't cause crashes or incorrect edge creation - it only results in incomplete referencedNames arrays. All 1404 tests pass. The fix is straightforward.
  • src/dataflow.js requires fixing the collectIdentifiers function to handle language-specific identifier types

Important Files Changed

Filename Overview
src/dataflow.js Extended dataflow analysis from JS/TS to all 8 languages with comprehensive rule system; one logic bug in collectIdentifiers missing language-specific identifier types
tests/parsers/dataflow-python.test.js Well-structured test suite for Python dataflow extraction covering parameters, returns, assignments, argFlows, and mutations
tests/parsers/dataflow-go.test.js Comprehensive Go dataflow tests including multi-name parameters and short var declarations
tests/parsers/dataflow-java.test.js Java dataflow tests covering method invocations, parameters, and collection mutations
tests/parsers/dataflow-csharp.test.js C# dataflow test suite with proper coverage of PascalCase method names and var declarations
tests/parsers/dataflow-php.test.js PHP dataflow tests for variable syntax with $ prefix; missing referencedNames assertion in returns test

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[extractDataflow called with langId] --> GetRules[Lookup rules from DATAFLOW_RULES Map]
    GetRules --> CheckRules{Rules exist for language?}
    CheckRules -->|No| ReturnEmpty[Return empty arrays]
    CheckRules -->|Yes| InitState[Initialize parameters, returns, assignments, argFlows, mutations arrays]
    InitState --> Traverse[Walk AST with visit function]
    Traverse --> CheckNode{Node type?}
    CheckNode -->|Function node| EnterScope[enterScope: extract function name & params using rules]
    EnterScope --> VisitChildren1[Visit children]
    VisitChildren1 --> ExitScope[exitScope]
    CheckNode -->|Return node| CollectRefs[collectIdentifiers for referencedNames]
    CollectRefs --> RecordReturn[Push to returns array]
    CheckNode -->|Variable declarator| HandleVar[handleVarDeclarator: extract assignments from calls]
    HandleVar --> VisitChildren2[Visit children]
    CheckNode -->|Call node| HandleCall[handleCallExpr: track argFlows]
    HandleCall --> VisitChildren3[Visit children]
    CheckNode -->|Assignment node| HandleAssign[handleAssignment: detect mutations & captures]
    HandleAssign --> VisitChildren4[Visit children]
    CheckNode -->|Expression statement| CheckMutation[handleExprStmtMutation: detect mutating methods]
    CheckMutation --> VisitChildren5[Visit children]
    CheckNode -->|Other| DefaultVisit[Visit all children]
    ExitScope --> MoreNodes{More nodes?}
    RecordReturn --> MoreNodes
    VisitChildren2 --> MoreNodes
    VisitChildren3 --> MoreNodes
    VisitChildren4 --> MoreNodes
    VisitChildren5 --> MoreNodes
    DefaultVisit --> MoreNodes
    MoreNodes -->|Yes| CheckNode
    MoreNodes -->|No| ReturnData[Return extracted dataflow data]
    ReturnEmpty --> End[End]
    ReturnData --> End
Loading

Last reviewed commit: 40a2b01

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 4, 2026

Additional Comments (1)

src/dataflow.js, line 989
collectIdentifiers only checks for 'identifier' type, missing language-specific identifier nodes (e.g., PHP's 'variable_name'). This causes incomplete referencedNames in return statements for non-JS/TS languages. Should use isIdent(node.type, rules) instead.

function collectIdentifiers(node, out, rules) {
  if (isIdent(node.type, rules)) {
    out.push(node.text);
    return;
  }
  for (const child of node.namedChildren) {
    collectIdentifiers(child, out, rules);
  }
}

And update call site at line 914:

if (expr) collectIdentifiers(expr, referencedNames, rules);

Dataflow extraction was limited to JS/TS/TSX. This adds rules-based
support for Python, Go, Rust, Java, C#, PHP, and Ruby following the
established CFG/Complexity pattern.

- DATAFLOW_DEFAULTS + makeDataflowRules() validation factory
- Per-language rule objects mapping AST node types for functions,
  calls, returns, parameters, member access, and mutations
- DATAFLOW_RULES Map + DATAFLOW_EXTENSIONS Set from LANGUAGE_REGISTRY
- extractDataflow() accepts langId, all helpers use rules
- buildDataflowEdges() uses DATAFLOW_EXTENSIONS instead of hardcoded
  extension checks, resolves langId from symbols._langId or extToLang
- Language-specific handling: Go expression_list unwrapping, C# direct
  child initializer scanning, Java combined call+member mutations,
  PHP extra identifier types and argument wrappers
- 7 new per-language test files (57 total tests across 8 languages)

Impact: 25 functions changed, 14 affected
@carlos-alm carlos-alm force-pushed the feat/dataflow-multi-language branch from 40a2b01 to 851f060 Compare March 4, 2026 04:30
@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Dataflow Multi-Language Support

Reviewing this pull request that extends dataflow analysis to all supported languages.

Root Cause Analysis

Problem Being Solved: Dataflow analysis was previously limited to JavaScript/TypeScript only. This PR extends it to all 8 supported languages (Python, Go, Rust, Java, C#, PHP, Ruby), enabling complete dataflow tracking across multi-language codebases.

Why This Approach: The implementation follows the established pattern used in CFG/Complexity modules - using a DATAFLOW_DEFAULTS base + makeDataflowRules() factory + per-language rule objects + centralized DATAFLOW_RULES Map. This is architecturally sound and consistent with existing patterns.

Risk Assessment: Low risk - this is a feature extension following established patterns, not a fundamental architectural change.

Critical Concerns

1. Language-Specific Identifier Bug (Line 981-988) 🐛

The collectIdentifiers function hardcodes checking for 'identifier' type, missing language-specific identifier nodes:

function collectIdentifiers(node, out) {
  if (node.type === 'identifier') {  // ❌ Missing PHP's 'variable_name', etc.
    out.push(node.text);
    return;
  }
  for (const child of node.namedChildren) {
    collectIdentifiers(child, out);
  }
}

Impact: This causes incomplete referencedNames arrays in return statements for non-JS/TS languages. PHP variables like $x won't be captured, affecting dataflow completeness.

Evidence: PHP tests at line 36 don't assert referencedNames like other language tests do, suggesting this issue exists but tests don't validate it.

Fix Required: Use the existing isIdent(nodeType, rules) helper and pass rules parameter:

function collectIdentifiers(node, out, rules) {
  if (isIdent(node.type, rules)) {
    out.push(node.text);
    return;
  }
  // ... rest unchanged
}

2. Missing Test Coverage for PHP referencedNames

PHP dataflow tests don't validate referencedNames extraction in return statements, unlike all other languages. This masks the identifier bug above.

Configuration Changes ✅

No risky configuration changes detected:

  • No biome/linting rules modified
  • No test timeouts or coverage thresholds lowered
  • No build configuration changes
  • README legitimately updated to document new multi-language support

Test Integrity ✅

Strong test coverage: 57 tests across 8 language files covering parameters, returns, assignments, argFlows, and mutations. Tests are well-structured and follow consistent patterns. The missing PHP referencedNames test is a gap but doesn't weaken existing assertions.

Code Quality

Strengths:

  • Excellent adherence to established architecture patterns
  • Comprehensive language-specific handling for edge cases (Go expression_list, C# equals_value_clause, Java combined call+member, PHP identifier types)
  • Clean separation of concerns with rule-based configuration
  • Good use of DATAFLOW_EXTENSIONS Set derived from LANGUAGE_REGISTRY

Minor Issues:

  • Function could benefit from more error handling in edge cases
  • Some complex conditional logic could be simplified

Final Recommendation

  • Rating: ⭐⭐⭐☆☆ (3/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: This is a well-architected feature extension that follows established patterns and includes comprehensive test coverage. The identifier bug affects data completeness but not correctness - all tests pass and functionality works. The fix is straightforward and non-breaking.

Conditions for Approval:

  1. Fix collectIdentifiers function to use isIdent(node.type, rules) instead of hardcoded 'identifier' check
  2. Add referencedNames assertions to PHP tests to match other languages
  3. Both fixes can be done in a follow-up PR if time-sensitive

Merge Safety: Safe to merge as-is since the bug only affects completeness of referencedNames arrays, not core dataflow edge creation. All 1404 tests pass.


@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Dataflow Multi-Language Support

Reviewing PR #318 that extends dataflow analysis from JS/TS-only to all supported languages.

Review Progress:

  • Phase 1: Root Cause Analysis - Understand problem and approach justification
  • Phase 2: Configuration Change Detection - Check for high-risk config changes
  • Phase 3: Test Integrity - Verify test quality and coverage
  • Phase 4: Code Quality - Analyze implementation and identify issues
  • Final Assessment: Provide rating and recommendation

Root Cause Analysis

Problem Being Solved: Dataflow analysis was previously limited to JavaScript/TypeScript only. This PR extends it to all 8 supported languages (Python, Go, Rust, Java, C#, PHP, Ruby), enabling complete dataflow tracking across multi-language codebases.

Why This Approach: The implementation follows the established pattern used in CFG/Complexity modules - using a DATAFLOW_DEFAULTS base + makeDataflowRules() factory + per-language rule objects + centralized DATAFLOW_RULES Map. This is architecturally sound and consistent with existing patterns.

Risk Assessment: Low risk - this is a feature extension following established patterns, not a fundamental architectural change.

Critical Concerns

1. Language-Specific Identifier Bug (Line 981-988) 🐛

The collectIdentifiers function hardcodes checking for 'identifier' type, missing language-specific identifier nodes:

function collectIdentifiers(node, out) {
  if (node.type === 'identifier') {  // ❌ Missing PHP's 'variable_name', etc.
    out.push(node.text);
    return;
  }
  for (const child of node.namedChildren) {
    collectIdentifiers(child, out);
  }
}

Impact: This causes incomplete referencedNames arrays in return statements for non-JS/TS languages. PHP variables like $x won't be captured, affecting dataflow completeness.

Evidence: PHP tests at line 36 don't assert referencedNames like other language tests do, suggesting this issue exists but tests don't validate it.

Fix Required: Use the existing isIdent(nodeType, rules) helper and pass rules parameter:

function collectIdentifiers(node, out, rules) {
  if (isIdent(node.type, rules)) {
    out.push(node.text);
    return;
  }
  // ... rest unchanged
}

2. Missing Test Coverage for PHP referencedNames

PHP dataflow tests don't validate referencedNames extraction in return statements, unlike all other languages. This masks the identifier bug above.

Configuration Changes ✅

No risky configuration changes detected:

  • No biome/linting rules modified
  • No test timeouts or coverage thresholds lowered
  • No build configuration changes
  • README legitimately updated to document new multi-language support

Test Integrity ✅

Strong test coverage: 57 tests across 8 language files covering parameters, returns, assignments, argFlows, and mutations. Tests are well-structured and follow consistent patterns. The missing PHP referencedNames test is a gap but doesn't weaken existing assertions.

Code Quality

Strengths:

  • Excellent adherence to established architecture patterns
  • Comprehensive language-specific handling for edge cases (Go expression_list, C# equals_value_clause, Java combined call+member, PHP identifier types)
  • Clean separation of concerns with rule-based configuration
  • Good use of DATAFLOW_EXTENSIONS Set derived from LANGUAGE_REGISTRY

Minor Issues:

  • Function could benefit from more error handling in edge cases
  • Some complex conditional logic could be simplified

Final Recommendation

  • Rating: ⭐⭐⭐☆☆ (3/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: This is a well-architected feature extension that follows established patterns and includes comprehensive test coverage. The identifier bug affects data completeness but not correctness - all tests pass and functionality works. The fix is straightforward and non-breaking.

Conditions for Approval:

  1. Fix collectIdentifiers function to use isIdent(node.type, rules) instead of hardcoded 'identifier' check
  2. Add referencedNames assertions to PHP tests to match other languages
  3. Both fixes can be done in a follow-up PR if time-sensitive

Merge Safety: Safe to merge as-is since the bug only affects completeness of referencedNames arrays, not core dataflow edge creation. All 1404 tests pass.


@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Dataflow Multi-Language Support

Reviewing PR #318 that extends dataflow analysis from JS/TS-only to all supported languages.

Review Progress:

  • Phase 1: Root Cause Analysis - Understand problem and approach justification
  • Phase 2: Configuration Change Detection - Check for high-risk config changes
  • Phase 3: Test Integrity - Verify test quality and coverage
  • Phase 4: Code Quality - Analyze implementation and identify issues
  • Final Assessment: Provide rating and recommendation

Root Cause Analysis

Problem Being Solved: Dataflow analysis was previously limited to JavaScript/TypeScript only. This PR extends it to all 8 supported languages (Python, Go, Rust, Java, C#, PHP, Ruby), enabling complete dataflow tracking across multi-language codebases.

Why This Approach: The implementation follows the established pattern used in CFG/Complexity modules - using a DATAFLOW_DEFAULTS base + makeDataflowRules() factory + per-language rule objects + centralized DATAFLOW_RULES Map. This is architecturally sound and consistent with existing patterns.

Risk Assessment: Low risk - this is a feature extension following established patterns, not a fundamental architectural change.

Critical Concerns

1. Language-Specific Identifier Bug (Line 981-988) 🐛

The collectIdentifiers function hardcodes checking for 'identifier' type, missing language-specific identifier nodes:

function collectIdentifiers(node, out) {
  if (node.type === 'identifier') {  // ❌ Missing PHP's 'variable_name', etc.
    out.push(node.text);
    return;
  }
  for (const child of node.namedChildren) {
    collectIdentifiers(child, out);
  }
}

Impact: This causes incomplete referencedNames arrays in return statements for non-JS/TS languages. PHP variables like $x won't be captured, affecting dataflow completeness.

Evidence: PHP tests at line 36 don't assert referencedNames like other language tests do, suggesting this issue exists but tests don't validate it.

Fix Required: Use the existing isIdent(nodeType, rules) helper and pass rules parameter:

function collectIdentifiers(node, out, rules) {
  if (isIdent(node.type, rules)) {
    out.push(node.text);
    return;
  }
  // ... rest unchanged
}

2. Missing Test Coverage for PHP referencedNames

PHP dataflow tests don't validate referencedNames extraction in return statements, unlike all other languages. This masks the identifier bug above.

Configuration Changes ✅

No risky configuration changes detected:

  • No biome/linting rules modified
  • No test timeouts or coverage thresholds lowered
  • No build configuration changes
  • README legitimately updated to document new multi-language support

Test Integrity ✅

Strong test coverage: 57 tests across 8 language files covering parameters, returns, assignments, argFlows, and mutations. Tests are well-structured and follow consistent patterns. The missing PHP referencedNames test is a gap but doesn't weaken existing assertions.

Code Quality

Strengths:

  • Excellent adherence to established architecture patterns
  • Comprehensive language-specific handling for edge cases (Go expression_list, C# equals_value_clause, Java combined call+member, PHP identifier types)
  • Clean separation of concerns with rule-based configuration
  • Good use of DATAFLOW_EXTENSIONS Set derived from LANGUAGE_REGISTRY

Minor Issues:

  • Function could benefit from more error handling in edge cases
  • Some complex conditional logic could be simplified

Final Recommendation

  • Rating: ⭐⭐⭐☆☆ (3/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: This is a well-architected feature extension that follows established patterns and includes comprehensive test coverage. The identifier bug affects data completeness but not correctness - all tests pass and functionality works. The fix is straightforward and non-breaking.

Conditions for Approval:

  1. Fix collectIdentifiers function to use isIdent(node.type, rules) instead of hardcoded 'identifier' check
  2. Add referencedNames assertions to PHP tests to match other languages
  3. Both fixes can be done in a follow-up PR if time-sensitive

Merge Safety: Safe to merge as-is since the bug only affects completeness of referencedNames arrays, not core dataflow edge creation. All 1404 tests pass.


@claude
Copy link

claude bot commented Mar 4, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Dataflow Multi-Language Support

Reviewing PR #318 that extends dataflow analysis from JS/TS-only to all supported languages.

Review Progress:

  • Phase 1: Root Cause Analysis - Understand problem and approach justification
  • Phase 2: Configuration Change Detection - Check for high-risk config changes
  • Phase 3: Test Integrity - Verify test quality and coverage
  • Phase 4: Code Quality - Analyze implementation and identify issues
  • Final Assessment: Provide rating and recommendation

Root Cause Analysis

Problem Being Solved: Dataflow analysis was previously limited to JavaScript/TypeScript only. This PR extends it to all 8 supported languages (Python, Go, Rust, Java, C#, PHP, Ruby), enabling complete dataflow tracking across multi-language codebases.

Why This Approach: The implementation follows the established pattern used in CFG/Complexity modules - using a DATAFLOW_DEFAULTS base + makeDataflowRules() factory + per-language rule objects + centralized DATAFLOW_RULES Map. This is architecturally sound and consistent with existing patterns.

Risk Assessment: Low risk - this is a feature extension following established patterns, not a fundamental architectural change.

Critical Concerns

1. Language-Specific Identifier Bug (Line 981-988) 🐛

The collectIdentifiers function hardcodes checking for 'identifier' type, missing language-specific identifier nodes:

function collectIdentifiers(node, out) {
  if (node.type === 'identifier') {  // ❌ Missing PHP's 'variable_name', etc.
    out.push(node.text);
    return;
  }
  for (const child of node.namedChildren) {
    collectIdentifiers(child, out);
  }
}

Impact: This causes incomplete referencedNames arrays in return statements for non-JS/TS languages. PHP variables like $x won't be captured, affecting dataflow completeness.

Evidence: PHP tests at line 36 don't assert referencedNames like other language tests do, suggesting this issue exists but tests don't validate it.

Fix Required: Use the existing isIdent(nodeType, rules) helper and pass rules parameter:

function collectIdentifiers(node, out, rules) {
  if (isIdent(node.type, rules)) {
    out.push(node.text);
    return;
  }
  // ... rest unchanged
}

2. Missing Test Coverage for PHP referencedNames

PHP dataflow tests don't validate referencedNames extraction in return statements, unlike all other languages. This masks the identifier bug above.

Configuration Changes ✅

No risky configuration changes detected:

  • No biome/linting rules modified
  • No test timeouts or coverage thresholds lowered
  • No build configuration changes
  • README legitimately updated to document new multi-language support

Test Integrity ✅

Strong test coverage: 57 tests across 8 language files covering parameters, returns, assignments, argFlows, and mutations. Tests are well-structured and follow consistent patterns. The missing PHP referencedNames test is a gap but doesn't weaken existing assertions.

Code Quality

Strengths:

  • Excellent adherence to established architecture patterns
  • Comprehensive language-specific handling for edge cases (Go expression_list, C# equals_value_clause, Java combined call+member, PHP identifier types)
  • Clean separation of concerns with rule-based configuration
  • Good use of DATAFLOW_EXTENSIONS Set derived from LANGUAGE_REGISTRY

Minor Issues:

  • Function could benefit from more error handling in edge cases
  • Some complex conditional logic could be simplified

Final Recommendation

  • Rating: ⭐⭐⭐☆☆ (3/5)
  • Action: APPROVE WITH CONDITIONS
  • Reasoning: This is a well-architected feature extension that follows established patterns and includes comprehensive test coverage. The identifier bug affects data completeness but not correctness - all tests pass and functionality works. The fix is straightforward and non-breaking.

Conditions for Approval:

  1. Fix collectIdentifiers function to use isIdent(node.type, rules) instead of hardcoded 'identifier' check
  2. Add referencedNames assertions to PHP tests to match other languages
  3. Both fixes can be done in a follow-up PR if time-sensitive

Merge Safety: Safe to merge as-is since the bug only affects completeness of referencedNames arrays, not core dataflow edge creation. All 1404 tests pass.


@carlos-alm carlos-alm merged commit 03b6edb into main Mar 4, 2026
16 checks passed
@carlos-alm carlos-alm deleted the feat/dataflow-multi-language branch March 4, 2026 05:34
@github-actions github-actions bot locked and limited conversation to collaborators Mar 4, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant