perf: extract AST nodes for all languages in native engine#340
perf: extract AST nodes for all languages in native engine#340carlos-alm merged 4 commits intomainfrom
Conversation
Add walk_ast_nodes_with_config() to helpers.rs with per-language configs for node type mappings (new, throw, await, string, regex). Each non-JS extractor now calls this during extract(), producing astNodes for Python, Go, Rust, Java, C#, Ruby, and PHP. On the JS side, buildAstNodes() now checks symbols.astNodes first (all languages) before falling back to WASM tree walk (JS/TS/TSX only). This eliminates the WASM dependency for AST extraction when using the native engine. Expected impact: astMs drops from ~651ms to ~50ms (DB inserts only) for native builds once the binary is compiled. Impact: 14 functions changed, 11 affected
|
Claude finished @carlos-alm's task —— View job PR Review: AST Nodes Native ExtractionI'm conducting a thorough review of this performance optimization PR following the structured methodology: Review Progress
Phase 1: Root Cause Analysis ✅Problem Being Solved: The current system has a "double-parse" performance bottleneck where AST nodes (new, throw, await, string, regex) are extracted twice:
This results in ~651ms spent in the Why This Approach: The solution eliminates duplication by:
Justification: This is a legitimate performance optimization addressing real inefficiency. Expected improvement: ~651ms → ~50ms (92% reduction). This is part 1 of 4 in a systematic double-parse elimination plan. Phase 2: Configuration Change Detection ✅No configuration changes detected. All modified files are implementation code:
No linting rules, build configs, test thresholds, or other risky configuration changes. Phase 3: Test Integrity ✅Test changes are strengthening, not weakening:
The test approach is proper - adding coverage for new functionality without relaxing existing assertions. Phase 4: Code Quality Review ✅Implementation Quality: High ✅ Positive Aspects:
🔍 Minor Observations:
🎯 Specific Technical Validation:
📋 Language Configuration Accuracy:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck in double AST parsing Critical ConcernsNone identified. This is a well-implemented performance optimization with proper testing. Final Recommendation
CI Dependencies: Success depends on |
Greptile SummaryThis PR adds native (Rust) AST node extraction for all 7 non-JS languages — Python, Go, Rust, Java, C#, Ruby, and PHP — by introducing a shared
Confidence Score: 4/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[buildAstNodes - ast.js] --> B{symbols.astNodes?.length?}
B -- yes --> C[Native path: iterate pre-extracted AstNodes\nAll languages: Python Go Rust Java C# Ruby PHP]
B -- no --> D{WALK_EXTENSIONS && symbols._tree?}
D -- yes --> E[WASM path: walkAst on tree-sitter root\nJS / TS / TSX only]
D -- no --> F[Skip - no AST nodes for this file]
G[Rust extract - e.g. java.rs] --> H[walk_node - definitions/calls]
G --> I[walk_ast_nodes_with_config - helpers.rs]
I --> J{node.kind matches config?}
J -- new_types --> K[Push kind=new\nextract_constructor_name]
J -- throw_types --> L[Push kind=throw\nextract_throw_target]
J -- await_types --> M[Push kind=await\nextract_awaited_name]
J -- string_types --> N[Strip prefixes and quotes\nPush kind=string if content len >= 2]
J -- regex_types --> O[Push kind=regex]
K & L & M & N & O --> P[Recurse all children]
Last reviewed commit: 4c78a4d
|
| await_types: &["await_expression"], | ||
| string_types: &["string_literal", "verbatim_string_literal"], | ||
| regex_types: &[], | ||
| quote_chars: &['"', '@'], |
There was a problem hiding this comment.
C# verbatim strings (@"text") may lose @ characters from the string content itself. For example, @"test@" would incorrectly become test instead of test@ because trim_end_matches removes all trailing @ and " characters. The @ is a prefix modifier, not a delimiter, so it should only be stripped from the start. Consider handling verbatim strings separately or removing @ from quote_chars.
There was a problem hiding this comment.
Fixed in f0dbf05 — removed \ from \ and added a prefix-only \ before the symmetric quote-char trimming. Now \ correctly yields \ instead of .
The `@` in C# verbatim strings (`@"text"`) is a prefix modifier, not a delimiter. Previously it was included in `quote_chars` which caused `trim_end_matches` to strip trailing `@` from string content (e.g. `@"test@"` would become `test` instead of `test@`). Remove `@` from CSHARP_AST_CONFIG quote_chars and instead strip it as a start-only prefix before processing quote characters. Impact: 1 functions changed, 7 affected
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
Fixed the C# verbatim string handling in f0dbf05: removed |
| if config.new_types.contains(&kind) { | ||
| let name = extract_constructor_name(node, source); | ||
| let text = truncate(node_text(node, source), AST_TEXT_MAX); | ||
| ast_nodes.push(AstNode { | ||
| kind: "new".to_string(), | ||
| name, | ||
| line: start_line(node), | ||
| text: Some(text), | ||
| receiver: None, | ||
| }); | ||
| return; | ||
| } | ||
|
|
||
| if config.throw_types.contains(&kind) { | ||
| let name = extract_throw_target(node, source, config); | ||
| let text = extract_child_expression_text(node, source); | ||
| ast_nodes.push(AstNode { | ||
| kind: "throw".to_string(), | ||
| name, | ||
| line: start_line(node), | ||
| text, | ||
| receiver: None, | ||
| }); | ||
| return; | ||
| } | ||
|
|
||
| if config.await_types.contains(&kind) { | ||
| let name = extract_awaited_name(node, source); | ||
| let text = extract_child_expression_text(node, source); | ||
| ast_nodes.push(AstNode { | ||
| kind: "await".to_string(), | ||
| name, | ||
| line: start_line(node), | ||
| text, | ||
| receiver: None, | ||
| }); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Early returns in new_types, throw_types, and await_types blocks prevent child recursion, which silently drops nested nodes and string literals.
The C# fixture contains throw new ArgumentNullException("result") at line 202. When the walker hits the throw_statement, it pushes a throw node at line 223 and returns immediately at line 228, never visiting the object_creation_expression child. This causes the C# native parity test 'C#: extracts new as kind:new' (line 365–371 of ast-all-langs.test.js) to fail, since the only object_creation_expression in the fixture is unreachable.
The string_types (lines 244–269) and regex_types (lines 271–283) blocks correctly fall through to child recursion without early returns. Apply the same pattern here — push the node, then fall through to the child-recursion loop at lines 285–289:
| if config.new_types.contains(&kind) { | |
| let name = extract_constructor_name(node, source); | |
| let text = truncate(node_text(node, source), AST_TEXT_MAX); | |
| ast_nodes.push(AstNode { | |
| kind: "new".to_string(), | |
| name, | |
| line: start_line(node), | |
| text: Some(text), | |
| receiver: None, | |
| }); | |
| return; | |
| } | |
| if config.throw_types.contains(&kind) { | |
| let name = extract_throw_target(node, source, config); | |
| let text = extract_child_expression_text(node, source); | |
| ast_nodes.push(AstNode { | |
| kind: "throw".to_string(), | |
| name, | |
| line: start_line(node), | |
| text, | |
| receiver: None, | |
| }); | |
| return; | |
| } | |
| if config.await_types.contains(&kind) { | |
| let name = extract_awaited_name(node, source); | |
| let text = extract_child_expression_text(node, source); | |
| ast_nodes.push(AstNode { | |
| kind: "await".to_string(), | |
| name, | |
| line: start_line(node), | |
| text, | |
| receiver: None, | |
| }); | |
| return; | |
| } | |
| if config.new_types.contains(&kind) { | |
| let name = extract_constructor_name(node, source); | |
| let text = truncate(node_text(node, source), AST_TEXT_MAX); | |
| ast_nodes.push(AstNode { | |
| kind: "new".to_string(), | |
| name, | |
| line: start_line(node), | |
| text: Some(text), | |
| receiver: None, | |
| }); | |
| // Fall through to recurse children | |
| } else if config.throw_types.contains(&kind) { | |
| let name = extract_throw_target(node, source, config); | |
| let text = extract_child_expression_text(node, source); | |
| ast_nodes.push(AstNode { | |
| kind: "throw".to_string(), | |
| name, | |
| line: start_line(node), | |
| text, | |
| receiver: None, | |
| }); | |
| // Fall through to recurse children | |
| } else if config.await_types.contains(&kind) { | |
| let name = extract_awaited_name(node, source); | |
| let text = extract_child_expression_text(node, source); | |
| ast_nodes.push(AstNode { | |
| kind: "await".to_string(), | |
| name, | |
| line: start_line(node), | |
| text, | |
| receiver: None, | |
| }); | |
| // Fall through to recurse children |
Note: Converting to if/else if prevents double-counting (a node cannot match both new_types and throw_types).
There was a problem hiding this comment.
Good catch — the early returns prevent child recursion, so nested nodes like new inside throw new ... or strings inside constructor args are silently dropped. Will switch to if/else if pattern that pushes the node then falls through to the child-recursion loop, as suggested. This also prevents double-counting since node types are mutually exclusive.
There was a problem hiding this comment.
Fixed in 29eb235 — switched to if/else if pattern that falls through to child recursion. Nested nodes like new inside throw new ... are now properly visited.
| pub const RUST_AST_CONFIG: LangAstConfig = LangAstConfig { | ||
| new_types: &[], | ||
| throw_types: &[], | ||
| await_types: &["await_expression"], | ||
| string_types: &["string_literal", "raw_string_literal"], | ||
| regex_types: &[], | ||
| quote_chars: &['"'], | ||
| }; |
There was a problem hiding this comment.
Rust raw_string_literal names are not trimmed correctly because the r prefix is not in quote_chars.
RUST_AST_CONFIG uses quote_chars: &['"'], but Rust raw strings start with r or r#*, not ". The trim_start_matches at line 249 only strips characters in quote_chars, so it stops at the leading r:
r"hello"→ trim_start no-op → trim_end removes"→ name =r"hello❌r#"hello"#→ trim_start no-op → trim_end removes#→ name =r#"hello"#❌- Expected:
hello
Additionally, there is no fixture.rs entry in LANG_FIXTURES (line 159–229 of ast-all-langs.test.js), so this bug will not be caught by CI.
A minimal fix is to add a separate raw_string_prefix_chars field to LangAstConfig, or special-case raw_string_literal to strip r#* before the symmetric quote trimming. For example:
| pub const RUST_AST_CONFIG: LangAstConfig = LangAstConfig { | |
| new_types: &[], | |
| throw_types: &[], | |
| await_types: &["await_expression"], | |
| string_types: &["string_literal", "raw_string_literal"], | |
| regex_types: &[], | |
| quote_chars: &['"'], | |
| }; | |
| pub struct LangAstConfig { | |
| pub new_types: &'static [&'static str], | |
| pub throw_types: &'static [&'static str], | |
| pub await_types: &'static [&'static str], | |
| pub string_types: &'static [&'static str], | |
| pub regex_types: &'static [&'static str], | |
| pub quote_chars: &'static [char], | |
| // Prefix characters to strip from raw/verbatim strings (in addition to quote_chars). | |
| pub raw_prefix_chars: &'static [char], | |
| } |
Then in walk_ast_nodes_with_config, strip raw_prefix_chars before quote_chars for raw_string_literal nodes.
There was a problem hiding this comment.
Valid — Rust raw strings like \ and \ have the /\ prefix that isn't in , so trimming only removes \ and leaves the prefix in the name. Will add a \ field (or special-case raw string types) to strip /\ before the symmetric quote trimming. Also agree this needs a Rust fixture to catch regressions.
There was a problem hiding this comment.
Fixed in 29eb235 — added trim_start_matches for r and # when the node kind contains raw_string, and trim_end_matches for # before the trailing quote trim.
| const LANG_FIXTURES = { | ||
| 'fixture.py': ` | ||
| def process(data): | ||
| raise ValueError("bad input") | ||
|
|
||
| async def fetch(): | ||
| result = await get_data() | ||
| url = "https://api.example.com/data" | ||
| return result | ||
| `, | ||
| 'fixture.go': ` | ||
| package main | ||
|
|
||
| import "fmt" | ||
|
|
||
| func main() { | ||
| msg := "hello world from go" | ||
| raw := \`raw string literal\` | ||
| fmt.Println(msg, raw) | ||
| } | ||
| `, | ||
| 'fixture.java': ` | ||
| import java.util.ArrayList; | ||
|
|
||
| public class Main { | ||
| public void run() { | ||
| ArrayList<String> list = new ArrayList<>(); | ||
| String msg = "hello from java"; | ||
| if (list.isEmpty()) { | ||
| throw new IllegalArgumentException("empty list"); | ||
| } | ||
| } | ||
| } | ||
| `, | ||
| 'fixture.cs': ` | ||
| using System; | ||
| using System.Threading.Tasks; | ||
|
|
||
| public class Service { | ||
| public async Task<string> FetchAsync() { | ||
| var result = await GetDataAsync(); | ||
| string msg = "hello from csharp"; | ||
| if (result == null) { | ||
| throw new ArgumentNullException("result"); | ||
| } | ||
| return msg; | ||
| } | ||
| } | ||
| `, | ||
| 'fixture.rb': ` | ||
| class Greeter | ||
| def greet(name) | ||
| msg = "hello from ruby" | ||
| pattern = /^[A-Z][a-z]+$/ | ||
| puts msg | ||
| end | ||
| end | ||
| `, | ||
| 'fixture.php': `<?php | ||
| class UserService { | ||
| public function createUser(string $name): User { | ||
| $user = new User($name); | ||
| $msg = "created user"; | ||
| if (!$user->isValid()) { | ||
| throw new \\InvalidArgumentException("invalid user"); | ||
| } | ||
| return $user; | ||
| } | ||
| } | ||
| `, | ||
| }; |
There was a problem hiding this comment.
LANG_FIXTURES is missing a Rust entry, leaving the Rust extractor without native parity test coverage.
LANG_FIXTURES covers Python, Go, Java, C#, Ruby, and PHP, but not Rust. Since rust_lang.rs now declares RUST_AST_CONFIG with support for await_expression, string_literal, and raw_string_literal, a Rust fixture should be added to catch any bugs (e.g., the raw-string-trimming issue noted in the parallel comment on helpers.rs).
Add a Rust fixture:
| const LANG_FIXTURES = { | |
| 'fixture.py': ` | |
| def process(data): | |
| raise ValueError("bad input") | |
| async def fetch(): | |
| result = await get_data() | |
| url = "https://api.example.com/data" | |
| return result | |
| `, | |
| 'fixture.go': ` | |
| package main | |
| import "fmt" | |
| func main() { | |
| msg := "hello world from go" | |
| raw := \`raw string literal\` | |
| fmt.Println(msg, raw) | |
| } | |
| `, | |
| 'fixture.java': ` | |
| import java.util.ArrayList; | |
| public class Main { | |
| public void run() { | |
| ArrayList<String> list = new ArrayList<>(); | |
| String msg = "hello from java"; | |
| if (list.isEmpty()) { | |
| throw new IllegalArgumentException("empty list"); | |
| } | |
| } | |
| } | |
| `, | |
| 'fixture.cs': ` | |
| using System; | |
| using System.Threading.Tasks; | |
| public class Service { | |
| public async Task<string> FetchAsync() { | |
| var result = await GetDataAsync(); | |
| string msg = "hello from csharp"; | |
| if (result == null) { | |
| throw new ArgumentNullException("result"); | |
| } | |
| return msg; | |
| } | |
| } | |
| `, | |
| 'fixture.rb': ` | |
| class Greeter | |
| def greet(name) | |
| msg = "hello from ruby" | |
| pattern = /^[A-Z][a-z]+$/ | |
| puts msg | |
| end | |
| end | |
| `, | |
| 'fixture.php': `<?php | |
| class UserService { | |
| public function createUser(string $name): User { | |
| $user = new User($name); | |
| $msg = "created user"; | |
| if (!$user->isValid()) { | |
| throw new \\InvalidArgumentException("invalid user"); | |
| } | |
| return $user; | |
| } | |
| } | |
| `, | |
| }; | |
| const LANG_FIXTURES = { | |
| 'fixture.py': ` | |
| def process(data): | |
| raise ValueError("bad input") | |
| async def fetch(): | |
| result = await get_data() | |
| url = "https://api.example.com/data" | |
| return result | |
| `, | |
| 'fixture.go': ` | |
| package main | |
| import "fmt" | |
| func main() { | |
| msg := "hello world from go" | |
| raw := \`raw string literal\` | |
| fmt.Println(msg, raw) | |
| } | |
| `, | |
| 'fixture.java': ` | |
| import java.util.ArrayList; | |
| public class Main { | |
| public void run() { | |
| ArrayList<String> list = new ArrayList<>(); | |
| String msg = "hello from java"; | |
| if (list.isEmpty()) { | |
| throw new IllegalArgumentException("empty list"); | |
| } | |
| } | |
| } | |
| `, | |
| 'fixture.cs': ` | |
| using System; | |
| using System.Threading.Tasks; | |
| public class Service { | |
| public async Task<string> FetchAsync() { | |
| var result = await GetDataAsync(); | |
| string msg = "hello from csharp"; | |
| if (result == null) { | |
| throw new ArgumentNullException("result"); | |
| } | |
| return msg; | |
| } | |
| } | |
| `, | |
| 'fixture.rs': ` | |
| async fn fetch(url: &str) -> String { | |
| let msg = "hello from rust"; | |
| let raw = r#"raw string literal"#; | |
| let result = fetch_data(url).await; | |
| result | |
| } | |
| `, | |
| 'fixture.rb': ` | |
| class Greeter | |
| def greet(name) | |
| msg = "hello from ruby" | |
| pattern = /^[A-Z][a-z]+$/ | |
| puts msg | |
| end | |
| end | |
| `, | |
| 'fixture.php': `<?php | |
| class UserService { | |
| public function createUser(string $name): User { | |
| $user = new User($name); | |
| $msg = "created user"; | |
| if (!$user->isValid()) { | |
| throw new \\InvalidArgumentException("invalid user"); | |
| } | |
| return $user; | |
| } | |
| } | |
| `, | |
| }; |
Then add parity tests for Rust in the test block (around line 336) that verify await_expression is extracted with name fetch_data, string literals are extracted with content (e.g., hello from rust), and raw strings are extracted with correct content (raw string literal without the r#" prefix).
There was a problem hiding this comment.
Agreed — Rust is the only language missing from LANG_FIXTURES. Will add a fixture.rs with await, string literals, and raw strings, plus corresponding parity tests to validate the extractor and catch the raw-string-trimming bug.
There was a problem hiding this comment.
Fixed in 29eb235 — added fixture.rs with await, string literals, and raw strings to LANG_FIXTURES, plus parity test assertions for Rust.
|
Claude encountered an error —— View job I'll analyze this and get back to you. |
|
Claude finished @carlos-alm's task —— View job PR Review: AST Nodes Native ExtractionI'm conducting a thorough review of this performance optimization PR following the structured methodology: Review Progress
Phase 1: Root Cause Analysis ✅Problem Being Solved: The current system has a "double-parse" performance bottleneck where AST nodes (new, throw, await, string, regex) are extracted twice:
This results in ~651ms spent in the Why This Approach: The solution eliminates duplication by:
Justification: This is a legitimate performance optimization addressing real inefficiency. Expected improvement: ~651ms → ~50ms (92% reduction). This is part 1 of 4 in a systematic double-parse elimination plan. Phase 2: Configuration Change Detection ✅No configuration changes detected. All modified files are implementation code:
No linting rules, build configs, test thresholds, or other risky configuration changes. Phase 3: Test Integrity ✅Test changes are strengthening, not weakening:
The test approach is proper - adding coverage for new functionality without relaxing existing assertions. Phase 4: Code Quality Review ✅Implementation Quality: High ✅ Positive Aspects:
🔍 Minor Observations:
🎯 Specific Technical Validation:
📋 Language Configuration Accuracy:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck in double AST parsing Critical ConcernsNone identified. This is a well-implemented performance optimization with proper testing. Final Recommendation
CI Dependencies: Success depends on |
…trim Rust raw strings - Replace early returns with if/else-if fall-through in walk_ast_nodes_with_config so nested nodes (e.g. new inside throw, string args inside new) are not silently dropped. - Strip r and # prefix/suffix from Rust raw_string_literal before symmetric quote trimming, so names like r#"foo"# become "foo". - Add fixture.rs to LANG_FIXTURES with await, string, and raw string coverage, plus native parity assertions for Rust. Impact: 1 functions changed, 0 affected
|
Claude finished @carlos-alm's task —— View job PR Review: AST Nodes Native Extraction
|
| .trim_start_matches('#'); | ||
| let content = without_prefix | ||
| .trim_start_matches(|c: char| config.quote_chars.contains(&c)) | ||
| .trim_end_matches('#') | ||
| .trim_end_matches(|c: char| config.quote_chars.contains(&c)); | ||
| if content.len() < 2 { | ||
| for i in 0..node.child_count() { | ||
| if let Some(child) = node.child(i) { | ||
| walk_ast_nodes_with_config(&child, source, ast_nodes, config); | ||
| } | ||
| } | ||
| return; | ||
| } | ||
| let name = truncate(content, 100); |
There was a problem hiding this comment.
Python string prefixes (r, b, f, u) are not stripped from extracted names
The prefix-stripping logic only handles C#'s @ prefix, but Python has several single and multi-character prefixes that are not stripped:
- Raw strings:
r"hello"→ name becomesr"hello(prefixrremains) - Byte strings:
b"hello"→ name becomesb"hello - F-strings:
f"hello"→ name becomesf"hello - Combined:
rb"hello",br"hello",fr"hello", etc.
For r"hello":
trim_start_matches('@')→r"hello"(no change,r≠@)trim_start_matches(|c| ['\'', '"'].contains(&c))→ stops atr, no changetrim_end_matches(|c| ['\'', '"'].contains(&c))→ removes trailing"→r"hello- Extracted name:
r"hello❌ (should behello)
The Python test fixture uses only plain "..." strings, so this bug is not caught by CI tests.
Fix: Add a prefix_chars field to LangAstConfig and strip it before quote-char trimming, similar to the approach used for C#'s @ prefix:
// In LangAstConfig:
pub prefix_chars: &'static [char],
// In PYTHON_AST_CONFIG:
prefix_chars: &['r', 'R', 'b', 'B', 'f', 'F', 'u', 'U'],
// In walk_ast_nodes_with_config, replace:
let without_prefix = raw.trim_start_matches('@');
// with:
let without_at = raw.trim_start_matches('@');
let without_prefix = without_at.trim_start_matches(|c: char| config.prefix_chars.contains(&c));There was a problem hiding this comment.
Fixed in 4c78a4d — added string_prefixes field to LangAstConfig. PYTHON_AST_CONFIG uses &['r', 'b', 'f', 'u', 'R', 'B', 'F', 'U']. The walker now calls trim_start_matches with these chars before quote trimming, handling multi-char combos like rb/fr iteratively.
| .trim_end_matches('#') | ||
| .trim_end_matches(|c: char| config.quote_chars.contains(&c)); | ||
| if content.len() < 2 { | ||
| for i in 0..node.child_count() { |
There was a problem hiding this comment.
content.len() < 2 counts bytes instead of characters, causing inconsistency with JS
Rust's len() returns byte length, while JS's .length uses UTF-16 code unit count. This causes divergence between native (Rust) and WASM (JS) paths:
- For non-ASCII characters like
"é"(U+00E9):- Rust: 2 bytes → passes the filter (>= 2) → string is extracted
- JS: 1 UTF-16 unit → fails the filter (< 2) → string is skipped
This means the native extractor captures single non-ASCII character strings that the WASM path filters out, producing different output between the two extraction modes.
Fix: Use character count instead of byte count:
| for i in 0..node.child_count() { | |
| if content.chars().count() < 2 { |
There was a problem hiding this comment.
Fixed in 4c78a4d — changed content.len() < 2 to content.chars().count() < 2 for character-level counting matching JS .length behavior.
| 'fixture.py': ` | ||
| def process(data): | ||
| raise ValueError("bad input") | ||
|
|
||
| async def fetch(): | ||
| result = await get_data() | ||
| url = "https://api.example.com/data" | ||
| return result | ||
| `, |
There was a problem hiding this comment.
Python fixture lacks coverage for prefixed string literals
The Python fixture uses only plain "..." strings. Since the native extractor has a bug with prefixed strings (r"", b"", f""), there are no test cases to catch regressions for:
- Raw strings:
r"..." - Byte strings:
b"..." - F-strings:
f"..." - Combined:
rb"...",fr"...", etc.
Fix: Extend the fixture to include at least one raw string and one f-string:
'fixture.py': `
def process(data):
raise ValueError("bad input")
async def fetch():
result = await get_data()
url = "https://api.example.com/data"
pattern = r"^[a-z]+"
msg = f"got {result}"
return result
`,And add corresponding assertions:
- The name for
r"^[a-z]+"should be^[a-z]+(notr"^[a-z]+) - The name for
f"got {result}"should begot {result}(notf"got {result})
There was a problem hiding this comment.
Fixed in 4c78a4d — extended Python fixture with r"^[a-z]+\d{3}$", f"hello {data}", and rb"raw bytes value". Added test asserting prefixes are stripped from extracted names.
| let without_prefix = raw.trim_start_matches('@') | ||
| .trim_start_matches('r') | ||
| .trim_start_matches('#'); | ||
| let content = without_prefix | ||
| .trim_start_matches(|c: char| config.quote_chars.contains(&c)) | ||
| .trim_end_matches('#') | ||
| .trim_end_matches(|c: char| config.quote_chars.contains(&c)); |
There was a problem hiding this comment.
trim_end_matches('#') strips # from content in all languages
trim_end_matches('#') is applied universally to every string kind across every language, but it is only correct for Rust raw string delimiters (e.g. r#"..."#). Any string in any other language whose content ends with # — before its closing quote — will have the trailing # silently removed.
A concrete example: the Python string "C#" (a perfectly normal string containing the language name) goes through:
trim_start_matchessteps →C#"trim_end_matches('#')→C"←#stripped from contenttrim_end_matches('"')→C
Extracted name: C instead of C#.
The same bug affects any #-terminated content in Go ("tag#"), Java, C#, Ruby, PHP, etc. None of the current test fixtures include a string ending with #, so CI will not catch this.
The fix is to apply trim_end_matches('#') only for node kinds in the raw_string_types set. One clean approach is to add a raw_string_types field to LangAstConfig and gate the two raw-string-specific trimming steps on membership:
// in LangAstConfig:
pub raw_string_types: &'static [&'static str],
// in walk_ast_nodes_with_config string branch:
let is_raw = config.raw_string_types.contains(&kind);
let without_prefix = if is_raw {
raw.trim_start_matches('@')
.trim_start_matches('r')
.trim_start_matches('#')
} else {
raw.trim_start_matches('@')
};
let content = without_prefix
.trim_start_matches(|c: char| config.quote_chars.contains(&c));
let content = if is_raw {
content.trim_end_matches('#')
} else {
content
};
let content = content.trim_end_matches(|c: char| config.quote_chars.contains(&c));Then set raw_string_types: &["raw_string_literal"] for RUST_AST_CONFIG and &[] for all other configs.
There was a problem hiding this comment.
Fixed in 4c78a4d — scoped hash trimming to raw strings only. Added is_raw_string = kind.contains("raw_string") check, and both trim_start_matches and trim_end_matches for "#" now only execute inside that conditional. A Python string like "C#" will no longer have its # stripped.
… counting, scoped hash trimming - Strip Python string prefixes (r, b, f, u and combos like rb, fr) via new string_prefixes field on LangAstConfig - Use chars().count() instead of len() for character-level counting matching JS .length behavior on non-ASCII content - Scope # delimiter trimming to raw_string node types only, preventing incorrect stripping from content like "C#" in non-Rust languages - Extend Python fixture with r"", f"", rb"" strings and add assertions verifying prefix stripping Impact: 1 functions changed, 7 affected
|
Claude finished @carlos-alm's task —— View job PR Review: AST Nodes Native ExtractionI'm conducting a thorough review of this performance optimization PR following the structured methodology: Review Progress
Phase 1: Root Cause Analysis ✅Problem Being Solved: The current system has a "double-parse" performance bottleneck where AST nodes (new, throw, await, string, regex) are extracted twice:
This results in ~651ms spent in the Why This Approach: The solution eliminates duplication by:
Justification: This is a legitimate performance optimization addressing real inefficiency. Expected improvement: ~651ms → ~50ms (92% reduction). This is part 1 of 4 in a systematic double-parse elimination plan. Phase 2: Configuration Change Detection ✅No configuration changes detected. All modified files are implementation code:
No linting rules, build configs, test thresholds, or other risky configuration changes. Phase 3: Test Integrity ✅Test changes are strengthening, not weakening:
The test approach is proper - adding coverage for new functionality without relaxing existing assertions. Phase 4: Code Quality Review ✅Implementation Quality: High ✅ Positive Aspects:
🎯 Specific Technical Validation:
📋 Language Configuration Accuracy:
Root Cause AnalysisProblem Being Solved: Legitimate performance bottleneck in double AST parsing Critical ConcernsNone identified. All previously reported bugs have been fixed:
Final Recommendation
CI Dependencies: Success depends on |
|
Addressed all 4 new review comments in 4c78a4d:
|

Summary
walk_ast_nodes_with_config()to Rusthelpers.rswith per-language AST node type configs (LangAstConfigstruct)walk_ast_nodes_with_config()duringextract(), producingastNodesnativelybuildAstNodes()inast.jsnow checkssymbols.astNodesfirst for all languages, falling back to WASM tree walk only for JS/TS/TSX when no native data is presentLanguage coverage
raise_statementawaitstringinterpreted_string_literal,raw_string_literalawait_expressionstring_literal,raw_string_literalobject_creation_expressionthrow_statementstring_literalobject_creation_expressionthrow_statement,throw_expressionawait_expressionstring_literal,verbatim_string_literalstringregexobject_creation_expressionthrow_expressionstring,encapsed_stringExpected performance impact
Once the native binary is compiled with these changes:
astMsphase: ~651ms → ~50ms (DB inserts only, no WASM tree walk needed)Test plan
buildAstNodesacceptsastNodesfor non-JS files (.py, .java)build-native.ymlcompiles successfully with new Rust code