-
Notifications
You must be signed in to change notification settings - Fork 3
fix(agents): validate UTF-8 boundaries in mention parsing #37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Greptile OverviewGreptile SummaryAdds UTF-8 boundary validation to prevent panics when slicing strings at byte offsets that fall inside multi-byte characters. Introduces two helper functions ( Key changes:
Critical concern: The implementation silently adjusts boundaries rather than failing fast. Since Rust's regex crate already guarantees UTF-8 safe offsets from Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| src/cortex-agents/src/mention.rs | Added UTF-8 safe slicing helpers to prevent panics on multi-byte characters. Implementation has semantic issues with boundary adjustment behavior. |
Sequence Diagram
sequenceDiagram
participant User
participant Parser as parse_message_for_agent
participant Extract as extract_mention_and_text
participant Find as find_first_valid_mention
participant Regex as MENTION_REGEX
participant SafeSlice as safe_slice_up_to/safe_slice_from
User->>Parser: "日本語 @general task"
Parser->>Extract: extract_mention_and_text(text, valid_agents)
Extract->>Find: find_first_valid_mention(text, valid_agents)
Find->>Regex: captures_iter(text)
Regex-->>Find: AgentMention {start: 12, end: 20}
Note over Regex: Regex guarantees UTF-8 safe offsets
Find-->>Extract: Some(AgentMention)
Extract->>SafeSlice: safe_slice_up_to(text, 12)
Note over SafeSlice: Checks is_char_boundary(12)
SafeSlice-->>Extract: "日本語 "
Extract->>SafeSlice: safe_slice_from(text, 20)
SafeSlice-->>Extract: " task"
Extract-->>Parser: Some((mention, "日本語 task"))
Parser->>SafeSlice: safe_slice_up_to(text, 12).trim().is_empty()
SafeSlice-->>Parser: false
Parser-->>User: ParsedAgentMessage {should_invoke_task: false}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 file reviewed, 4 comments
| fn safe_slice_up_to(text: &str, pos: usize) -> &str { | ||
| if pos >= text.len() { | ||
| return text; | ||
| } | ||
| if text.is_char_boundary(pos) { | ||
| return &text[..pos]; | ||
| } | ||
| // Find the nearest valid boundary by searching backwards | ||
| let mut valid_pos = pos; | ||
| while valid_pos > 0 && !text.is_char_boundary(valid_pos) { | ||
| valid_pos -= 1; | ||
| } | ||
| &text[..valid_pos] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusting boundaries by searching backward/forward changes the semantic meaning of the slice. When mention.start or mention.end point inside a multi-byte character, this indicates the regex library returned invalid byte offsets.
The safer approach is to panic with a clear error message rather than silently returning incorrect data:
| fn safe_slice_up_to(text: &str, pos: usize) -> &str { | |
| if pos >= text.len() { | |
| return text; | |
| } | |
| if text.is_char_boundary(pos) { | |
| return &text[..pos]; | |
| } | |
| // Find the nearest valid boundary by searching backwards | |
| let mut valid_pos = pos; | |
| while valid_pos > 0 && !text.is_char_boundary(valid_pos) { | |
| valid_pos -= 1; | |
| } | |
| &text[..valid_pos] | |
| fn safe_slice_up_to(text: &str, pos: usize) -> &str { | |
| if pos >= text.len() { | |
| return text; | |
| } | |
| if text.is_char_boundary(pos) { | |
| return &text[..pos]; | |
| } | |
| panic!( | |
| "Invalid UTF-8 boundary at position {} in string of length {}. \ | |
| This indicates a bug in the mention regex parsing.", | |
| pos, text.len() | |
| ); | |
| } |
Rust's regex crate already guarantees UTF-8 safe offsets, so if this condition occurs it's a critical bug that should be surfaced, not masked.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cortex-agents/src/mention.rs
Line: 25:37
Comment:
Adjusting boundaries by searching backward/forward changes the semantic meaning of the slice. When `mention.start` or `mention.end` point inside a multi-byte character, this indicates the regex library returned invalid byte offsets.
The safer approach is to panic with a clear error message rather than silently returning incorrect data:
```suggestion
fn safe_slice_up_to(text: &str, pos: usize) -> &str {
if pos >= text.len() {
return text;
}
if text.is_char_boundary(pos) {
return &text[..pos];
}
panic!(
"Invalid UTF-8 boundary at position {} in string of length {}. \
This indicates a bug in the mention regex parsing.",
pos, text.len()
);
}
```
Rust's `regex` crate already guarantees UTF-8 safe offsets, so if this condition occurs it's a critical bug that should be surfaced, not masked.
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.| fn safe_slice_from(text: &str, pos: usize) -> &str { | ||
| if pos >= text.len() { | ||
| return ""; | ||
| } | ||
| if text.is_char_boundary(pos) { | ||
| return &text[pos..]; | ||
| } | ||
| // Find the nearest valid boundary by searching forwards | ||
| let mut valid_pos = pos; | ||
| while valid_pos < text.len() && !text.is_char_boundary(valid_pos) { | ||
| valid_pos += 1; | ||
| } | ||
| &text[valid_pos..] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as safe_slice_up_to - adjusting boundaries forward can skip characters that should be included. Should panic instead:
| fn safe_slice_from(text: &str, pos: usize) -> &str { | |
| if pos >= text.len() { | |
| return ""; | |
| } | |
| if text.is_char_boundary(pos) { | |
| return &text[pos..]; | |
| } | |
| // Find the nearest valid boundary by searching forwards | |
| let mut valid_pos = pos; | |
| while valid_pos < text.len() && !text.is_char_boundary(valid_pos) { | |
| valid_pos += 1; | |
| } | |
| &text[valid_pos..] | |
| fn safe_slice_from(text: &str, pos: usize) -> &str { | |
| if pos >= text.len() { | |
| return ""; | |
| } | |
| if text.is_char_boundary(pos) { | |
| return &text[pos..]; | |
| } | |
| panic!( | |
| "Invalid UTF-8 boundary at position {} in string of length {}. \ | |
| This indicates a bug in the mention regex parsing.", | |
| pos, text.len() | |
| ); | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cortex-agents/src/mention.rs
Line: 45:57
Comment:
Same issue as `safe_slice_up_to` - adjusting boundaries forward can skip characters that should be included. Should panic instead:
```suggestion
fn safe_slice_from(text: &str, pos: usize) -> &str {
if pos >= text.len() {
return "";
}
if text.is_char_boundary(pos) {
return &text[pos..];
}
panic!(
"Invalid UTF-8 boundary at position {} in string of length {}. \
This indicates a bug in the mention regex parsing.",
pos, text.len()
);
}
```
How can I resolve this? If you propose a fix, please make it concise.| #[test] | ||
| fn test_safe_slice_from_multibyte() { | ||
| let text = "こんにちは"; | ||
| assert_eq!(safe_slice_from(text, 3), "んにちは"); // Valid boundary | ||
| // Position 4 is inside second character, should skip to position 6 | ||
| assert_eq!(safe_slice_from(text, 4), "にちは"); | ||
| assert_eq!(safe_slice_from(text, 5), "にちは"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test demonstrates the semantic issue. Position 4 is inside "ん" (bytes 3-5), so calling safe_slice_from(text, 4) skips forward to byte 6, which loses the character "ん" entirely.
Expected: panic or error indicating invalid offset
Actual: silently skips character
If mention.end pointed to byte 4, the remaining text after the mention would incorrectly exclude "ん".
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cortex-agents/src/mention.rs
Line: 383:390
Comment:
This test demonstrates the semantic issue. Position 4 is inside "ん" (bytes 3-5), so calling `safe_slice_from(text, 4)` skips forward to byte 6, which loses the character "ん" entirely.
Expected: panic or error indicating invalid offset
Actual: silently skips character
If `mention.end` pointed to byte 4, the remaining text after the mention would incorrectly exclude "ん".
How can I resolve this? If you propose a fix, please make it concise.
Additional Comments (1)
Check the stack trace from issue #5251 to verify the panic actually originates from this code path. Prompt To Fix With AIThis is a comment left during a code review.
Path: src/cortex-agents/src/mention.rs
Line: 109:118
Comment:
The regex crate's `Match::start()` and `Match::end()` already return UTF-8 safe byte offsets. If you're seeing panics with multi-byte characters, the root cause is likely elsewhere (e.g., string mutation between regex matching and slicing, or incorrect manual offset calculation).
Check the stack trace from issue #5251 to verify the panic actually originates from this code path.
How can I resolve this? If you propose a fix, please make it concise. |
This PR consolidates the following UTF-8 safety fixes: - #31: Use safe UTF-8 slicing in import command base64 extraction - #32: Use safe UTF-8 slicing for session IDs in notifications - #33: Use char-aware string truncation for UTF-8 safety in resume - #35: Use safe UTF-8 slicing for session IDs in lock command - #37: Validate UTF-8 boundaries in mention parsing All changes ensure safe string operations that respect UTF-8 boundaries: - Replaced direct byte slicing with char-aware methods - Added floor_char_boundary checks before slicing - Prevents panics from slicing multi-byte characters
|
Consolidated into #70 - fix: consolidated UTF-8 safety improvements for string slicing |
Summary
Fixes #5251 - Agent mention parsing panics on multi-byte UTF-8.
Problem
The mention parser uses direct byte-offset string slicing that can panic when the offset falls inside a multi-byte UTF-8 character.
Problematic patterns like
&text[..mention.start]wheremention.startis a byte offset that could be inside a multi-byte character.Solution
Added two helper functions for safe UTF-8 boundary-aware string slicing:
safe_slice_up_to(text, pos)- Returns slice from start to valid boundary at or beforepossafe_slice_from(text, pos)- Returns slice from valid boundary at or afterposto endApplied these functions to all three locations that previously used direct slicing:
extract_mention_and_text- for building remaining textstarts_with_mention- for checking prefixparse_message_for_agent- for checking if mention is at startTesting
cargo check -p cortex-agents