fix(agents): validate UTF-8 boundaries in mention parsing #37

echobt · 2026-02-04T14:53:30Z

Summary

Fixes #5251 - Agent mention parsing panics on multi-byte UTF-8.

Problem

The mention parser uses direct byte-offset string slicing that can panic when the offset falls inside a multi-byte UTF-8 character.

Problematic patterns like &text[..mention.start] where mention.start is a byte offset that could be inside a multi-byte character.

Solution

Added two helper functions for safe UTF-8 boundary-aware string slicing:

safe_slice_up_to(text, pos) - Returns slice from start to valid boundary at or before pos
safe_slice_from(text, pos) - Returns slice from valid boundary at or after pos to end

Applied these functions to all three locations that previously used direct slicing:

extract_mention_and_text - for building remaining text
starts_with_mention - for checking prefix
parse_message_for_agent - for checking if mention is at start

Testing

Added 8 new tests covering multi-byte UTF-8 scenarios:
- Japanese characters (3-byte UTF-8)
- Emoji (4-byte UTF-8)
- Mixed ASCII, CJK, and emoji content
All 17 mention tests pass
Verified with cargo check -p cortex-agents

greptile-apps · 2026-02-04T14:56:48Z

Greptile Overview

Greptile Summary

Adds UTF-8 boundary validation to prevent panics when slicing strings at byte offsets that fall inside multi-byte characters. Introduces two helper functions (safe_slice_up_to and safe_slice_from) that adjust offsets to the nearest character boundary.

Key changes:

Added safe_slice_up_to: searches backward for valid boundary
Added safe_slice_from: searches forward for valid boundary
Applied helpers to 3 slicing locations in mention parsing
Added 8 new tests covering Japanese, emoji, and mixed UTF-8

Critical concern: The implementation silently adjusts boundaries rather than failing fast. Since Rust's regex crate already guarantees UTF-8 safe offsets from Match::start() and Match::end(), boundary adjustment masks potential bugs elsewhere. If an invalid offset occurs, it indicates a real issue that should be surfaced with a clear panic message rather than hidden by silently skipping characters.

Confidence Score: 3/5

Safe to merge but implementation approach should be reconsidered
The code prevents panics and adds comprehensive test coverage, but silently adjusting UTF-8 boundaries can cause data loss (characters being skipped). Since the regex crate guarantees safe offsets, boundary violations indicate bugs that should be surfaced, not masked. The current approach trades correctness for robustness.
Review the semantic behavior of boundary adjustment in src/cortex-agents/src/mention.rs

Important Files Changed

Filename	Overview
src/cortex-agents/src/mention.rs	Added UTF-8 safe slicing helpers to prevent panics on multi-byte characters. Implementation has semantic issues with boundary adjustment behavior.

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as parse_message_for_agent
    participant Extract as extract_mention_and_text
    participant Find as find_first_valid_mention
    participant Regex as MENTION_REGEX
    participant SafeSlice as safe_slice_up_to/safe_slice_from
    
    User->>Parser: "日本語 @general task"
    Parser->>Extract: extract_mention_and_text(text, valid_agents)
    Extract->>Find: find_first_valid_mention(text, valid_agents)
    Find->>Regex: captures_iter(text)
    Regex-->>Find: AgentMention {start: 12, end: 20}
    Note over Regex: Regex guarantees UTF-8 safe offsets
    Find-->>Extract: Some(AgentMention)
    Extract->>SafeSlice: safe_slice_up_to(text, 12)
    Note over SafeSlice: Checks is_char_boundary(12)
    SafeSlice-->>Extract: "日本語 "
    Extract->>SafeSlice: safe_slice_from(text, 20)
    SafeSlice-->>Extract: " task"
    Extract-->>Parser: Some((mention, "日本語  task"))
    Parser->>SafeSlice: safe_slice_up_to(text, 12).trim().is_empty()
    SafeSlice-->>Parser: false
    Parser-->>User: ParsedAgentMessage {should_invoke_task: false}

greptile-apps

_{1 file reviewed, 4 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-04T14:56:52Z

src/cortex-agents/src/mention.rs

+fn safe_slice_up_to(text: &str, pos: usize) -> &str {
+    if pos >= text.len() {
+        return text;
+    }
+    if text.is_char_boundary(pos) {
+        return &text[..pos];
+    }
+    // Find the nearest valid boundary by searching backwards
+    let mut valid_pos = pos;
+    while valid_pos > 0 && !text.is_char_boundary(valid_pos) {
+        valid_pos -= 1;
+    }
+    &text[..valid_pos]


Adjusting boundaries by searching backward/forward changes the semantic meaning of the slice. When mention.start or mention.end point inside a multi-byte character, this indicates the regex library returned invalid byte offsets.

The safer approach is to panic with a clear error message rather than silently returning incorrect data:

Suggested change

fn safe_slice_up_to(text: &str, pos: usize) -> &str {

if pos >= text.len() {

return text;

}

if text.is_char_boundary(pos) {

return &text[..pos];

}

// Find the nearest valid boundary by searching backwards

let mut valid_pos = pos;

while valid_pos > 0 && !text.is_char_boundary(valid_pos) {

valid_pos -= 1;

}

&text[..valid_pos]

fn safe_slice_up_to(text: &str, pos: usize) -> &str {

if pos >= text.len() {

return text;

}

if text.is_char_boundary(pos) {

return &text[..pos];

}

panic!(

"Invalid UTF-8 boundary at position {} in string of length {}. \

This indicates a bug in the mention regex parsing.",

pos, text.len()

);

}

Rust's regex crate already guarantees UTF-8 safe offsets, so if this condition occurs it's a critical bug that should be surfaced, not masked.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cortex-agents/src/mention.rs Line: 25:37 Comment: Adjusting boundaries by searching backward/forward changes the semantic meaning of the slice. When `mention.start` or `mention.end` point inside a multi-byte character, this indicates the regex library returned invalid byte offsets. The safer approach is to panic with a clear error message rather than silently returning incorrect data: ```suggestion fn safe_slice_up_to(text: &str, pos: usize) -> &str { if pos >= text.len() { return text; } if text.is_char_boundary(pos) { return &text[..pos]; } panic!( "Invalid UTF-8 boundary at position {} in string of length {}. \ This indicates a bug in the mention regex parsing.", pos, text.len() ); } ``` Rust's `regex` crate already guarantees UTF-8 safe offsets, so if this condition occurs it's a critical bug that should be surfaced, not masked. <sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub> How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-04T14:56:53Z

src/cortex-agents/src/mention.rs

+fn safe_slice_from(text: &str, pos: usize) -> &str {
+    if pos >= text.len() {
+        return "";
+    }
+    if text.is_char_boundary(pos) {
+        return &text[pos..];
+    }
+    // Find the nearest valid boundary by searching forwards
+    let mut valid_pos = pos;
+    while valid_pos < text.len() && !text.is_char_boundary(valid_pos) {
+        valid_pos += 1;
+    }
+    &text[valid_pos..]


Same issue as safe_slice_up_to - adjusting boundaries forward can skip characters that should be included. Should panic instead:

Suggested change

fn safe_slice_from(text: &str, pos: usize) -> &str {

if pos >= text.len() {

return "";

}

if text.is_char_boundary(pos) {

return &text[pos..];

}

// Find the nearest valid boundary by searching forwards

let mut valid_pos = pos;

while valid_pos < text.len() && !text.is_char_boundary(valid_pos) {

valid_pos += 1;

}

&text[valid_pos..]

fn safe_slice_from(text: &str, pos: usize) -> &str {

if pos >= text.len() {

return "";

}

if text.is_char_boundary(pos) {

return &text[pos..];

}

panic!(

"Invalid UTF-8 boundary at position {} in string of length {}. \

This indicates a bug in the mention regex parsing.",

pos, text.len()

);

}

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cortex-agents/src/mention.rs Line: 45:57 Comment: Same issue as `safe_slice_up_to` - adjusting boundaries forward can skip characters that should be included. Should panic instead: ```suggestion fn safe_slice_from(text: &str, pos: usize) -> &str { if pos >= text.len() { return ""; } if text.is_char_boundary(pos) { return &text[pos..]; } panic!( "Invalid UTF-8 boundary at position {} in string of length {}. \ This indicates a bug in the mention regex parsing.", pos, text.len() ); } ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-04T14:56:58Z

src/cortex-agents/src/mention.rs

+    #[test]
+    fn test_safe_slice_from_multibyte() {
+        let text = "こんにちは";
+        assert_eq!(safe_slice_from(text, 3), "んにちは"); // Valid boundary
+        // Position 4 is inside second character, should skip to position 6
+        assert_eq!(safe_slice_from(text, 4), "にちは");
+        assert_eq!(safe_slice_from(text, 5), "にちは");
+    }


This test demonstrates the semantic issue. Position 4 is inside "ん" (bytes 3-5), so calling safe_slice_from(text, 4) skips forward to byte 6, which loses the character "ん" entirely.

Expected: panic or error indicating invalid offset
Actual: silently skips character

If mention.end pointed to byte 4, the remaining text after the mention would incorrectly exclude "ん".

Prompt To Fix With AI

This is a comment left during a code review. Path: src/cortex-agents/src/mention.rs Line: 383:390 Comment: This test demonstrates the semantic issue. Position 4 is inside "ん" (bytes 3-5), so calling `safe_slice_from(text, 4)` skips forward to byte 6, which loses the character "ん" entirely. Expected: panic or error indicating invalid offset Actual: silently skips character If `mention.end` pointed to byte 4, the remaining text after the mention would incorrectly exclude "ん". How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-02-04T14:57:01Z

Additional Comments (1)

src/cortex-agents/src/mention.rs
The regex crate's Match::start() and Match::end() already return UTF-8 safe byte offsets. If you're seeing panics with multi-byte characters, the root cause is likely elsewhere (e.g., string mutation between regex matching and slicing, or incorrect manual offset calculation).

Check the stack trace from issue #5251 to verify the panic actually originates from this code path.

Prompt To Fix With AI

This is a comment left during a code review.
Path: src/cortex-agents/src/mention.rs
Line: 109:118

Comment:
The regex crate's `Match::start()` and `Match::end()` already return UTF-8 safe byte offsets. If you're seeing panics with multi-byte characters, the root cause is likely elsewhere (e.g., string mutation between regex matching and slicing, or incorrect manual offset calculation).

Check the stack trace from issue #5251 to verify the panic actually originates from this code path.

How can I resolve this? If you propose a fix, please make it concise.

This PR consolidates the following UTF-8 safety fixes: - #31: Use safe UTF-8 slicing in import command base64 extraction - #32: Use safe UTF-8 slicing for session IDs in notifications - #33: Use char-aware string truncation for UTF-8 safety in resume - #35: Use safe UTF-8 slicing for session IDs in lock command - #37: Validate UTF-8 boundaries in mention parsing All changes ensure safe string operations that respect UTF-8 boundaries: - Replaced direct byte slicing with char-aware methods - Added floor_char_boundary checks before slicing - Prevents panics from slicing multi-byte characters

echobt · 2026-02-04T15:41:25Z

Consolidated into #70 - fix: consolidated UTF-8 safety improvements for string slicing

fix(agents): validate UTF-8 boundaries in mention parsing

de6da50

greptile-apps bot reviewed Feb 4, 2026

View reviewed changes

echobt mentioned this pull request Feb 4, 2026

fix: consolidated UTF-8 safety improvements for string slicing #70

Closed

echobt closed this Feb 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents): validate UTF-8 boundaries in mention parsing #37

fix(agents): validate UTF-8 boundaries in mention parsing #37

Uh oh!

echobt commented Feb 4, 2026

Uh oh!

greptile-apps bot commented Feb 4, 2026

Important Files Changed

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 4, 2026

Uh oh!

greptile-apps bot Feb 4, 2026

Uh oh!

greptile-apps bot Feb 4, 2026

Uh oh!

greptile-apps bot commented Feb 4, 2026

Uh oh!

echobt commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

fix(agents): validate UTF-8 boundaries in mention parsing #37

fix(agents): validate UTF-8 boundaries in mention parsing #37

Uh oh!

Conversation

echobt commented Feb 4, 2026

Summary

Problem

Solution

Testing

Uh oh!

greptile-apps bot commented Feb 4, 2026

Greptile Overview

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 4, 2026

Uh oh!

echobt commented Feb 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant