@echobt (Contributor) commented Feb 4, 2026

Summary

Fixes #5251 - Agent mention parsing panics on multi-byte UTF-8.

Problem

The mention parser uses direct byte-offset string slicing that can panic when the offset falls inside a multi-byte UTF-8 character.

The problematic pattern is direct slicing such as &text[..mention.start], where mention.start is a byte offset that could fall inside a multi-byte character.
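To make the failure mode concrete, here is a minimal sketch of the panic described above. The helper name `slice_panics` is hypothetical and exists only for this illustration; it is not part of the PR.

```rust
use std::panic;

/// Hypothetical helper for illustration: attempts `&text[..pos]` and
/// reports whether the slice expression panicked.
fn slice_panics(text: &str, pos: usize) -> bool {
    // catch_unwind lets the demonstration continue after the panic.
    panic::catch_unwind(|| {
        let _ = &text[..pos];
    })
    .is_err()
}

fn main() {
    let text = "日本語 @general task";

    // '日' occupies bytes 0..3, so byte 1 is not a char boundary.
    assert!(!text.is_char_boundary(1));
    assert!(slice_panics(text, 1));

    // Byte 3 is where '本' starts, so slicing there is fine.
    assert!(text.is_char_boundary(3));
    assert!(!slice_panics(text, 3));
}
```

The panic message is still printed to stderr by the default panic hook; only the unwinding is caught.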

Solution

Added two helper functions for safe UTF-8 boundary-aware string slicing:

  • safe_slice_up_to(text, pos) - Returns slice from start to valid boundary at or before pos
  • safe_slice_from(text, pos) - Returns slice from valid boundary at or after pos to end
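As a rough sketch of how the two helpers behave, the following condensed version follows the logic described in the bullets (search backward for the prefix, forward for the suffix). It is an illustration, not a copy of the code in the PR.

```rust
/// Slice from the start up to the nearest char boundary at or before `pos`.
fn safe_slice_up_to(text: &str, pos: usize) -> &str {
    if pos >= text.len() {
        return text;
    }
    let mut end = pos;
    // Offset 0 is always a char boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// Slice from the nearest char boundary at or after `pos` to the end.
fn safe_slice_from(text: &str, pos: usize) -> &str {
    if pos >= text.len() {
        return "";
    }
    let mut start = pos;
    // text.len() is always a char boundary, so this loop terminates.
    while !text.is_char_boundary(start) {
        start += 1;
    }
    &text[start..]
}

fn main() {
    // 'a' is byte 0, '日' is bytes 1..4, 'b' is byte 4.
    let text = "a日b";
    assert_eq!(safe_slice_up_to(text, 2), "a"); // rounded down to 1
    assert_eq!(safe_slice_from(text, 2), "b"); // rounded up to 4
    assert_eq!(safe_slice_up_to(text, 4), "a日"); // already a boundary
    assert_eq!(safe_slice_from(text, 1), "日b"); // already a boundary
}
```

Note that with an offset inside '日', neither result contains '日': the prefix rounds down past it and the suffix rounds up past it.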

Applied these functions to all three locations that previously used direct slicing:

  1. extract_mention_and_text - for building remaining text
  2. starts_with_mention - for checking prefix
  3. parse_message_for_agent - for checking if mention is at start

Testing

  • Added 8 new tests covering multi-byte UTF-8 scenarios:
    • Japanese characters (3-byte UTF-8)
    • Emoji (4-byte UTF-8)
    • Mixed ASCII, CJK, and emoji content
  • All 17 mention tests pass
  • Verified with cargo check -p cortex-agents
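For reference, the byte widths named in the test list can be checked with the standard library alone. The function `boundaries` below is a hypothetical helper written for this sketch; it is not one of the tests in the PR.

```rust
/// Hypothetical helper: every valid char boundary in `text`,
/// including the end of the string.
fn boundaries(text: &str) -> Vec<usize> {
    let mut offsets: Vec<usize> = text.char_indices().map(|(i, _)| i).collect();
    offsets.push(text.len());
    offsets
}

fn main() {
    // Encoded widths in UTF-8.
    assert_eq!('a'.len_utf8(), 1); // ASCII
    assert_eq!('日'.len_utf8(), 3); // CJK
    assert_eq!('😀'.len_utf8(), 4); // emoji

    // Mixed content: only these offsets are safe to slice at.
    assert_eq!(boundaries("a日😀"), vec![0, 1, 4, 8]);
}
```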

greptile-apps bot commented Feb 4, 2026

Greptile Overview

Greptile Summary

Adds UTF-8 boundary validation to prevent panics when slicing strings at byte offsets that fall inside multi-byte characters. Introduces two helper functions (safe_slice_up_to and safe_slice_from) that adjust offsets to the nearest character boundary.

Key changes:

  • Added safe_slice_up_to: searches backward for valid boundary
  • Added safe_slice_from: searches forward for valid boundary
  • Applied helpers to 3 slicing locations in mention parsing
  • Added 8 new tests covering Japanese, emoji, and mixed UTF-8

Critical concern: The implementation silently adjusts boundaries rather than failing fast. Since Rust's regex crate already guarantees UTF-8 safe offsets from Match::start() and Match::end(), boundary adjustment masks potential bugs elsewhere. If an invalid offset occurs, it indicates a real issue that should be surfaced with a clear panic message rather than hidden by silently skipping characters.

Confidence Score: 3/5

  • Safe to merge but implementation approach should be reconsidered
  • The code prevents panics and adds comprehensive test coverage, but silently adjusting UTF-8 boundaries can cause data loss (characters being skipped). Since the regex crate guarantees safe offsets, boundary violations indicate bugs that should be surfaced, not masked. The current approach trades correctness for robustness.
  • Review the semantic behavior of boundary adjustment in src/cortex-agents/src/mention.rs

Important Files Changed

| Filename | Overview |
|---|---|
| src/cortex-agents/src/mention.rs | Added UTF-8 safe slicing helpers to prevent panics on multi-byte characters. Implementation has semantic issues with boundary adjustment behavior. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as parse_message_for_agent
    participant Extract as extract_mention_and_text
    participant Find as find_first_valid_mention
    participant Regex as MENTION_REGEX
    participant SafeSlice as safe_slice_up_to/safe_slice_from
    
    User->>Parser: "日本語 @general task"
    Parser->>Extract: extract_mention_and_text(text, valid_agents)
    Extract->>Find: find_first_valid_mention(text, valid_agents)
    Find->>Regex: captures_iter(text)
    Regex-->>Find: AgentMention {start: 10, end: 18}
    Note over Regex: Regex guarantees UTF-8 safe offsets
    Find-->>Extract: Some(AgentMention)
    Extract->>SafeSlice: safe_slice_up_to(text, 10)
    Note over SafeSlice: Checks is_char_boundary(10)
    SafeSlice-->>Extract: "日本語 "
    Extract->>SafeSlice: safe_slice_from(text, 18)
    SafeSlice-->>Extract: " task"
    Extract-->>Parser: Some((mention, "日本語  task"))
    Parser->>SafeSlice: safe_slice_up_to(text, 10).trim().is_empty()
    SafeSlice-->>Parser: false
    Parser-->>User: ParsedAgentMessage {should_invoke_task: false}
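
The byte arithmetic for this input can be checked with a small sketch. It assumes the separator after 語 is a single ASCII space, and it uses `str::find` on a literal as a stand-in for the regex match; `split_around` is a hypothetical name, not a function from the PR.

```rust
/// Hypothetical stand-in for the regex step: locate a literal mention and
/// return (start, end, text with the mention removed).
fn split_around(text: &str, mention: &str) -> Option<(usize, usize, String)> {
    // str::find returns a byte offset that is always on a char boundary.
    let start = text.find(mention)?;
    let end = start + mention.len();
    let remaining = format!("{}{}", &text[..start], &text[end..]);
    Some((start, end, remaining))
}

fn main() {
    let text = "日本語 @general task";
    let (start, end, remaining) = split_around(text, "@general").unwrap();

    // 3 chars * 3 bytes + 1 space = 10; "@general" is 8 bytes long.
    assert_eq!((start, end), (10, 18));
    assert_eq!(remaining, "日本語  task");

    // The prefix is not blank, so the mention is not at the start.
    assert!(!text[..start].trim().is_empty());
}
```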

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 4 comments

Comment on lines +25 to +37
fn safe_slice_up_to(text: &str, pos: usize) -> &str {
    if pos >= text.len() {
        return text;
    }
    if text.is_char_boundary(pos) {
        return &text[..pos];
    }
    // Find the nearest valid boundary by searching backwards
    let mut valid_pos = pos;
    while valid_pos > 0 && !text.is_char_boundary(valid_pos) {
        valid_pos -= 1;
    }
    &text[..valid_pos]

Adjusting boundaries by searching backward/forward changes the semantic meaning of the slice. When mention.start or mention.end points inside a multi-byte character, this indicates the regex library returned invalid byte offsets.

The safer approach is to panic with a clear error message rather than silently returning incorrect data:

Suggested change
- fn safe_slice_up_to(text: &str, pos: usize) -> &str {
-     if pos >= text.len() {
-         return text;
-     }
-     if text.is_char_boundary(pos) {
-         return &text[..pos];
-     }
-     // Find the nearest valid boundary by searching backwards
-     let mut valid_pos = pos;
-     while valid_pos > 0 && !text.is_char_boundary(valid_pos) {
-         valid_pos -= 1;
-     }
-     &text[..valid_pos]
+ fn safe_slice_up_to(text: &str, pos: usize) -> &str {
+     if pos >= text.len() {
+         return text;
+     }
+     if text.is_char_boundary(pos) {
+         return &text[..pos];
+     }
+     panic!(
+         "Invalid UTF-8 boundary at position {} in string of length {}. \
+          This indicates a bug in the mention regex parsing.",
+         pos, text.len()
+     );
+ }

Rust's regex crate already guarantees UTF-8 safe offsets, so if this condition occurs it's a critical bug that should be surfaced, not masked.
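
Between silently adjusting and panicking there is a third option worth sketching: let the caller decide by returning an `Option`. The function names below are hypothetical and not part of the PR; the only library call used is `str::get`, which returns `None` when a range end is out of bounds or not on a char boundary.

```rust
/// Hypothetical checked variant: `None` instead of adjusting or panicking.
fn checked_slice_up_to(text: &str, pos: usize) -> Option<&str> {
    text.get(..pos)
}

/// Hypothetical checked variant for the suffix.
fn checked_slice_from(text: &str, pos: usize) -> Option<&str> {
    text.get(pos..)
}

fn main() {
    // Each character here is 3 bytes, so boundaries are 0, 3, 6, 9, 12, 15.
    let text = "こんにちは";

    assert_eq!(checked_slice_up_to(text, 3), Some("こ"));
    assert_eq!(checked_slice_from(text, 3), Some("んにちは"));

    // Byte 4 is inside 'ん': the invalid offset is reported, not hidden.
    assert_eq!(checked_slice_up_to(text, 4), None);
    assert_eq!(checked_slice_from(text, 4), None);
}
```

One difference from the helpers under review: `str::get` also returns `None` for offsets past the end of the string, whereas the PR's helpers clamp those to the whole string or to an empty string.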

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +45 to +57
fn safe_slice_from(text: &str, pos: usize) -> &str {
    if pos >= text.len() {
        return "";
    }
    if text.is_char_boundary(pos) {
        return &text[pos..];
    }
    // Find the nearest valid boundary by searching forwards
    let mut valid_pos = pos;
    while valid_pos < text.len() && !text.is_char_boundary(valid_pos) {
        valid_pos += 1;
    }
    &text[valid_pos..]

Same issue as safe_slice_up_to - adjusting boundaries forward can skip characters that should be included. Should panic instead:

Suggested change
- fn safe_slice_from(text: &str, pos: usize) -> &str {
-     if pos >= text.len() {
-         return "";
-     }
-     if text.is_char_boundary(pos) {
-         return &text[pos..];
-     }
-     // Find the nearest valid boundary by searching forwards
-     let mut valid_pos = pos;
-     while valid_pos < text.len() && !text.is_char_boundary(valid_pos) {
-         valid_pos += 1;
-     }
-     &text[valid_pos..]
+ fn safe_slice_from(text: &str, pos: usize) -> &str {
+     if pos >= text.len() {
+         return "";
+     }
+     if text.is_char_boundary(pos) {
+         return &text[pos..];
+     }
+     panic!(
+         "Invalid UTF-8 boundary at position {} in string of length {}. \
+          This indicates a bug in the mention regex parsing.",
+         pos, text.len()
+     );
+ }

Comment on lines +383 to +390
#[test]
fn test_safe_slice_from_multibyte() {
    let text = "こんにちは";
    assert_eq!(safe_slice_from(text, 3), "んにちは"); // Valid boundary
    // Position 4 is inside second character, should skip to position 6
    assert_eq!(safe_slice_from(text, 4), "にちは");
    assert_eq!(safe_slice_from(text, 5), "にちは");
}

This test demonstrates the semantic issue. Position 4 is inside "ん" (bytes 3-5), so calling safe_slice_from(text, 4) skips forward to byte 6, which loses the character "ん" entirely.

Expected: panic or error indicating invalid offset
Actual: silently skips character

If mention.end pointed to byte 4, the remaining text after the mention would incorrectly exclude "ん".
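
The byte layout behind this observation can be reproduced with a short sketch. Both function names are hypothetical; `next_boundary` only mirrors the forward search described above and is not the PR's code.

```rust
/// Hypothetical helper: each char with its half-open byte range [start, end).
fn char_ranges(text: &str) -> Vec<(char, usize, usize)> {
    text.char_indices()
        .map(|(i, c)| (c, i, i + c.len_utf8()))
        .collect()
}

/// Hypothetical helper: first char boundary at or after `pos`.
fn next_boundary(text: &str, pos: usize) -> usize {
    (pos..=text.len())
        .find(|&i| text.is_char_boundary(i))
        .unwrap_or(text.len())
}

fn main() {
    let text = "こんにちは";

    // 'ん' covers bytes 3, 4 and 5 (range 3..6).
    assert_eq!(char_ranges(text)[1], ('ん', 3, 6));

    // Rounding byte 4 forward lands on byte 6, the start of 'に'.
    let start = next_boundary(text, 4);
    assert_eq!(start, 6);

    // The resulting suffix no longer contains 'ん'.
    assert_eq!(&text[start..], "にちは");
    assert!(!text[start..].contains('ん'));
}
```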

greptile-apps bot commented Feb 4, 2026

Additional Comments (1)

src/cortex-agents/src/mention.rs
The regex crate's Match::start() and Match::end() already return UTF-8 safe byte offsets. If you're seeing panics with multi-byte characters, the root cause is likely elsewhere (e.g., string mutation between regex matching and slicing, or incorrect manual offset calculation).

Check the stack trace from issue #5251 to verify the panic actually originates from this code path.
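
One plausible form of "incorrect manual offset calculation" is using a character index where a byte offset is expected. The sketch below is only an illustration of that class of bug, with hypothetical function names; nothing here confirms that this is what happened in issue #5251.

```rust
/// Hypothetical buggy calculation: position of '@' counted in chars.
fn mention_char_index(text: &str) -> Option<usize> {
    text.chars().position(|c| c == '@')
}

/// Correct calculation: position of '@' as a byte offset.
fn mention_byte_offset(text: &str) -> Option<usize> {
    text.find('@')
}

fn main() {
    let text = "日本語 @general task";

    // Four chars precede '@', but they occupy ten bytes.
    assert_eq!(mention_char_index(text), Some(4));
    assert_eq!(mention_byte_offset(text), Some(10));

    // Treating the char index as a byte offset points inside '本' (bytes 3..6),
    // so `&text[..4]` would panic; `get` shows the same failure without panicking.
    assert!(!text.is_char_boundary(4));
    assert_eq!(text.get(..4), None);
    assert_eq!(text.get(..10), Some("日本語 "));
}
```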

Comment location: src/cortex-agents/src/mention.rs, lines 109 to 118


echobt added a commit that referenced this pull request Feb 4, 2026
This PR consolidates the following UTF-8 safety fixes:
- #31: Use safe UTF-8 slicing in import command base64 extraction
- #32: Use safe UTF-8 slicing for session IDs in notifications
- #33: Use char-aware string truncation for UTF-8 safety in resume
- #35: Use safe UTF-8 slicing for session IDs in lock command
- #37: Validate UTF-8 boundaries in mention parsing

All changes ensure safe string operations that respect UTF-8 boundaries:
- Replaced direct byte slicing with char-aware methods
- Added floor_char_boundary checks before slicing
- Prevents panics from slicing multi-byte characters
@echobt (Contributor, Author) commented Feb 4, 2026

Consolidated into #70 - fix: consolidated UTF-8 safety improvements for string slicing

@echobt echobt closed this Feb 4, 2026