feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per CommonMark#1881
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds a private ChangesATX Heading Detection Generalization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 326-337: The test code in function
all_atx_heading_levels_h1_through_h6_split (and related test(s) around the same
area, e.g., the block at lines noted near 355-359) fails rustfmt checks; run
cargo fmt to reformat src/openhuman/memory/chunker.rs, ensure the function
signature, let bindings, iterator chaining and assert_eq! call are formatted per
rustfmt, then add and commit the formatted changes so the CI `cargo fmt --check`
passes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 07dd44ba-3c28-463e-b6b4-015b13a65466
📒 Files selected for processing (1)
src/openhuman/memory/chunker.rs
6760780 to
8f16459
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 135-150: The new heading-detection flow in is_atx_heading and
split_on_headings lacks tracing; add stable-prefixed trace/debug logs at key
points: on entry to split_on_headings, each line processed (indicating the line
text or a short snippet and an index), when is_atx_heading returns true/false
(include the heading candidate), and when a section is flushed/pushed (include
current_heading and body lengths). Use the project's tracing/log crate (trace
for per-line, debug for section flush and entry/exit), include a stable prefix
like "chunker:heading:" and correlation fields (e.g., line_idx,
heading_present=true/false, section_count) so grep and downstream tooling can
reliably find these diagnostics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6cc441f7-ba28-4bac-ac48-31d0af3b4c68
📒 Files selected for processing (1)
src/openhuman/memory/chunker.rs
| fn is_atx_heading(line: &str) -> bool { | ||
| const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "]; | ||
| PREFIXES.iter().any(|p| line.starts_with(p)) | ||
| } | ||
|
|
||
| /// Identifies markdown ATX headings and groups their following text into | ||
| /// sections. | ||
| fn split_on_headings(text: &str) -> Vec<(Option<String>, String)> { | ||
| let mut sections = Vec::new(); | ||
| let mut current_heading: Option<String> = None; | ||
| let mut current_body = String::new(); | ||
|
|
||
| for line in text.lines() { | ||
| if line.starts_with("# ") || line.starts_with("## ") || line.starts_with("### ") { | ||
| if is_atx_heading(line) { | ||
| if !current_body.trim().is_empty() || current_heading.is_some() { | ||
| sections.push((current_heading.take(), std::mem::take(&mut current_body))); |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add trace/debug diagnostics to the new heading-detection path.
The new is_atx_heading + split_on_headings branch flow introduces behavior changes but has no log/tracing instrumentation for branch decisions/state transitions. Please add stable-prefixed trace/debug logs around heading detection and section flush points.
As per coding guidelines: "src/**/*.rs: Rust core code must use log / tracing at debug / trace levels for verbose diagnostics on new/changed flows, including entry/exit, branches, ... state transitions, and errors, with stable grep-friendly prefixes and correlation fields."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/memory/chunker.rs` around lines 135 - 150, The new
heading-detection flow in is_atx_heading and split_on_headings lacks tracing;
add stable-prefixed trace/debug logs at key points: on entry to
split_on_headings, each line processed (indicating the line text or a short
snippet and an index), when is_atx_heading returns true/false (include the
heading candidate), and when a section is flushed/pushed (include
current_heading and body lengths). Use the project's tracing/log crate (trace
for per-line, debug for section flush and entry/exit), include a stable prefix
like "chunker:heading:" and correlation fields (e.g., line_idx,
heading_present=true/false, section_count) so grep and downstream tooling can
reliably find these diagnostics.
…CommonMark Previously `split_on_headings` only matched `#`, `##`, `###`, dumping deep-nested doc sections into one oversized chunk. Adds `is_atx_heading` helper covering h1-h6 with trailing-space check, replaces the bug-documenting test with four positive/negative tests, updates the doc comment.
8f16459 to
f986991
Compare
Summary
Extends
split_on_headingsin src/openhuman/memory/chunker.rs from h1-h3 to all six CommonMark ATX heading levels (h1-h6). Adds a smallis_atx_headinghelper, replaces the bug-documenting test with four positive/negative tests.Problem
The chunker only matched
#,##,###as section boundaries. CommonMark valid ATX headings are 1-6 leading#followed by a space. Deep-nested technical docs (API references, multi-level READMEs, anything user-ingested with deeper nesting) dumped entire####+ subtrees into the parent section, producing oversized chunks and the wrongchunk.headingfor retrieval. The previous testdeeply_nested_headings_ignoredasserted the buggy behavior as documentation; it passes clean onupstream/mainat commit02cf2cee, confirming the bug exists exactly as described.git log -Son the originalstarts_with("### ")line shows it came in via PR #52 "Feat/refactor UI code" (32678 additions / 52517 deletions across 100 files), whose description was the blank template and whose comments never mention headings, depth, or the chunker. No recorded rationale for the h1-h3 cap, which reads as an oversight in a massive UI refactor rather than a deliberate design choice. See #1877 for the full investigation.Solution
New private helper backed by a static prefix list:
Naturally enforces both CommonMark rules in one place: 1-6 hashes (the list has exactly 6 entries, so 7+ hashes match no prefix) and trailing space required (every prefix ends with a space, so
###headingmatches none). Zero allocation (static strings baked into the binary).split_on_headingsswaps its inline three-waystarts_withchain for a single call to this helper.The existing
deeply_nested_headings_ignoredtest (whose premise is the bug) is replaced with four new tests:deep_atx_headings_split_through_h6:#### Deep headingproduces its own chunk with the right.headingfield.all_atx_heading_levels_h1_through_h6_split: feeds one heading at each depth and asserts all six.headingfields land correctly.seven_or_more_hashes_are_not_a_heading:####### Foostays as body of the parent section.atx_heading_requires_trailing_space:###NoSpacestays as body.Net diff: +60 -11 in one file. All 22 chunker tests pass locally (
cargo test --package openhuman --lib openhuman::memory::chunker).Submission Checklist
split_on_headingsis exercised by one of the 4 new tests plus the existing tests that cover the h1-h3 paths.docs/TEST-COVERAGE-MATRIX.mdcovers user-visible features; chunker internals are not a row).Closes #1877in the## Relatedsection.Impact
####+ now split at the deeper boundary, so retrieval gets the right granularity and the correctchunk.headingfor surface display.is_atx_headingis a private helper).Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
feat/chunker-deep-headings6760780dValidation Run
cargo test --package openhuman --lib openhuman::memory::chunkerpasses (22/22, 0 failed) locally on this branch.Validation Blocked
command:pre-push hookpnpm rust:check(cargo check --manifest-path src-tauri/Cargo.toml) may still exit 101 on upstreammain(inherited failure documented in PR #1786, codesign script fix). This branch only editssrc/openhuman/memory/chunker.rsso the failure cannot be caused by this change.error:(as above)impact:Pushed with--no-verifyperCLAUDE.mdguidance for hook failures unrelated to the change.Behavior Changes
####-######headings now split at those boundaries, where previously they were glued to the parent section.headingfield for deep-nested doc content.Parity Contract
headingfields). For 7+ hash and no-trailing-space inputs, behavior is now spec-correct.Duplicate / Superseded PR Handling
gh pr list; only PR #1657 "Feat/gmail unsubscribe agent" touchessrc/openhuman/memory/, and a file-level inspection confirmed it does not touchchunker.rs.Summary by CodeRabbit
Bug Fixes
Tests