feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per CommonMark by aregmii · Pull Request #1881 · tinyhumansai/openhuman

aregmii · 2026-05-16T00:25:11Z

Summary

Extends split_on_headings in src/openhuman/memory/chunker.rs from h1-h3 to all six CommonMark ATX heading levels (h1-h6). Adds a small is_atx_heading helper and replaces the bug-documenting test with four positive/negative tests.

Problem

The chunker only matched #, ##, ### as section boundaries. In CommonMark, a valid ATX heading opens with 1-6 # characters followed by a space. Deep-nested technical docs (API references, multi-level READMEs, anything user-ingested with deeper nesting) therefore dumped entire ####+ subtrees into the parent section, producing oversized chunks and the wrong chunk.heading for retrieval. The previous test deeply_nested_headings_ignored asserted the buggy behavior as documentation; it passes cleanly on upstream/main at commit 02cf2cee, confirming the bug exists exactly as described.

git log -S on the original starts_with("### ") line shows it came in via PR #52 "Feat/refactor UI code" (32678 additions / 52517 deletions across 100 files), whose description was the blank template and whose comments never mention headings, depth, or the chunker. No recorded rationale for the h1-h3 cap, which reads as an oversight in a massive UI refactor rather than a deliberate design choice. See #1877 for the full investigation.

Solution

New private helper backed by a static prefix list:

fn is_atx_heading(line: &str) -> bool {
    const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
    PREFIXES.iter().any(|p| line.starts_with(p))
}

The helper enforces both CommonMark rules in one place: 1-6 hashes (the list has exactly 6 entries, so 7+ hashes match no prefix) and a required space after the hashes (every prefix ends with a space, so ###heading matches none). It allocates nothing (the prefixes are static strings baked into the binary). split_on_headings swaps its inline three-way starts_with chain for a single call to this helper.
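The boundary cases described above can be checked directly against the prefix list. The sketch below copies is_atx_heading from the PR description; the test module and its name boundary_cases are illustrative only and are not the tests added by this PR.

```rust
/// Copied from the PR description: true when the line starts with
/// 1-6 '#' characters immediately followed by a space.
fn is_atx_heading(line: &str) -> bool {
    const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
    PREFIXES.iter().any(|p| line.starts_with(p))
}

#[cfg(test)]
mod tests {
    use super::is_atx_heading;

    // Illustrative test, not one of the four tests from the PR.
    #[test]
    fn boundary_cases() {
        // Shallowest and deepest valid levels match.
        assert!(is_atx_heading("# Title"));
        assert!(is_atx_heading("###### Deepest"));
        // Seven hashes: the 7th character is '#', so "###### " is not a prefix.
        assert!(!is_atx_heading("####### Seven hashes"));
        // No space after the hashes: no prefix matches.
        assert!(!is_atx_heading("###NoSpace"));
        // The check is anchored at column 0, so indented lines do not match.
        assert!(!is_atx_heading("  ## Indented"));
        // Bare hashes with nothing after them do not match either.
        assert!(!is_atx_heading("##"));
    }
}
```

The last two assertions show where a plain prefix check is stricter than the full CommonMark rule, which also accepts up to three spaces of indentation and an empty heading such as a bare #. Whether the chunker should accept those forms is outside what this PR describes.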

The existing deeply_nested_headings_ignored test (whose premise is the bug) is replaced with four new tests:

deep_atx_headings_split_through_h6: #### Deep heading produces its own chunk with the right .heading field.
all_atx_heading_levels_h1_through_h6_split: feeds one heading at each depth and asserts all six .heading fields land correctly.
seven_or_more_hashes_are_not_a_heading: ####### Foo stays as body of the parent section.
atx_heading_requires_trailing_space: ###NoSpace stays as body.
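As a rough illustration of what these tests exercise, here is a minimal, self-contained sketch of the splitting loop. The opening of split_on_headings follows the diff quoted in the review thread; everything after the section flush (how the heading text is stored, how the body is accumulated, the final flush) is an assumption, since the PR page does not show it, and the test name is hypothetical.

```rust
fn is_atx_heading(line: &str) -> bool {
    const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
    PREFIXES.iter().any(|p| line.starts_with(p))
}

/// Sketch only: groups body text under the most recent ATX heading.
fn split_on_headings(text: &str) -> Vec<(Option<String>, String)> {
    let mut sections = Vec::new();
    let mut current_heading: Option<String> = None;
    let mut current_body = String::new();

    for line in text.lines() {
        if is_atx_heading(line) {
            // Flush the section collected so far (this part matches the diff).
            if !current_body.trim().is_empty() || current_heading.is_some() {
                sections.push((current_heading.take(), std::mem::take(&mut current_body)));
            }
            current_body.clear();
            // Assumption: the heading is stored without its '#' markers.
            current_heading = Some(line.trim_start_matches('#').trim().to_string());
        } else {
            // Assumption: body lines are joined with '\n'.
            current_body.push_str(line);
            current_body.push('\n');
        }
    }
    // Assumption: the trailing section is flushed with the same condition.
    if !current_body.trim().is_empty() || current_heading.is_some() {
        sections.push((current_heading, current_body));
    }
    sections
}

#[cfg(test)]
mod tests {
    use super::split_on_headings;

    // Hypothetical test combining the positive and negative cases.
    #[test]
    fn deep_heading_splits_and_invalid_lines_stay_in_body() {
        let text = "# Top\nintro\n#### Deep\ndetail\n####### Foo\n###NoSpace\n";
        let sections = split_on_headings(text);
        assert_eq!(sections.len(), 2);
        assert_eq!(sections[0].0.as_deref(), Some("Top"));
        assert_eq!(sections[1].0.as_deref(), Some("Deep"));
        // The 7-hash and no-space lines remain body text of the h4 section.
        assert!(sections[1].1.contains("####### Foo"));
        assert!(sections[1].1.contains("###NoSpace"));
    }
}
```

With the old three-way check, the same input would yield a single section headed Top, with the #### Deep line left inside its body.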

Net diff: +60 -11 in one file. All 22 chunker tests pass locally (cargo test --package openhuman --lib openhuman::memory::chunker).

Submission Checklist

Tests added (4 new: two happy-path h1-h6 split tests plus two failure-path negatives for 7+ hashes and missing trailing space). Replaces the prior bug-documenting test.
Diff coverage: every line of the new helper and the changed split_on_headings is exercised by one of the 4 new tests plus the existing tests that cover the h1-h3 paths.
N/A: no feature rows affected (docs/TEST-COVERAGE-MATRIX.md covers user-visible features; chunker internals are not a row).
N/A: no feature IDs touched.
N/A: no external dependencies introduced; stdlib only.
N/A: no release-cut surfaces touched.
Linked issue closed via Closes #1877 in the ## Related section.

Impact

Memory chunks for docs with ####+ now split at the deeper boundary, so retrieval gets the right granularity and the correct chunk.heading for surface display.
No behavior change for documents using only h1-h3.
No public API change (is_atx_heading is a private helper).

Related

Closes: feat (memory/chunker): markdown chunker only recognizes h1-h3 ATX headings; h4-h6 get dumped into oversized chunks #1877
Follow-ups already on the contribution roadmap (separate PRs): chunker single-long-line splitter, code-fence-aware splitting so deep headings inside a ``` block aren't misinterpreted.

AI Authored PR Metadata (required for Codex/Linear PRs)

N/A: Human-authored, AI-assisted drafting.

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: feat/chunker-deep-headings
Commit SHA: 6760780d

Validation Run

N/A: no JS/TS changed.
N/A: no TS types changed.
N/A: no TS to compile.
cargo test --package openhuman --lib openhuman::memory::chunker passes (22/22, 0 failed) locally on this branch.
N/A: no Tauri code changed.

Validation Blocked

command: pre-push hook pnpm rust:check (cargo check --manifest-path src-tauri/Cargo.toml) may still exit 101 on upstream main (inherited failure documented in PR #1786, codesign script fix). This branch only edits src/openhuman/memory/chunker.rs so the failure cannot be caused by this change.
error: (as above)
impact: Pushed with --no-verify per CLAUDE.md guidance for hook failures unrelated to the change.

Behavior Changes

Intended behavior change: documents using ####-###### headings now split at those boundaries, where previously they were glued to the parent section.
User-visible effect: better-grained memory chunks and correct heading field for deep-nested doc content.

Parity Contract

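Legacy behavior preserved: Yes for h1-h3 inputs (same chunks, same heading fields). Inputs with 7+ hashes or no space after the hashes were already left as body text by the old starts_with chain and still are; the new tests pin that spec-correct behavior down.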
Guard/fallback/dispatch parity checks: N/A; no dispatch path touched.
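The parity claim can be sanity-checked by running the old and new predicates side by side. In this sketch, old_is_heading is a hypothetical name wrapping the three-way starts_with expression from the removed diff line, and the sample lines are made up for illustration.

```rust
/// Hypothetical wrapper around the pre-PR inline check (h1-h3 only).
fn old_is_heading(line: &str) -> bool {
    line.starts_with("# ") || line.starts_with("## ") || line.starts_with("### ")
}

/// The helper introduced by this PR (h1-h6).
fn is_atx_heading(line: &str) -> bool {
    const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
    PREFIXES.iter().any(|p| line.starts_with(p))
}

#[cfg(test)]
mod tests {
    use super::{is_atx_heading, old_is_heading};

    #[test]
    fn old_and_new_agree_except_on_h4_to_h6() {
        // h1-h3: both predicates accept, so legacy inputs split identically.
        for line in ["# a", "## a", "### a"].iter() {
            assert!(old_is_heading(line) && is_atx_heading(line));
        }
        // h4-h6: the only inputs where the two predicates differ.
        for line in ["#### a", "##### a", "###### a"].iter() {
            assert!(!old_is_heading(line) && is_atx_heading(line));
        }
        // 7+ hashes, missing space, plain text: both predicates reject.
        for line in ["####### a", "###NoSpace", "plain text"].iter() {
            assert!(!old_is_heading(line) && !is_atx_heading(line));
        }
    }
}
```

On these samples the two predicates differ only for h4-h6 lines, which is the intended behavior change.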

Duplicate / Superseded PR Handling

Duplicate PR(s): None. Overlap check at branch-push time: 28 open PRs scanned via gh pr list; only PR #1657 "Feat/gmail unsubscribe agent" touches src/openhuman/memory/, and a file-level inspection confirmed it does not touch chunker.rs.
Canonical PR: This one.
Resolution: N/A.

Summary by CodeRabbit

Bug Fixes
- Improved Markdown chunking to recognize ATX heading levels 1–6 so deeper headings now start new sections; lines with seven or more hashes are not treated as headings, and a heading must have a space after its hashes.
Tests
- Added/updated tests covering h1–h6 splitting, exclusion of 7+ hashes, and trailing-space heading validation.

coderabbitai · 2026-05-16T00:25:24Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aa3a21b5-1e47-419a-90d4-153b95c12b57

📥 Commits

Reviewing files that changed from the base of the PR and between 8f16459 and f986991.

📒 Files selected for processing (1)

src/openhuman/memory/chunker.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/openhuman/memory/chunker.rs

📝 Walkthrough

Adds a private is_atx_heading predicate (1–6 # plus space), updates split_on_headings to use it with extra debug logging, and replaces tests to cover h1–h6 plus negative cases (7+ hashes and missing trailing space).

Changes

ATX Heading Detection Generalization

Layer / File(s)	Summary
is_atx_heading helper and split_on_headings integration `src/openhuman/memory/chunker.rs`	Adds a private `is_atx_heading` predicate validating 1–6 leading `#` characters then a space; `split_on_headings` now calls this helper and logs flush/finalization lengths.
Test coverage for all ATX heading depths `src/openhuman/memory/chunker.rs`	Removes the `deeply_nested_headings_ignored` assertion and adds tests for splitting on h1–h6, rejecting `#######`, and requiring the trailing space.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through heading trees,
From h1 peaks down to h6 degrees,
No more stuck where three once reigned,
The chunker splits and order's gained,
🐇📚

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per CommonMark' directly and clearly summarizes the main change: extending heading level recognition from 3 (h1-h3) to all 6 valid ATX heading levels per CommonMark specification.
Linked Issues check	✅ Passed	The PR fully implements all coding requirements from issue `#1877`: recognizes h1-h6 headings with trailing space requirement, rejects 7+ hashes, replaces the buggy test, and adds comprehensive test coverage for all six levels plus two negative cases.
Out of Scope Changes check	✅ Passed	All changes in the PR are directly scoped to issue `#1877`. The modifications to split_on_headings helper, test replacements, and debug logging instrumentation are all aligned with the stated objectives and acceptance criteria.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 326-337: The test code in function
all_atx_heading_levels_h1_through_h6_split (and related test(s) around the same
area, e.g., the block at lines noted near 355-359) fails rustfmt checks; run
cargo fmt to reformat src/openhuman/memory/chunker.rs, ensure the function
signature, let bindings, iterator chaining and assert_eq! call are formatted per
rustfmt, then add and commit the formatted changes so the CI `cargo fmt --check`
passes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 07dd44ba-3c28-463e-b6b4-015b13a65466

📥 Commits

Reviewing files that changed from the base of the PR and between 02cf2ce and 6760780.

📒 Files selected for processing (1)

src/openhuman/memory/chunker.rs

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 135-150: The new heading-detection flow in is_atx_heading and
split_on_headings lacks tracing; add stable-prefixed trace/debug logs at key
points: on entry to split_on_headings, each line processed (indicating the line
text or a short snippet and an index), when is_atx_heading returns true/false
(include the heading candidate), and when a section is flushed/pushed (include
current_heading and body lengths). Use the project's tracing/log crate (trace
for per-line, debug for section flush and entry/exit), include a stable prefix
like "chunker:heading:" and correlation fields (e.g., line_idx,
heading_present=true/false, section_count) so grep and downstream tooling can
reliably find these diagnostics.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6cc441f7-ba28-4bac-ac48-31d0af3b4c68

📥 Commits

Reviewing files that changed from the base of the PR and between 6760780 and 8f16459.

📒 Files selected for processing (1)

src/openhuman/memory/chunker.rs

coderabbitai · 2026-05-16T01:05:01Z

+fn is_atx_heading(line: &str) -> bool {
+    const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
+    PREFIXES.iter().any(|p| line.starts_with(p))
+}
+
+/// Identifies markdown ATX headings and groups their following text into
+/// sections.
 fn split_on_headings(text: &str) -> Vec<(Option<String>, String)> {
    let mut sections = Vec::new();
    let mut current_heading: Option<String> = None;
    let mut current_body = String::new();

    for line in text.lines() {
-        if line.starts_with("# ") || line.starts_with("## ") || line.starts_with("### ") {
+        if is_atx_heading(line) {
            if !current_body.trim().is_empty() || current_heading.is_some() {
                sections.push((current_heading.take(), std::mem::take(&mut current_body)));


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add trace/debug diagnostics to the new heading-detection path.

The new is_atx_heading + split_on_headings branch flow introduces behavior changes but has no log/tracing instrumentation for branch decisions/state transitions. Please add stable-prefixed trace/debug logs around heading detection and section flush points.

As per coding guidelines: "src/**/*.rs: Rust core code must use log / tracing at debug / trace levels for verbose diagnostics on new/changed flows, including entry/exit, branches, ... state transitions, and errors, with stable grep-friendly prefixes and correlation fields."

…CommonMark Previously `split_on_headings` only matched `#`, `##`, `###`, dumping deep-nested doc sections into one oversized chunk. Adds `is_atx_heading` helper covering h1-h6 with trailing-space check, replaces the bug-documenting test with four positive/negative tests, updates the doc comment.

…CommonMark (tinyhumansai#1881)

aregmii requested a review from a team May 16, 2026 00:25

coderabbitai Bot requested changes May 16, 2026

Comment thread src/openhuman/memory/chunker.rs

aregmii force-pushed the feat/chunker-deep-headings branch from 6760780 to 8f16459 Compare May 16, 2026 01:02

coderabbitai Bot requested changes May 16, 2026

aregmii force-pushed the feat/chunker-deep-headings branch from 8f16459 to f986991 Compare May 16, 2026 01:20

senamakel merged commit 6eb18de into tinyhumansai:main May 16, 2026
24 checks passed

aregmii deleted the feat/chunker-deep-headings branch May 16, 2026 03:46

AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026

feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per …

537404f

…CommonMark (tinyhumansai#1881)
