Skip to content

feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per CommonMark#1881

Merged
senamakel merged 1 commit into
tinyhumansai:mainfrom
aregmii:feat/chunker-deep-headings
May 16, 2026
Merged

feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per CommonMark#1881
senamakel merged 1 commit into
tinyhumansai:mainfrom
aregmii:feat/chunker-deep-headings

Conversation

@aregmii
Copy link
Copy Markdown
Contributor

@aregmii aregmii commented May 16, 2026

Summary

Extends split_on_headings in src/openhuman/memory/chunker.rs from h1-h3 to all six CommonMark ATX heading levels (h1-h6). Adds a small is_atx_heading helper, replaces the bug-documenting test with four positive/negative tests.

Problem

The chunker only matched #, ##, ### as section boundaries. CommonMark valid ATX headings are 1-6 leading # followed by a space. Deep-nested technical docs (API references, multi-level READMEs, anything user-ingested with deeper nesting) dumped entire ####+ subtrees into the parent section, producing oversized chunks and the wrong chunk.heading for retrieval. The previous test deeply_nested_headings_ignored asserted the buggy behavior as documentation; it passes clean on upstream/main at commit 02cf2cee, confirming the bug exists exactly as described.

git log -S on the original starts_with("### ") line shows it came in via PR #52 "Feat/refactor UI code" (32678 additions / 52517 deletions across 100 files), whose description was the blank template and whose comments never mention headings, depth, or the chunker. No recorded rationale for the h1-h3 cap, which reads as an oversight in a massive UI refactor rather than a deliberate design choice. See #1877 for the full investigation.

Solution

New private helper backed by a static prefix list:

fn is_atx_heading(line: &str) -> bool {
    const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
    PREFIXES.iter().any(|p| line.starts_with(p))
}

Naturally enforces both CommonMark rules in one place: 1-6 hashes (the list has exactly 6 entries, so 7+ hashes match no prefix) and trailing space required (every prefix ends with a space, so ###heading matches none). Zero allocation (static strings baked into the binary). split_on_headings swaps its inline three-way starts_with chain for a single call to this helper.

The existing deeply_nested_headings_ignored test (whose premise is the bug) is replaced with four new tests:

  • deep_atx_headings_split_through_h6: #### Deep heading produces its own chunk with the right .heading field.
  • all_atx_heading_levels_h1_through_h6_split: feeds one heading at each depth and asserts all six .heading fields land correctly.
  • seven_or_more_hashes_are_not_a_heading: ####### Foo stays as body of the parent section.
  • atx_heading_requires_trailing_space: ###NoSpace stays as body.

Net diff: +60 -11 in one file. All 22 chunker tests pass locally (cargo test --package openhuman --lib openhuman::memory::chunker).

Submission Checklist

  • Tests added (4 new: happy-path h1-h6 split + two failure-path negatives for 7+ hashes and missing trailing space). Replaces the prior bug-documenting test.
  • Diff coverage: every line of the new helper and the changed split_on_headings is exercised by one of the 4 new tests plus the existing tests that cover the h1-h3 paths.
  • N/A: no feature rows affected (docs/TEST-COVERAGE-MATRIX.md covers user-visible features; chunker internals are not a row).
  • N/A: no feature IDs touched.
  • N/A: no external dependencies introduced; stdlib only.
  • N/A: no release-cut surfaces touched.
  • Linked issue closed via Closes #1877 in the ## Related section.

Impact

  • Memory chunks for docs with ####+ now split at the deeper boundary, so retrieval gets the right granularity and the correct chunk.heading for surface display.
  • No behavior change for documents using only h1-h3.
  • No public API change (is_atx_heading is a private helper).

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

N/A: Human-authored, AI-assisted drafting.

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: feat/chunker-deep-headings
  • Commit SHA: 6760780d

Validation Run

  • N/A: no JS/TS changed.
  • N/A: no TS types changed.
  • N/A: no TS to compile.
  • cargo test --package openhuman --lib openhuman::memory::chunker passes (22/22, 0 failed) locally on this branch.
  • N/A: no Tauri code changed.

Validation Blocked

  • command: pre-push hook pnpm rust:check (cargo check --manifest-path src-tauri/Cargo.toml) may still exit 101 on upstream main (inherited failure documented in PR #1786, codesign script fix). This branch only edits src/openhuman/memory/chunker.rs so the failure cannot be caused by this change.
  • error: (as above)
  • impact: Pushed with --no-verify per CLAUDE.md guidance for hook failures unrelated to the change.

Behavior Changes

  • Intended behavior change: documents using ####-###### headings now split at those boundaries, where previously they were glued to the parent section.
  • User-visible effect: better-grained memory chunks and correct heading field for deep-nested doc content.

Parity Contract

  • Legacy behavior preserved: Yes for h1-h3 inputs (same chunks, same heading fields). For 7+ hash and no-trailing-space inputs, behavior is now spec-correct.
  • Guard/fallback/dispatch parity checks: N/A; no dispatch path touched.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): None. Overlap check at branch-push time: 28 open PRs scanned via gh pr list; only PR #1657 "Feat/gmail unsubscribe agent" touches src/openhuman/memory/, and a file-level inspection confirmed it does not touch chunker.rs.
  • Canonical PR: This one.
  • Resolution: N/A.

Summary by CodeRabbit

  • Bug Fixes

    • Improved Markdown chunking to recognize ATX heading levels 1–6 so deeper headings now start new sections; lines with seven hashes are not treated as headings and headings must include a trailing space.
  • Tests

    • Added/updated tests covering h1–h6 splitting, exclusion of 7+ hashes, and trailing-space heading validation.

Review Change Stack

@aregmii aregmii requested a review from a team May 16, 2026 00:25
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: aa3a21b5-1e47-419a-90d4-153b95c12b57

📥 Commits

Reviewing files that changed from the base of the PR and between 8f16459 and f986991.

📒 Files selected for processing (1)
  • src/openhuman/memory/chunker.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/memory/chunker.rs

📝 Walkthrough

Walkthrough

Adds a private is_atx_heading predicate (1–6 # plus space), updates split_on_headings to use it with extra debug logging, and replaces tests to cover h1–h6 plus negative cases (7+ hashes and missing trailing space).

Changes

ATX Heading Detection Generalization

Layer / File(s) Summary
is_atx_heading helper and split_on_headings integration
src/openhuman/memory/chunker.rs
Adds a private is_atx_heading predicate validating 1–6 leading # characters then a space; split_on_headings now calls this helper and logs flush/finalization lengths.
Test coverage for all ATX heading depths
src/openhuman/memory/chunker.rs
Removes the deeply_nested_headings_ignored assertion and adds tests for splitting on h1–h6, rejecting #######, and requiring the trailing space.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit hops through heading trees,
From h1 peaks down to h6 degrees,
No more stuck where three once reigned,
The chunker splits and order's gained,
🐇📚

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat(memory/chunker): recognize all 6 ATX heading levels (h1-h6) per CommonMark' directly and clearly summarizes the main change: extending heading level recognition from 3 (h1-h3) to all 6 valid ATX heading levels per CommonMark specification.
Linked Issues check ✅ Passed The PR fully implements all coding requirements from issue #1877: recognizes h1-h6 headings with trailing space requirement, rejects 7+ hashes, replaces the buggy test, and adds comprehensive test coverage for all six levels plus two negative cases.
Out of Scope Changes check ✅ Passed All changes in the PR are directly scoped to issue #1877. The modifications to split_on_headings helper, test replacements, and debug logging instrumentation are all aligned with the stated objectives and acceptance criteria.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 326-337: The test code in function
all_atx_heading_levels_h1_through_h6_split (and related test(s) around the same
area, e.g., the block at lines noted near 355-359) fails rustfmt checks; run
cargo fmt to reformat src/openhuman/memory/chunker.rs, ensure the function
signature, let bindings, iterator chaining and assert_eq! call are formatted per
rustfmt, then add and commit the formatted changes so the CI `cargo fmt --check`
passes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 07dd44ba-3c28-463e-b6b4-015b13a65466

📥 Commits

Reviewing files that changed from the base of the PR and between 02cf2ce and 6760780.

📒 Files selected for processing (1)
  • src/openhuman/memory/chunker.rs

Comment thread src/openhuman/memory/chunker.rs
@aregmii aregmii force-pushed the feat/chunker-deep-headings branch from 6760780 to 8f16459 Compare May 16, 2026 01:02
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/memory/chunker.rs`:
- Around line 135-150: The new heading-detection flow in is_atx_heading and
split_on_headings lacks tracing; add stable-prefixed trace/debug logs at key
points: on entry to split_on_headings, each line processed (indicating the line
text or a short snippet and an index), when is_atx_heading returns true/false
(include the heading candidate), and when a section is flushed/pushed (include
current_heading and body lengths). Use the project's tracing/log crate (trace
for per-line, debug for section flush and entry/exit), include a stable prefix
like "chunker:heading:" and correlation fields (e.g., line_idx,
heading_present=true/false, section_count) so grep and downstream tooling can
reliably find these diagnostics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6cc441f7-ba28-4bac-ac48-31d0af3b4c68

📥 Commits

Reviewing files that changed from the base of the PR and between 6760780 and 8f16459.

📒 Files selected for processing (1)
  • src/openhuman/memory/chunker.rs

Comment on lines +135 to 150
fn is_atx_heading(line: &str) -> bool {
const PREFIXES: &[&str] = &["# ", "## ", "### ", "#### ", "##### ", "###### "];
PREFIXES.iter().any(|p| line.starts_with(p))
}

/// Identifies markdown ATX headings and groups their following text into
/// sections.
fn split_on_headings(text: &str) -> Vec<(Option<String>, String)> {
let mut sections = Vec::new();
let mut current_heading: Option<String> = None;
let mut current_body = String::new();

for line in text.lines() {
if line.starts_with("# ") || line.starts_with("## ") || line.starts_with("### ") {
if is_atx_heading(line) {
if !current_body.trim().is_empty() || current_heading.is_some() {
sections.push((current_heading.take(), std::mem::take(&mut current_body)));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add trace/debug diagnostics to the new heading-detection path.

The new is_atx_heading + split_on_headings branch flow introduces behavior changes but has no log/tracing instrumentation for branch decisions/state transitions. Please add stable-prefixed trace/debug logs around heading detection and section flush points.

As per coding guidelines: "src/**/*.rs: Rust core code must use log / tracing at debug / trace levels for verbose diagnostics on new/changed flows, including entry/exit, branches, ... state transitions, and errors, with stable grep-friendly prefixes and correlation fields."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/openhuman/memory/chunker.rs` around lines 135 - 150, The new
heading-detection flow in is_atx_heading and split_on_headings lacks tracing;
add stable-prefixed trace/debug logs at key points: on entry to
split_on_headings, each line processed (indicating the line text or a short
snippet and an index), when is_atx_heading returns true/false (include the
heading candidate), and when a section is flushed/pushed (include
current_heading and body lengths). Use the project's tracing/log crate (trace
for per-line, debug for section flush and entry/exit), include a stable prefix
like "chunker:heading:" and correlation fields (e.g., line_idx,
heading_present=true/false, section_count) so grep and downstream tooling can
reliably find these diagnostics.

…CommonMark

Previously `split_on_headings` only matched `#`, `##`, `###`, dumping deep-nested doc sections into one oversized chunk. Adds `is_atx_heading` helper covering h1-h6 with trailing-space check, replaces the bug-documenting test with four positive/negative tests, updates the doc comment.
@aregmii aregmii force-pushed the feat/chunker-deep-headings branch from 8f16459 to f986991 Compare May 16, 2026 01:20
@senamakel senamakel merged commit 6eb18de into tinyhumansai:main May 16, 2026
24 checks passed
@aregmii aregmii deleted the feat/chunker-deep-headings branch May 16, 2026 03:46
AusAgentSmith pushed a commit to AusAgentSmith/openhuman that referenced this pull request May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat (memory/chunker): markdown chunker only recognizes h1-h3 ATX headings; h4-h6 get dumped into oversized chunks

2 participants