feat(documents): add Section.markdown() accessor #833

Open
HonzaCuhel wants to merge 1 commit into dgunning:main from HonzaCuhel:feat/section-markdown-accessor

@HonzaCuhel
Contributor

Summary

Section.text() flattens tables and bullet lists; Filing.markdown() preserves structure but only renders the whole document. This PR adds Section.markdown(), the missing per-section markdown accessor, so per-item chunkers and RAG pipelines can get item-aware, structure-preserving markdown in one call.

Why

Real-world example: tenq["Part II, Item 2"] (the Issuer Purchases of Equity Securities table) returns vertically stacked cell text. filing.markdown() would render the same content as a pipe-delimited markdown table, but you lose the per-item slice. Today, downstream consumers have to choose between item awareness and structure preservation.

section.markdown() removes the trade-off: one call returns markdown scoped to one section.
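To make the trade-off concrete, here is a minimal, self-contained sketch of the two output shapes. It is not edgartools code: `flatten_cells`, `render_pipe_table`, and the sample rows are hypothetical stand-ins that only mimic what flattened text and pipe-delimited markdown look like for the same table.

```python
from typing import List

Row = List[str]


def flatten_cells(rows: List[Row]) -> str:
    """Mimic flattened text output: one cell per line, column structure lost."""
    return "\n".join(cell for row in rows for cell in row)


def render_pipe_table(rows: List[Row]) -> str:
    """Render rows as a pipe-delimited markdown table; the first row is the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up sample data, for illustration only.
rows = [
    ["Period", "Shares purchased", "Average price"],
    ["January", "1,000", "$10.00"],
    ["February", "2,500", "$11.25"],
]

flat = flatten_cells(rows)        # nine lines, no way to tell columns apart
table = render_pipe_table(rows)   # header, separator, two data rows
```

A chunker fed `flat` cannot tell which number belongs to which column; one fed `table` can.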

Implementation

Heading/pattern-based sections: render the cached node tree via MarkdownRenderer().render_node(self.node). Tables emit pipe syntax, and bullet lists keep their markers. The same boundary-artifact cleanup that text() applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (# Item 5, **Item 5\.**, # PART II\n\n# Item 5).
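The cleanup step can be sketched roughly as follows. This is an illustration, not the code in this PR: it assumes the artifact is the next section's heading trailing at the end of the rendered output, and `strip_trailing_boundary` and `_BOUNDARY_LINE` are hypothetical names.

```python
import re

# Hypothetical pattern for a single bleed-in line: "PART <roman>" or "Item <n>",
# optionally wrapped in a markdown heading marker and/or bold, with an optional
# (possibly backslash-escaped) trailing period and nothing else on the line.
_BOUNDARY_LINE = re.compile(
    r"^\s*(?:#+\s*)?(?:\*\*)?\s*"
    r"(?:PART\s+[IVX]+|Item\s+\d+[A-Z]?)"
    r"\s*(?:\\?\.)?\s*(?:\*\*)?\s*$",
    re.IGNORECASE,
)


def strip_trailing_boundary(markdown: str) -> str:
    """Drop trailing blank lines and trailing next-section heading lines."""
    lines = markdown.rstrip().split("\n")
    while lines and (not lines[-1].strip() or _BOUNDARY_LINE.match(lines[-1])):
        lines.pop()
    return "\n".join(lines)


body = "| Period | Shares |\n| --- | --- |\n| January | 1,000 |"
variants = [
    body + "\n\nItem 5.",                 # plain
    body + "\n\n# Item 5",                # heading-decorated
    body + "\n\n**Item 5\\.**",           # bold-decorated, escaped period
    body + "\n\n# PART II\n\n# Item 5",   # combined PART + Item
]
cleaned = [strip_trailing_boundary(v) for v in variants]
```

Because the function only pops lines from the end and stops at the first real content line, running it twice gives the same result as running it once. A line such as `Item 1A. Risk Factors` is left alone, since the pattern requires the line to end right after the item number.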

TOC-based sections (where node.children is empty and content is fetched lazily from _html_source): fall back to Section.text(). Iterative review surfaced multiple correctness landmines in TOC-section HTML extraction: next-section heading leaks via nested anchors, shared-wrapper LCA computation, same-anchor start/end boundaries, inline anchor wrappers, last-section-in-wrapper leaks, and table-row-bounded anchors losing their <table>/<tbody> wrappers. Rather than ship a partial fix for any one of these, the TOC path is conservative: it returns exactly what text() returns, so there is no regression and no new landmines. Full TOC markdown support is tracked as a follow-up.
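The dispatch between the two paths can be sketched with stand-in classes. `Node` and `SectionSketch` below are hypothetical, and the two callables stand in for MarkdownRenderer().render_node and the existing text() extraction; the sketch only shows the fall-back contract described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Stand-in for a parsed document node (not the edgartools class)."""
    children: List["Node"] = field(default_factory=list)


@dataclass
class SectionSketch:
    node: Node
    render_node: Callable[[Node], str]  # stand-in for the markdown renderer
    plain_text: Callable[[], str]       # stand-in for the existing text() logic

    def text(self) -> str:
        return self.plain_text()

    def markdown(self) -> str:
        # TOC-based section: no cached child nodes, so return text() unchanged.
        if not self.node.children:
            return self.text()
        # Heading/pattern-based section: render the cached node tree.
        return self.render_node(self.node)


toc_section = SectionSketch(Node(), lambda n: "| a | b |", lambda: "a\nb")
heading_section = SectionSketch(
    Node(children=[Node()]), lambda n: "| a | b |", lambda: "a\nb"
)
```

Here `toc_section.markdown()` is identical to `toc_section.text()`, while `heading_section.markdown()` goes through the renderer.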

Verification

  • 13 new unit tests in tests/test_section_markdown.py cover: table-pipe preservation, bullet-list preservation, TOC fall-back contract, boundary-artifact cleanup across plain/escaped/heading-decorated/bold-decorated/combined-PART+Item variants, idempotency, and a negative control asserting text() does NOT preserve pipes (so the contrast is meaningful).
  • 100/100 broader section-detection regression tests still pass with EDGAR_IDENTITY set (test_documents, test_10q_section_detection, test_sections_membership, test_section_detection_edge_cases, test_section_tables_toc_fix).
  • 11 rounds of codex review. Each round caught a real edge case; all are addressed except the documented TOC fall-back (fixing it would re-open the whole TOC-extraction class of issues).

Test plan

  • CI runs the full repo test suite
  • Maintainer confirms the API name + scope
  • If the maintainer wants structural rendering for TOC sections as well, it can go into a follow-up PR using the LCA-walk approach (incomplete for table-row anchors, but it covers the more common cases)

🤖 Generated with Claude Code

`Section.text()` walks the HTML subtree and emits newline-joined cell
content, so tables and bullet lists are flattened to space/newline soup,
losing column structure and list markers. `Filing.markdown()` (via
`Document.to_markdown()`) preserves table pipe syntax and list markers
but renders only the whole document, so there is no per-item slice.

This means per-item chunkers / RAG pipelines have to choose between:

- the cheap per-item path (`typed[item_name]` → `Section.text()`) which
  is item-aware but flattens; or
- the faithful whole-document path (`filing.markdown()`) which preserves
  structure but loses item boundaries.

`Section.markdown()` closes that gap for heading/pattern-detected
sections: same scope as `text()` (one section) but the same renderer
as `Document.to_markdown` so tables and lists keep their syntax.

Implementation
--------------
- Heading/pattern-based sections: render the cached node tree directly
  via `MarkdownRenderer().render_node(self.node)`. Tables emit pipe
  syntax, bullet lists keep their markers. Same boundary-artifact
  cleanup that `text()` applies runs on the rendered output, extended
  to handle markdown-decorated bleed-in artifacts (e.g. `# Item 5`,
  `**Item 5\.**`, `# PART II\n\n# Item 5`).
- TOC-based sections (where `node.children` is empty and content is
  fetched lazily from the original HTML): fall back to `Section.text()`.
  Iterative codex review surfaced multiple edge cases for TOC-section
  HTML extraction (next-section heading leaks via nested anchors;
  shared-wrapper LCA computation; same-anchor start/end boundaries;
  inline anchor wrappers; last-section-in-wrapper leaks; table-row-
  bounded anchors losing `<table>`/`<tbody>` wrappers). Rather than
  ship a partial fix for any one of these, the TOC path is conservative
  and returns the same text the existing API already produces — no
  regression, no new correctness landmines. Full TOC markdown support
  is tracked as a follow-up.

Verification
------------
- 13 new unit tests cover table-pipe preservation, bullet-list
  preservation, the TOC fall-back contract, boundary-artifact cleanup
  in plain text + markdown-escaped + markdown-decorated + combined
  PART+Item forms, idempotency, and a negative control asserting
  `text()` does NOT preserve pipes (so the contrast is meaningful).
- 100/100 broader section-detection regression tests pass with
  `EDGAR_IDENTITY` set.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>