feat(documents): add Section.markdown() accessor by HonzaCuhel · Pull Request #833 · dgunning/edgartools

HonzaCuhel · 2026-05-25T17:09:53Z

Summary

Section.text() flattens tables and bullet lists; Filing.markdown() preserves structure but is whole-document. This adds Section.markdown() — the missing per-section markdown accessor — so per-item chunkers / RAG pipelines can get item-aware and structure-preserving markdown in one call.

Why

Real-world example: tenq["Part II, Item 2"] (the Issuer Purchases of Equity Securities table) returns vertically stacked cell text. filing.markdown() would render the same content as pipe-delimited markdown, but you lose the per-item slice. Today downstream consumers have to choose between item-awareness and structure preservation.

section.markdown() removes the trade-off: one call returns markdown scoped to one section.
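To make the trade-off concrete, here is a toy sketch of the two renderings of the same table. It is not edgartools code: the helpers `flatten_cells` and `to_pipe_table` and the sample rows are made up for illustration, and the real output of `Section.text()` and `Filing.markdown()` may differ in detail.

```python
# Toy illustration only: these helpers are hypothetical and do not
# reproduce edgartools' actual rendering.
from typing import List

Row = List[str]


def flatten_cells(rows: List[Row]) -> str:
    """Emit one cell per line, the way a flattening text extractor might."""
    return "\n".join(cell for row in rows for cell in row)


def to_pipe_table(rows: List[Row]) -> str:
    """Render rows as a pipe-delimited markdown table; the first row is the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up sample data, shaped loosely like an issuer-repurchase table.
rows = [
    ["Period", "Shares purchased", "Average price"],
    ["Month 1", "100", "$10.00"],
    ["Month 2", "200", "$11.00"],
]

flat = flatten_cells(rows)    # column structure is gone
piped = to_pipe_table(rows)   # column structure survives as pipes
print(flat)
print(piped)
```

In the flattened form a chunker can no longer tell which number belongs to which column, which is the information a per-section markdown accessor is meant to keep.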

Implementation

Heading/pattern-based sections: render the cached node tree via MarkdownRenderer().render_node(self.node). Tables emit pipe syntax and bullet lists keep their markers. The same boundary-artifact cleanup that text() applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (# Item 5, **Item 5\.**, # PART II\n\n# Item 5).
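The cleanup described above could look roughly like the following sketch. It assumes the bleed-in artifacts sit on their own lines at the end of the rendered section; the function name `strip_trailing_heading` and the regular expression are hypothetical and are not the patterns used in this PR.

```python
# Hypothetical sketch of trailing boundary-artifact cleanup.
# Assumption: the next section's heading bleeds in on its own line(s)
# at the very end of the rendered markdown.
import re

# One heading-like line: optional "#" prefix, optional bold markers,
# "PART II" or "Item 5", optional (possibly escaped) trailing period.
_HEADING_LINE = (
    r"[ \t]*(?:#{1,6}[ \t]+)?(?:\*\*)?"
    r"(?:PART[ \t]+[IVX]+|Item[ \t]+\d+[A-Z]?)"
    r"(?:\\?\.)?(?:\*\*)?[ \t]*"
)

# One or more such lines, each preceded by a newline, anchored at the end.
_TRAILING = re.compile(r"(?:\n+" + _HEADING_LINE + r")+\s*\Z", re.IGNORECASE)


def strip_trailing_heading(markdown: str) -> str:
    """Drop next-section headings that leaked onto the end of a section."""
    return _TRAILING.sub("", markdown)


samples = [
    "Body text.\n\n# Item 5",
    "Body text.\n\n**Item 5\\.**",
    "Body text.\n\n# PART II\n\n# Item 5",
]
cleaned = [strip_trailing_heading(s) for s in samples]
print(cleaned)
```

Requiring a preceding newline keeps an inline mention such as "see Item 5" at the end of a sentence intact, and running the function twice gives the same result as running it once.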

TOC-based sections (where node.children is empty and content is fetched lazily from _html_source): fall back to Section.text(). Iterative codex review surfaced multiple correctness landmines in TOC-section HTML extraction: next-section heading leaks via nested anchors, shared-wrapper LCA computation, same-anchor start/end boundaries, inline anchor wrappers, last-section-in-wrapper leaks, and table-row-bounded anchors losing <table>/<tbody> wrappers. Rather than ship a partial fix for any one of these, the TOC path is conservative and returns the same text the existing API already produces: no regression vs text(), no new landmines. Full TOC markdown support is tracked as a follow-up.
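The two paths amount to a simple dispatch on whether the section has a populated node tree. The sketch below uses stand-in classes (`ToyNode`, `ToySection`, and the `lazy_text` field are all hypothetical) rather than the real `Section` and `MarkdownRenderer`, so it shows only the shape of the fall-back contract, not the actual implementation.

```python
# Hypothetical stand-ins; not edgartools' Section, Node, or MarkdownRenderer.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    """Stand-in for a parsed document node with pre-rendered markdown."""
    markdown: str = ""
    children: List["ToyNode"] = field(default_factory=list)


@dataclass
class ToySection:
    node: ToyNode
    lazy_text: str = ""  # stands in for text fetched lazily from the HTML source

    def text(self) -> str:
        return self.lazy_text

    def markdown(self) -> str:
        # TOC-style section: no cached children, so return exactly what
        # text() returns instead of attempting structural rendering.
        if not self.node.children:
            return self.text()
        # Heading/pattern-style section: render the cached node tree.
        return "\n\n".join(child.markdown for child in self.node.children)


toc_section = ToySection(node=ToyNode(), lazy_text="Period\n100")
heading_section = ToySection(
    node=ToyNode(children=[ToyNode("| Period | Shares |"), ToyNode("- bullet")])
)
print(toc_section.markdown())
print(heading_section.markdown())
```

The point of the conservative branch is that callers of `markdown()` on a TOC-style section never get anything worse than what `text()` already gives them.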

Verification

13 new unit tests in tests/test_section_markdown.py cover: table-pipe preservation, bullet-list preservation, TOC fall-back contract, boundary-artifact cleanup across plain/escaped/heading-decorated/bold-decorated/combined-PART+Item variants, idempotency, and a negative control asserting text() does NOT preserve pipes (so the contrast is meaningful).
100/100 broader section-detection regression tests still pass with EDGAR_IDENTITY set (test_documents, test_10q_section_detection, test_sections_membership, test_section_detection_edge_cases, test_section_tables_toc_fix).
11 rounds of codex review — each round caught a real edge case, all addressed except the documented TOC fall-back (which would re-open the whole TOC extraction class of issues).

Test plan

CI runs the full repo test suite
Maintainer confirms the API name + scope
If maintainer wants TOC structural rendering shipped together, can split into a follow-up PR with the LCA-walk approach (incomplete vs. table-row anchors but covers the more common cases)

🤖 Generated with Claude Code

`Section.text()` walks the HTML subtree and emits newline-joined cell content: tables and bullet lists are flattened to space/newline soup, losing column structure and list markers. The whole-document `Filing.markdown()` (via `Document.to_markdown()`) preserves table pipe syntax and list markers but is whole-document only; there's no per-item slice.

This means per-item chunkers / RAG pipelines have to choose between:

- the cheap per-item path (`typed[item_name]` → `Section.text()`) which is item-aware but flattens; or
- the faithful whole-document path (`filing.markdown()`) which preserves structure but loses item boundaries.

`Section.markdown()` closes that gap for heading/pattern-detected sections: same scope as `text()` (one section) but the same renderer as `Document.to_markdown` so tables and lists keep their syntax.

Implementation
--------------

- Heading/pattern-based sections: render the cached node tree directly via `MarkdownRenderer().render_node(self.node)`. Tables emit pipe syntax, bullet lists keep their markers. Same boundary-artifact cleanup that `text()` applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (e.g. `# Item 5`, `**Item 5\.**`, `# PART II\n\n# Item 5`).
- TOC-based sections (where `node.children` is empty and content is fetched lazily from the original HTML): fall back to `Section.text()`. Iterative codex review surfaced multiple edge cases for TOC-section HTML extraction (next-section heading leaks via nested anchors; shared-wrapper LCA computation; same-anchor start/end boundaries; inline anchor wrappers; last-section-in-wrapper leaks; table-row-bounded anchors losing `<table>`/`<tbody>` wrappers). Rather than ship a partial fix for any one of these, the TOC path is conservative and returns the same text the existing API already produces: no regression, no new correctness landmines. Full TOC markdown support is tracked as a follow-up.

Verification
------------

- 13 new unit tests cover table-pipe preservation, bullet-list preservation, the TOC fall-back contract, boundary-artifact cleanup in plain text + markdown-escaped + markdown-decorated + combined PART+Item forms, idempotency, and a negative control asserting `text()` does NOT preserve pipes (so the contrast is meaningful).
- 100/100 broader section-detection regression tests pass with `EDGAR_IDENTITY` set.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

HonzaCuhel wants to merge 1 commit into dgunning:main from HonzaCuhel:feat/section-markdown-accessor
