feat(documents): add Section.markdown() accessor #833
Open
HonzaCuhel wants to merge 1 commit into
`Section.text()` walks the HTML subtree and emits newline-joined cell content: tables and bullet lists are flattened to space/newline soup, losing column structure and list markers. The whole-document `Filing.markdown()` (via `Document.to_markdown()`) preserves table pipe syntax and list markers but is whole-document only; there's no per-item slice.

This means per-item chunkers / RAG pipelines have to choose between:

- the cheap per-item path (`typed[item_name]` → `Section.text()`), which is item-aware but flattens; or
- the faithful whole-document path (`filing.markdown()`), which preserves structure but loses item boundaries.

`Section.markdown()` closes that gap for heading/pattern-detected sections: same scope as `text()` (one section) but the same renderer as `Document.to_markdown`, so tables and lists keep their syntax.

Implementation
--------------

- Heading/pattern-based sections: render the cached node tree directly via `MarkdownRenderer().render_node(self.node)`. Tables emit pipe syntax, bullet lists keep their markers. The same boundary-artifact cleanup that `text()` applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (e.g. `# Item 5`, `**Item 5\.**`, `# PART II\n\n# Item 5`).
- TOC-based sections (where `node.children` is empty and content is fetched lazily from the original HTML): fall back to `Section.text()`. Iterative codex review surfaced multiple edge cases for TOC-section HTML extraction (next-section heading leaks via nested anchors; shared-wrapper LCA computation; same-anchor start/end boundaries; inline anchor wrappers; last-section-in-wrapper leaks; table-row-bounded anchors losing `<table>`/`<tbody>` wrappers). Rather than ship a partial fix for any one of these, the TOC path is conservative and returns the same text the existing API already produces: no regression, no new correctness landmines. Full TOC markdown support is tracked as a follow-up.
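As a rough illustration of the markdown-aware cleanup described above, here is a minimal sketch. It is not the code in this PR: the helper name `strip_trailing_boundary` and the regular expression are hypothetical, and it assumes the artifact is a trailing run of next-section heading lines (`PART <roman>` or `Item <n>`), optionally decorated with `#` or `**` and with an optionally escaped period.

```python
import re

# Hypothetical pattern for ONE bleed-in line: an optional ATX heading prefix,
# optional bold markers, "PART II" or "Item 5"/"Item 1A", and an optional
# (possibly backslash-escaped) trailing period.
_BOUNDARY_LINE = (
    r"(?:#{1,6}[ \t]+)?(?:\*\*)?"
    r"(?:PART[ \t]+[IVX]+|Item[ \t]+\d+[A-Z]?)"
    r"(?:\\?\.)?(?:\*\*)?"
)

# One or more such lines, separated by blank space, anchored at end of text.
_TRAILING_ARTIFACT = re.compile(
    rf"(?:\s*^{_BOUNDARY_LINE}[ \t]*$)+\s*\Z",
    re.IGNORECASE | re.MULTILINE,
)


def strip_trailing_boundary(markdown: str) -> str:
    """Drop a trailing next-section heading that leaked into this section."""
    return _TRAILING_ARTIFACT.sub("", markdown).rstrip()


if __name__ == "__main__":
    for sample in (
        "Body text.\n\n# Item 5",
        "Body text.\n\n**Item 5\\.**",
        "Body text.\n\n# PART II\n\n# Item 5",
    ):
        print(repr(strip_trailing_boundary(sample)))
```

Because the pattern only matches whole lines at the very end of the string, a second pass finds nothing left to strip, which is the idempotency property the tests check for.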
Verification
------------

- 13 new unit tests cover table-pipe preservation, bullet-list preservation, the TOC fall-back contract, boundary-artifact cleanup in plain text + markdown-escaped + markdown-decorated + combined PART+Item forms, idempotency, and a negative control asserting `text()` does NOT preserve pipes (so the contrast is meaningful).
- 100/100 broader section-detection regression tests pass with `EDGAR_IDENTITY` set.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
`Section.text()` flattens tables and bullet lists; `Filing.markdown()` preserves structure but is whole-document. This adds `Section.markdown()`, the missing per-section markdown accessor, so per-item chunkers / RAG pipelines can get item-aware and structure-preserving markdown in one call.

Why
Real-world example: `tenq["Part II, Item 2"]` (Issuer Purchases of Equity Securities table) returns vertically-stacked cell text. `filing.markdown()` would render the same content as pipe-delimited markdown, but you lose the per-item slice. Today downstream consumers have to choose between item-awareness and structure preservation. `section.markdown()` removes the trade-off: one call returns markdown scoped to one section.

Implementation
- Heading/pattern-based sections: render the cached node tree via `MarkdownRenderer().render_node(self.node)`. Tables emit pipe syntax, bullet lists keep markers. The same boundary-artifact cleanup that `text()` applies runs on the rendered output, extended to handle markdown-decorated bleed-in artifacts (`# Item 5`, `**Item 5\.**`, `# PART II\n\n# Item 5`).
- TOC-based sections (where `node.children` is empty and content is fetched lazily from `_html_source`): fall back to `Section.text()`. Iterative review surfaced multiple correctness landmines in TOC-section HTML extraction: next-section heading leaks via nested anchors, shared-wrapper LCA computation, same-anchor start/end boundaries, inline anchor wrappers, last-section-in-wrapper leaks, and table-row-bounded anchors losing `<table>`/`<tbody>` wrappers. Rather than ship a partial fix for any one, the TOC path is conservative: no regression vs `text()`, no new landmines. Tracked as a follow-up.

Verification
- 13 new unit tests in `tests/test_section_markdown.py` cover: table-pipe preservation, bullet-list preservation, TOC fall-back contract, boundary-artifact cleanup across plain/escaped/heading-decorated/bold-decorated/combined-PART+Item variants, idempotency, and a negative control asserting `text()` does NOT preserve pipes (so the contrast is meaningful).
- 100/100 broader section-detection regression tests pass with `EDGAR_IDENTITY` set (`test_documents`, `test_10q_section_detection`, `test_sections_membership`, `test_section_detection_edge_cases`, `test_section_tables_toc_fix`).

Test plan
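The negative control from the Verification notes could be sketched roughly as follows. This is an illustrative toy, not a test from `tests/test_section_markdown.py`: `flatten` and `to_pipe_table` are hypothetical stand-ins for what `Section.text()` and `Section.markdown()` do to a table, and the table contents are made up.

```python
from typing import List

Table = List[List[str]]


def flatten(table: Table) -> str:
    """Stand-in for text(): one cell per line, column structure lost."""
    return "\n".join(cell for row in table for cell in row)


def to_pipe_table(table: Table) -> str:
    """Stand-in for markdown(): first row is the header, pipe-delimited."""
    header, *rows = table
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def test_markdown_keeps_pipes_but_text_does_not() -> None:
    # Made-up sample data, for illustration only.
    table = [["Period", "Shares purchased"], ["October", "1,200"]]

    # Negative control: the flattened form must NOT contain pipe syntax,
    # otherwise the positive assertion below proves nothing.
    assert "|" not in flatten(table)

    # Positive case: the markdown form keeps the header row intact.
    assert "| Period | Shares purchased |" in to_pipe_table(table)


if __name__ == "__main__":
    test_markdown_keeps_pipes_but_text_does_not()
```

Pairing the two assertions in one test keeps the contrast honest: if the flattened path ever started emitting pipes, the test would fail rather than pass without proving anything.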
🤖 Generated with Claude Code