fix(documents): preserve span-wrapped text in StreamingParser by kevinchiu · Pull Request #830 · dgunning/edgartools

kevinchiu · 2026-05-23T02:50:27Z

Symptom

filing.text() silently returns less text than expected for filings
that cross ParserConfig.streaming_threshold (default 10MB), with no
exception and no warning.

Concrete measurement, Stepstone 10-K (0001193125-26-128890, 42.7 MB raw HTML):

| Path | Text chars |
| --- | --- |
| streaming (before this fix) | 1,140,129 |
| non-streaming baseline | 1,429,000 |
| streaming (after this fix) | 1,816,872 |

The streaming path was dropping the entire cover-page block — including
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION / FORM 10-K /
For the fiscal year ended…" — because every line of that block is
nested inside style-bearing <span> tags.

A minimal SEC-style snippet (each word wrapped in <span style="…">)
reproduces the same failure mode without network: streaming drops every
<p> entirely and keeps only <h*> text.
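A fixture of that shape is easy to generate. The sketch below is illustrative only: the helper names `span_wrap` and `build_fixture`, the style string, and the choice to leave the heading unwrapped are assumptions, not the fixture used in this PR.

```python
from html import escape


def span_wrap(text: str, style: str = "font-family:Times New Roman") -> str:
    """Wrap every whitespace-separated word in its own styled <span>."""
    return " ".join(
        f'<span style="{escape(style)}">{escape(word)}</span>'
        for word in text.split()
    )


def build_fixture() -> str:
    """Return a small document: one plain heading plus span-wrapped paragraphs."""
    paragraphs = [
        "UNITED STATES SECURITIES AND EXCHANGE COMMISSION",
        "FORM 10-K",
    ]
    body = "".join(f"<p>{span_wrap(p)}</p>" for p in paragraphs)
    return f"<html><body><h1>Cover Page</h1>{body}</body></html>"


FIXTURE = build_fixture()
```

Feeding `FIXTURE` through both parser paths and comparing the paragraph text is then enough to see whether span-wrapped words survive.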

Root cause

Two compounding bugs in the iterparse loop in
edgar/documents/utils/streaming.py::StreamingParser.parse:

1. elem.clear() ran on every event (both start and end). At
   start events, lxml's HTML-mode lookahead has already populated child
   elements and their .text/.tail; structural handlers such as
   _start_heading read those at start time. Clearing on start
   destroyed that data before any handler could read it.
2. No content-depth gate around child clearing. iterparse fires
   end events depth-first, so a child <span>'s end event ran
   elem.clear() (which wipes .text and .tail in lxml) before the
   enclosing <p>'s end event called _get_text_content(p). Since
   SEC filings nest essentially every word inside <span style="…">,
   _end_paragraph saw only empty children and produced empty paragraph
   text. The pre-existing _table_depth gate already protected
   <table> from the identical defect; this just extends the same
   idea to the other structural containers.
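The second bug (children cleared before the parent reads them) can be reproduced in isolation. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml, so it is an analogy and not the edgartools code: it assumes well-formed markup, and the function name is hypothetical. ElementTree's iterparse also delivers end events depth-first, and its Element.clear() also resets text, tail, and children.

```python
import io
import xml.etree.ElementTree as ET

DOC = b"<body><p><span>FORM</span> <span>10-K</span></p></body>"


def paragraph_text_clearing_every_end(data: bytes) -> list[str]:
    """Clear each element on its own end event, with no depth gate."""
    texts = []
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag == "p":
            # Both <span> children already had their end events and were
            # cleared, so there is no text or tail left to collect here.
            texts.append("".join(elem.itertext()))
        elem.clear()  # resets .text, .tail, attributes, and children
    return texts


print(paragraph_text_clearing_every_end(DOC))  # prints ['']
```

The paragraph is reported, but its text is empty, which matches the silent-loss symptom: no exception, just missing content.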

Fix

Clear only on end events, and gate clearing on a new _content_depth
counter that tracks open <p> / <h1>–<h6> / <section> elements
(mirroring _table_depth). Defers child cleanup until the enclosing
structural element has read its subtree.
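The gating idea can be sketched with the same standard-library stand-in. This is not the code in this PR: `content_depth` plays the role of _content_depth, `CONTENT_TAGS` and the function name are hypothetical, and xml.etree.ElementTree stands in for lxml.

```python
import io
import xml.etree.ElementTree as ET

CONTENT_TAGS = {"p", "section", "h1", "h2", "h3", "h4", "h5", "h6"}
DOC = (
    b"<body><p><span>FORM</span> <span>10-K</span></p>"
    b"<p><span>Annual</span></p></body>"
)


def paragraph_text_with_depth_gate(data: bytes) -> list[str]:
    """Clear only on end events, and only when no content container is open."""
    texts = []
    content_depth = 0  # number of currently open structural containers
    for event, elem in ET.iterparse(io.BytesIO(data), events=("start", "end")):
        if event == "start":
            if elem.tag in CONTENT_TAGS:
                content_depth += 1
            continue  # never clear on start events
        if elem.tag in CONTENT_TAGS:
            content_depth -= 1
            if elem.tag == "p":
                # Children are still intact because clearing was deferred.
                texts.append("".join(elem.itertext()))
        if content_depth == 0:
            # No enclosing container still needs this subtree.
            elem.clear()
    return texts


print(paragraph_text_with_depth_gate(DOC))  # prints ['FORM 10-K', 'Annual']
```

The trade-off in this sketch is memory: a subtree stays alive until its outermost content container closes, so a very large <section> is held in full until its end event.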

Regression test

tests/test_html_parser_regressions.py::TestStreamingParserRegressions::test_streaming_preserves_span_wrapped_paragraph_text —
uses a forced-streaming ParserConfig(streaming_threshold=1) against
SEC-style span-wrapped HTML, asserts that all paragraph and heading
content survives, and cross-checks against the non-streaming baseline.
Fails on main; passes with this change.

Verification

uv run pytest tests/test_html_parser*.py — 68 passed, 3 skipped.

End-to-end check on each of the four problem filings reported in
production. Streaming-path filing.text() length after the fix, with
the non-streaming baseline alongside for reference:

| Filing | Raw HTML | Streaming after fix (chars) | Non-streaming (chars) |
| --- | --- | --- | --- |
| Stepstone 10-K `0001193125-26-128890` | 42.7 MB | 1,816,872 | 1,429,000 |
| Stepstone 20-F `0001193125-26-177617` | 35.8 MB | 2,007,281 | 1,610,414 |
| 20-F `0001104659-26-044493` | 39.5 MB | 3,347,578 | 2,001,895 |
| 20-F `0001193125-26-183398` | 31.2 MB | 2,350,296 | 1,779,974 |

All four return non-empty text on the streaming path, and the streaming
output begins with the expected SEC cover-page text on the Stepstone
10-K (previously truncated to body-only content).

The streaming HTML parser silently dropped text from <span>-wrapped paragraphs on filings that crossed streaming_threshold (default 10MB). For SEC filings in the ~30MB–110MB band, which routinely nest every word inside style-bearing <span> tags, filing.text() returned output 20%+ shorter than the non-streaming path with no exception or warning.

Two compounding bugs in the iterparse loop:

1. elem.clear() ran on every event (both start and end). At start events, lxml's HTML-mode lookahead has populated child elements and their text; clearing at start destroyed that data before any handler could read it.
2. elem.clear() ran on every element regardless of whether an enclosing structural element (<p>, <h1>-<h6>, <section>) had finished reading its children. iterparse fires end events depth-first, so a child <span>'s end event cleared its .text and .tail before the parent <p>'s end event called _get_text_content(p). The pre-existing _table_depth gate already protected <table> from the same defect.

Fix: clear only on end events, and gate clearing on a new _content_depth counter that tracks open p/h1-h6/section elements (mirroring _table_depth).

Regression test exercises the SEC pattern of span-wrapped paragraph text under forced streaming mode.

dgunning

Strong fix, Kevin — diagnosis is precise (separating the start-event clearing from the missing content-depth gate as two distinct bugs is sharp), the _content_depth pattern mirrors _table_depth cleanly, and the cover-page recovery on Stepstone is convincing evidence the span bug was real.

One thing I want to understand before merging: your production table shows streaming-mode output now exceeds non-streaming by 25–67%.

| Filing | Pre-fix streaming | Non-streaming | Post-fix streaming | Δ vs non-streaming |
| --- | --- | --- | --- | --- |
| Stepstone 10-K | 1,140,129 | 1,429,000 | 1,816,872 | +27% |
| Stepstone 20-F | n/a | 1,610,414 | 2,007,281 | +25% |
| 20-F `0001104659…` | n/a | 2,001,895 | 3,347,578 | +67% |
| 20-F `0001193125…` | n/a | 1,779,974 | 2,350,296 | +32% |
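For reference, the deltas follow directly from the character counts reported in the PR description. A small sketch of the arithmetic (the helper name `pct_delta` is just for illustration):

```python
# (non-streaming, post-fix streaming) character counts from the PR description
counts = {
    "Stepstone 10-K 0001193125-26-128890": (1_429_000, 1_816_872),
    "Stepstone 20-F 0001193125-26-177617": (1_610_414, 2_007_281),
    "20-F 0001104659-26-044493": (2_001_895, 3_347_578),
    "20-F 0001193125-26-183398": (1_779_974, 2_350_296),
}


def pct_delta(baseline: int, value: int) -> int:
    """Percentage by which value exceeds baseline, rounded to a whole percent."""
    return round((value - baseline) / baseline * 100)


deltas = {name: pct_delta(*pair) for name, pair in counts.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta}%")
```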

The span bug explains why pre-fix streaming was below non-streaming. It doesn't explain why post-fix streaming is above it. Three possibilities I can think of:

1. Pre-existing divergence between paths (different whitespace/tail handling) that was masked while streaming was losing content.
2. Non-streaming has its own separate content-loss bug, possibly span-related at a different scale.
3. Streaming is now over-including: sibling .tail accumulating twice, or buffer flush interacting with the deferred clear.
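One cheap probe for the over-inclusion possibility is to count repeated lines in each path's output. A rough sketch, assuming both outputs are available as plain strings; the function name and the `min_len` cutoff are arbitrary choices for illustration:

```python
from collections import Counter


def repeated_lines(text: str, min_len: int = 40) -> dict[str, int]:
    """Return whitespace-normalized lines that occur more than once."""
    normalized = (" ".join(line.split()) for line in text.splitlines())
    counts = Counter(line for line in normalized if len(line) >= min_len)
    return {line: n for line, n in counts.items() if n > 1}
```

Lines that repeat in the streaming output but not in the non-streaming output would point toward double emission. Filings do contain legitimate repeats (running headers, boilerplate legends), so this is a hint rather than proof.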

Have you compared the actual content (not just length) between the two paths on one of these filings? The +67% on 0001104659… is large enough that I'd want to know whether streaming is now correct and non-streaming is buggy, or vice versa, or both paths have different (defensible) semantics.

The regression test asserts content presence in both paths but doesn't compare lengths or do a content diff — adding a length-comparison assertion (or a diff on a known fixture) would lock in whichever interpretation is correct.
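A content diff along these lines could be sketched with the standard library's difflib. The function names are hypothetical, and the inputs are assumed to be the two filing.text() results as plain strings:

```python
import difflib


def normalize(text: str) -> list[str]:
    """Collapse whitespace runs and drop blank lines so layout does not count."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return [line for line in lines if line]


def only_in_streaming(streaming_text: str, baseline_text: str) -> list[str]:
    """Return normalized lines present in the streaming output but not the baseline."""
    a, b = normalize(baseline_text), normalize(streaming_text)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    extra = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            extra.extend(b[j1:j2])
    return extra
```

Swapping the arguments gives the lines found only in the non-streaming output, which helps separate "streaming over-includes" from "non-streaming loses content". SequenceMatcher can be slow on inputs with millions of characters, so a trimmed fixture is likely a better target for a test than a full filing.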

Not blocking the fix to the span bug — that's clearly the right call regardless. Just want to understand the overshoot before declaring streaming "fixed."

dgunning reviewed May 26, 2026

kevinchiu wants to merge 1 commit into dgunning:main from kevinchiu:fix/streaming-parser-empty-content
