fix(documents): preserve span-wrapped text in StreamingParser #830
kevinchiu wants to merge 1 commit into
The streaming HTML parser silently dropped text from <span>-wrapped paragraphs on filings that crossed streaming_threshold (default 10MB). For SEC filings in the ~30MB–110MB band, which routinely nest every word inside style-bearing <span> tags, filing.text() returned output 20%+ shorter than the non-streaming path with no exception or warning.
Two compounding bugs in the iterparse loop:
1. elem.clear() ran on every event (both start and end). At start events, lxml's HTML-mode lookahead has populated child elements and their text; clearing at start destroyed that data before any handler could read it.
2. elem.clear() ran on every element regardless of whether an enclosing structural element (<p>, <h1>-<h6>, <section>) had finished reading its children. iterparse fires end events depth-first, so a child <span>'s end event cleared its .text and .tail before the parent <p>'s end event called _get_text_content(p). The pre-existing _table_depth gate already protected <table> from the same defect.
Fix: clear only on end events, and gate clearing on a new _content_depth counter that tracks open p/h1-h6/section elements (mirroring _table_depth).
Regression test exercises the SEC pattern of span-wrapped paragraph text under forced streaming mode.
dgunning
left a comment
Strong fix, Kevin — diagnosis is precise (separating the start-event clearing from the missing content-depth gate as two distinct bugs is sharp), the _content_depth pattern mirrors _table_depth cleanly, and the cover-page recovery on Stepstone is convincing evidence the span bug was real.
One thing I want to understand before merging: your production table shows streaming-mode output now exceeds non-streaming by 25–67%.
| Filing | Pre-fix streaming | Non-streaming | Post-fix streaming | Δ vs non-streaming |
|---|---|---|---|---|
| Stepstone 10-K | 1,140,129 | 1,429,000 | 1,816,872 | +27% |
| Stepstone 20-F | — | 1,610,414 | 2,007,281 | +25% |
| 20-F 0001104659… | — | 2,001,895 | 3,347,578 | +67% |
| 20-F 0001193125… | — | 1,779,974 | 2,350,296 | +32% |
The span bug explains why pre-fix streaming was below non-streaming. It doesn't explain why post-fix streaming is above it. Three possibilities I can think of:
- Pre-existing divergence between paths (different whitespace/tail handling) that was masked while streaming was losing content
- Non-streaming has its own separate content-loss bug — possibly span-related at a different scale
- Streaming is now over-including: sibling .tail accumulating twice, or buffer flush interacting with the deferred clear
Have you compared the actual content (not just length) between the two paths on one of these filings? The +67% on 0001104659… is large enough that I'd want to know whether streaming is now correct and non-streaming is buggy, or vice versa, or both paths have different (defensible) semantics.
The regression test asserts content presence in both paths but doesn't compare lengths or do a content diff — adding a length-comparison assertion (or a diff on a known fixture) would lock in whichever interpretation is correct.
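A rough sketch of what such a comparison could look like, using only plain strings. The helper name diff_summary and the sample strings are hypothetical and not part of the PR; the idea is that whitespace-normalized token counts separate a pure whitespace/tail divergence (length differs, tokens match) from duplicated or dropped content (token counts differ).

```python
from collections import Counter


def diff_summary(streaming_text: str, baseline_text: str) -> dict:
    """Compare two text extractions by length and by whitespace-split tokens."""
    s_tokens = streaming_text.split()
    b_tokens = baseline_text.split()
    s_counts, b_counts = Counter(s_tokens), Counter(b_tokens)
    # Counter subtraction keeps only positive counts, so these are the
    # tokens that appear more often on one side than on the other.
    extra = s_counts - b_counts
    missing = b_counts - s_counts
    return {
        "raw_length_delta": len(streaming_text) - len(baseline_text),
        "normalized_length_delta": len(" ".join(s_tokens)) - len(" ".join(b_tokens)),
        "extra_tokens": sum(extra.values()),
        "missing_tokens": sum(missing.values()),
    }


# Whitespace-only divergence: raw lengths differ, token content is identical.
whitespace_only = diff_summary("FORM  10-K\n Annual report", "FORM 10-K Annual report")

# Over-inclusion: the streaming side repeats two tokens.
duplicated = diff_summary("FORM 10-K FORM 10-K Annual report", "FORM 10-K Annual report")
```

On a real filing, a large extra_tokens count with missing_tokens near zero would point at over-inclusion on the streaming side, while the reverse would point at loss in the non-streaming path.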
Not blocking the fix to the span bug — that's clearly the right call regardless. Just want to understand the overshoot before declaring streaming "fixed."
Symptom
filing.text() silently returns less text than expected for filings that cross ParserConfig.streaming_threshold (default 10MB), with no exception and no warning.
Concrete measurement, Stepstone 10-K (0001193125-26-128890, 42.7 MB raw HTML): the streaming path was dropping the entire cover-page block, including "UNITED STATES SECURITIES AND EXCHANGE COMMISSION / FORM 10-K / For the fiscal year ended…", because every line of that block is nested inside style-bearing <span> tags.
A minimal SEC-style snippet (each word wrapped in <span style="…">) reproduces the same failure mode without network: streaming drops every <p> entirely and keeps only <h*> text.
Root cause
Two compounding bugs in the iterparse loop in edgar/documents/utils/streaming.py::StreamingParser.parse:
1. elem.clear() ran on every event (both start and end). At start events, lxml's HTML-mode lookahead has already populated child elements and their .text/.tail; structural handlers such as _start_heading read those at start time. Clearing on start destroyed that data before any handler could read it.
2. No content-depth gate around child clearing. iterparse fires end events depth-first, so a child <span>'s end event ran elem.clear() (which wipes .text and .tail in lxml) before the enclosing <p>'s end event called _get_text_content(p). Since SEC filings nest essentially every word inside <span style="…">, _end_paragraph saw only empty children and produced empty paragraph text. The pre-existing _table_depth gate already protected <table> from the identical defect; this just extends the same idea to the other structural containers.
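The second bug can be illustrated in isolation. This is a minimal sketch, not the parser's code: it assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (its iterparse also fires end events depth-first, and its Element.clear() also resets .text and .tail), and the helper names and the style value are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def span_wrap(words):
    """Build an SEC-style paragraph with every word in its own styled <span>."""
    spans = " ".join(f'<span style="font-weight:bold">{w}</span>' for w in words)
    return f"<p>{spans}</p>"


def paragraph_text_with_eager_clear(markup):
    """Clear every element on its end event, then read <p> text at its end event."""
    order = []
    text = None
    source = io.BytesIO(markup.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        order.append(elem.tag)
        if elem.tag == "p":
            # By now both child spans have already been cleared.
            text = "".join(elem.itertext())
        elem.clear()  # wipes children, attributes, .text and .tail
    return order, text


order, text = paragraph_text_with_eager_clear(span_wrap(["FORM", "10-K"]))
print(order)       # ['span', 'span', 'p']
print(repr(text))  # ''
```

The children report their end events first, so by the time the paragraph is read there is nothing left to collect.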
Fix
Clear only on end events, and gate clearing on a new _content_depth counter that tracks open <p>/<h1>–<h6>/<section> elements (mirroring _table_depth). This defers child cleanup until the enclosing structural element has read its subtree.
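The gating idea can be sketched as follows. This is an illustration under stated assumptions rather than the actual change to StreamingParser.parse: it uses the standard library's xml.etree.ElementTree instead of lxml, and extract_paragraphs, CONTENT_TAGS, and the gate flag are hypothetical names. The gate flag exists only to show the old and new behaviour side by side.

```python
import io
import xml.etree.ElementTree as ET

# Structural containers whose subtree must stay intact until they close.
CONTENT_TAGS = {"p", "section", "h1", "h2", "h3", "h4", "h5", "h6"}

SAMPLE = (
    "<body>"
    "<h1><span>FORM</span> <span>10-K</span></h1>"
    "<p><span>UNITED</span> <span>STATES</span></p>"
    "</body>"
)


def extract_paragraphs(markup, gate=True):
    """Collect <p> text while clearing elements to bound memory use."""
    content_depth = 0
    paragraphs = []
    source = io.BytesIO(markup.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag in CONTENT_TAGS:
                content_depth += 1
            continue  # never clear on start events
        if elem.tag in CONTENT_TAGS:
            content_depth -= 1
            if elem.tag == "p":
                paragraphs.append("".join(elem.itertext()).strip())
        # Clear only when no structural container is still open.
        if not gate or content_depth == 0:
            elem.clear()
    return paragraphs


gated = extract_paragraphs(SAMPLE, gate=True)     # ['UNITED STATES']
ungated = extract_paragraphs(SAMPLE, gate=False)  # ['']
```

With nested containers (a <p> inside a <section>), this sketch releases memory only when the outermost container closes, which trades a larger transient subtree for correct text.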
Regression test
tests/test_html_parser_regressions.py::TestStreamingParserRegressions::test_streaming_preserves_span_wrapped_paragraph_text uses a forced-streaming ParserConfig(streaming_threshold=1) against SEC-style span-wrapped HTML, asserts that all paragraph and heading content survives, and cross-checks against the non-streaming baseline.
Fails on main; passes with this change.
Verification
uv run pytest tests/test_html_parser*.py: 68 passed, 3 skipped.
End-to-end check on each of the four problem filings reported in production. Streaming-path filing.text() length after the fix, with the non-streaming baseline alongside for reference:
- 0001193125-26-128890
- 0001193125-26-177617
- 0001104659-26-044493
- 0001193125-26-183398
All four return non-empty text on the streaming path, and the streaming output begins with the expected SEC cover-page text on the Stepstone 10-K (previously truncated to body-only content).