fix(documents): preserve span-wrapped text in StreamingParser #830

Open: kevinchiu wants to merge 1 commit into dgunning:main from kevinchiu:fix/streaming-parser-empty-content

@kevinchiu
Contributor

Symptom

filing.text() silently returns less text than expected for filings
that cross ParserConfig.streaming_threshold (default 10MB), with no
exception and no warning.

Concrete measurement, Stepstone 10-K (0001193125-26-128890, 42.7 MB raw HTML):

| Path | Text chars |
| --- | --- |
| streaming (before this fix) | 1,140,129 |
| non-streaming baseline | 1,429,000 |
| streaming (after this fix) | 1,816,872 |

The streaming path was dropping the entire cover-page block — including
"UNITED STATES SECURITIES AND EXCHANGE COMMISSION / FORM 10-K /
For the fiscal year ended…" — because every line of that block is
nested inside style-bearing <span> tags.

A minimal SEC-style snippet (each word wrapped in <span style="…">)
reproduces the same failure mode without network: streaming drops every
<p> entirely and keeps only <h*> text.

Root cause

Two compounding bugs in the iterparse loop in
edgar/documents/utils/streaming.py::StreamingParser.parse:

  1. elem.clear() ran on every event (both start and end). At
    start events, lxml's HTML-mode lookahead has already populated child
    elements and their .text/.tail; structural handlers such as
    _start_heading read those at start time. Clearing on start
    destroyed that data before any handler could read it.

  2. No content-depth gate around child clearing. iterparse fires
    end events depth-first, so a child <span>'s end event ran
    elem.clear() (which wipes .text and .tail in lxml) before the
    enclosing <p>'s end event called _get_text_content(p). Since
    SEC filings nest essentially every word inside <span style="…">,
    _end_paragraph saw only empty children and produced empty paragraph
    text. The pre-existing _table_depth gate already protected
    <table> from the identical defect — this just extends the same
    idea to the other structural containers.
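
The second bug can be reproduced outside the library. The sketch below is not the StreamingParser code: it swaps in the standard library's xml.etree.ElementTree.iterparse (whose Element.clear() also resets .text and .tail) on a tiny well-formed snippet, and paragraph_text is a hypothetical helper introduced only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: a tiny, well-formed stand-in for SEC-style span-wrapped markup.
SNIPPET = b"<div><p><span>UNITED</span> <span>STATES</span></p></div>"


def paragraph_text(clear_every_element: bool) -> str:
    """Return the text that the <p> end handler can still read."""
    seen = ""
    for _event, elem in ET.iterparse(io.BytesIO(SNIPPET), events=("end",)):
        if elem.tag == "p":
            # The enclosing element reads its subtree at its own end event.
            seen = "".join(elem.itertext())
        if clear_every_element:
            # End events fire child-first, so each <span> is wiped
            # (text and tail) before the <p> handler above runs.
            elem.clear()
    return seen


print(repr(paragraph_text(clear_every_element=True)))
print(repr(paragraph_text(clear_every_element=False)))
```

With unconditional clearing the paragraph handler reads an empty string; without it, the handler reads the full "UNITED STATES" text.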

Fix

Clear only on end events, and gate clearing on a new _content_depth
counter that tracks open <p> / <h1>-<h6> / <section> elements
(mirroring _table_depth). This defers child cleanup until the enclosing
structural element has read its subtree.
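
A minimal sketch of the depth-gate idea, assuming the standard library's xml.etree.ElementTree.iterparse in place of the lxml-based loop. CONTENT_TAGS, extract_paragraphs, and the local content_depth variable are hypothetical names for illustration; the actual change lives in StreamingParser.parse and may differ in detail.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: the structural containers named in the fix description.
CONTENT_TAGS = {"p", "section", "h1", "h2", "h3", "h4", "h5", "h6"}


def extract_paragraphs(source: bytes) -> list[str]:
    """Collect <p> text while clearing elements only outside content containers."""
    content_depth = 0  # number of currently open structural containers
    paragraphs: list[str] = []
    for event, elem in ET.iterparse(io.BytesIO(source), events=("start", "end")):
        if event == "start":
            if elem.tag in CONTENT_TAGS:
                content_depth += 1
            continue  # never clear on start events
        if elem.tag in CONTENT_TAGS:
            content_depth -= 1
            if elem.tag == "p":
                # Children are still intact because clearing was deferred.
                paragraphs.append("".join(elem.itertext()))
        if content_depth == 0:
            # No enclosing container still needs this subtree, so free it.
            elem.clear()
    return paragraphs


SAMPLE = (
    b"<div><section><p><span>FORM</span> <span>10-K</span></p></section>"
    b"<p><span>For</span> <span>the</span> <span>fiscal</span> <span>year</span></p></div>"
)
print(extract_paragraphs(SAMPLE))
```

The trade-off in this sketch is that memory is released per outermost container rather than per element, so a very large <section> keeps its subtree alive until its own end event.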

Regression test

tests/test_html_parser_regressions.py::TestStreamingParserRegressions::test_streaming_preserves_span_wrapped_paragraph_text
uses a forced-streaming ParserConfig(streaming_threshold=1) against
SEC-style span-wrapped HTML, asserts that all paragraph and heading
content survives, and cross-checks against the non-streaming baseline.
Fails on main; passes with this change.

Verification

uv run pytest tests/test_html_parser*.py — 68 passed, 3 skipped.

End-to-end check on each of the four problem filings reported in
production. Streaming-path filing.text() length after the fix, with
the non-streaming baseline alongside for reference:

| Filing | Raw HTML | Streaming after fix | Non-streaming |
| --- | --- | --- | --- |
| Stepstone 10-K 0001193125-26-128890 | 42.7 MB | 1,816,872 | 1,429,000 |
| Stepstone 20-F 0001193125-26-177617 | 35.8 MB | 2,007,281 | 1,610,414 |
| 20-F 0001104659-26-044493 | 39.5 MB | 3,347,578 | 2,001,895 |
| 20-F 0001193125-26-183398 | 31.2 MB | 2,350,296 | 1,779,974 |

All four return non-empty text on the streaming path, and the streaming
output begins with the expected SEC cover-page text on the Stepstone
10-K (previously truncated to body-only content).

The streaming HTML parser silently dropped text from <span>-wrapped
paragraphs on filings that crossed streaming_threshold (default 10MB).
For SEC filings in the ~30MB–110MB band — which routinely nest every
word inside style-bearing <span> tags — filing.text() returned output
20%+ shorter than the non-streaming path with no exception or warning.

Two compounding bugs in the iterparse loop:

1. elem.clear() ran on every event (both start and end). At start
   events, lxml's HTML-mode lookahead has populated child elements
   and their text; clearing at start destroyed that data before any
   handler could read it.

2. elem.clear() ran on every element regardless of whether an
   enclosing structural element (<p>, <h1>-<h6>, <section>) had
   finished reading its children. iterparse fires end events
   depth-first, so a child <span>'s end event cleared its .text and
   .tail before the parent <p>'s end event called
   _get_text_content(p). The pre-existing _table_depth gate already
   protected <table> from the same defect.

Fix: clear only on end events, and gate clearing on a new
_content_depth counter that tracks open p/h1-h6/section elements
(mirroring _table_depth). Regression test exercises the SEC pattern
of span-wrapped paragraph text under forced streaming mode.
Owner

@dgunning dgunning left a comment

Strong fix, Kevin — diagnosis is precise (separating the start-event clearing from the missing content-depth gate as two distinct bugs is sharp), the _content_depth pattern mirrors _table_depth cleanly, and the cover-page recovery on Stepstone is convincing evidence the span bug was real.

One thing I want to understand before merging: your production table shows streaming-mode output now exceeds non-streaming by 25–67%.

| Filing | Pre-fix streaming | Non-streaming | Post-fix streaming | Δ vs non-streaming |
| --- | --- | --- | --- | --- |
| Stepstone 10-K | 1,140,129 | 1,429,000 | 1,816,872 | +27% |
| Stepstone 20-F | | 1,610,414 | 2,007,281 | +25% |
| 20-F 0001104659… | | 2,001,895 | 3,347,578 | +67% |
| 20-F 0001193125… | | 1,779,974 | 2,350,296 | +32% |
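
The Δ column can be recomputed from the character counts reported in the PR description. The snippet below is only that arithmetic; REPORTED and pct_delta are hypothetical names, and the row labels are shorthand for the four filings.

```python
# (post-fix streaming chars, non-streaming chars), copied from the PR tables.
REPORTED = {
    "Stepstone 10-K": (1_816_872, 1_429_000),
    "Stepstone 20-F": (2_007_281, 1_610_414),
    "20-F 0001104659": (3_347_578, 2_001_895),
    "20-F 0001193125": (2_350_296, 1_779_974),
}


def pct_delta(streaming: int, baseline: int) -> int:
    """Percentage by which streaming output exceeds the baseline, rounded."""
    return round((streaming / baseline - 1) * 100)


for filing, (streaming, baseline) in REPORTED.items():
    print(f"{filing}: {pct_delta(streaming, baseline):+d}%")
```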

The span bug explains why pre-fix streaming was below non-streaming. It doesn't explain why post-fix streaming is above it. Three possibilities I can think of:

  1. Pre-existing divergence between paths (different whitespace/tail handling) that was masked while streaming was losing content
  2. Non-streaming has its own separate content-loss bug — possibly span-related at a different scale
  3. Streaming is now over-including — sibling .tail accumulating twice, or buffer flush interacting with the deferred clear

Have you compared the actual content (not just length) between the two paths on one of these filings? The +67% on 0001104659… is large enough that I'd want to know whether streaming is now correct and non-streaming is buggy, or vice versa, or both paths have different (defensible) semantics.

The regression test asserts content presence in both paths but doesn't compare lengths or do a content diff — adding a length-comparison assertion (or a diff on a known fixture) would lock in whichever interpretation is correct.

Not blocking the fix to the span bug — that's clearly the right call regardless. Just want to understand the overshoot before declaring streaming "fixed."
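
One way the content comparison could be run, as a sketch under stated assumptions: normalize and compare_outputs are hypothetical helpers, and the two input strings stand in for the streaming and non-streaming filing.text() results. Whitespace is collapsed first so that pure whitespace/tail differences do not show up as content differences; lines that streaming emits more often than the baseline then point toward over-inclusion, while lines missing from streaming point toward residual loss.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Collapse whitespace runs and drop blank lines."""
    return [" ".join(line.split()) for line in text.splitlines() if line.strip()]


def compare_outputs(streaming: str, baseline: str) -> dict:
    """Summarize line-level differences between the two parser outputs."""
    s_lines, b_lines = normalize(streaming), normalize(baseline)
    s_count, b_count = Counter(s_lines), Counter(b_lines)
    return {
        "normalized_chars_streaming": sum(map(len, s_lines)),
        "normalized_chars_baseline": sum(map(len, b_lines)),
        # Counter subtraction keeps only positive counts.
        "extra_in_streaming": sorted((s_count - b_count).elements()),
        "missing_from_streaming": sorted((b_count - s_count).elements()),
    }


# Toy inputs: streaming repeats one line and pads it with extra whitespace.
streaming_text = "FORM 10-K\n\nItem 1.   Business\nItem 1.   Business\n"
baseline_text = "FORM 10-K\nItem 1. Business\n"
report = compare_outputs(streaming_text, baseline_text)
print(report)
```

If the normalized character counts converge while the raw lengths differ, the gap is mostly whitespace handling; a non-empty extra_in_streaming list on a real filing would suggest duplicated content instead.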
