Commit 8ef53b7
refactor: defer morsel decoder build to Morsel::into_stream
The previous `build_stream` built every morsel's `RowFilter`,
`ParquetPushDecoder`, `AsyncFileReader`, and `Projector` eagerly in a
single loop inside the file planner — before any morsel was scheduled.
That loop ran on the scheduler thread and was visible as a 10–15%
regression vs. main on ClickBench-partitioned queries that have many
row-group morsels per file (e.g. Q15, Q16 at pushdown=off).
Replace `ParquetStreamMorsel` (which held a pre-built `BoxStream`) with
`ParquetLazyMorsel`, which holds only the per-chunk `ParquetAccessPlan`
plus an `Arc<LazyMorselShared>` of the file-level state. The decoder
and reader are constructed inside `Morsel::into_stream`, so each
morsel pays its setup cost only when the scheduler actually picks it
up, and the work is distributed across worker threads instead of
serialised on the planner.
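A minimal sketch of the lazy pattern described above, not the code in this commit. All names here (`SketchShared`, `SketchPlan`, `SketchDecoder`, `SketchLazyMorsel`, `plan_file`) are hypothetical stand-ins, and the real `Morsel::into_stream` signature may differ. The counter exists only to show when decoder setup actually runs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical file-level state shared by every morsel of one file.
struct SketchShared {
    path: String,
    decoders_built: AtomicUsize, // instrumentation for this sketch only
}

// Hypothetical per-chunk access plan: which row groups this morsel reads.
struct SketchPlan {
    row_groups: Vec<usize>,
}

// Stand-in for the expensive per-morsel decoder/reader setup.
struct SketchDecoder {
    path: String,
    row_groups: Vec<usize>,
}

impl SketchDecoder {
    fn build(shared: &SketchShared, plan: SketchPlan) -> Self {
        shared.decoders_built.fetch_add(1, Ordering::SeqCst);
        SketchDecoder { path: shared.path.clone(), row_groups: plan.row_groups }
    }
}

// The lazy morsel holds only cheap data: its plan plus an Arc of shared state.
struct SketchLazyMorsel {
    plan: SketchPlan,
    shared: Arc<SketchShared>,
}

impl SketchLazyMorsel {
    // Setup cost is paid here, on whichever worker picks the morsel up.
    fn into_stream(self) -> impl Iterator<Item = String> {
        let SketchDecoder { path, row_groups } =
            SketchDecoder::build(&self.shared, self.plan);
        row_groups.into_iter().map(move |rg| format!("{}#rg{}", path, rg))
    }
}

// The planner only wraps each chunk; it builds no decoders.
fn plan_file(path: &str, chunks: Vec<Vec<usize>>) -> (Arc<SketchShared>, Vec<SketchLazyMorsel>) {
    let shared = Arc::new(SketchShared {
        path: path.to_string(),
        decoders_built: AtomicUsize::new(0),
    });
    let morsels = chunks
        .into_iter()
        .map(|row_groups| SketchLazyMorsel {
            plan: SketchPlan { row_groups },
            shared: Arc::clone(&shared),
        })
        .collect();
    (shared, morsels)
}
```

In this sketch, `plan_file` returns with `decoders_built` still at zero, and the count rises by one each time a morsel is consumed through `into_stream`.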
`FilePruner` is `!Clone` and drives whole-file early-stop via
`EarlyStoppingStream`, so it still lives on chunk 0's morsel only.
The warm `async_file_reader` from metadata / page-index / bloom-filter
load is dropped at the end of `build_stream` — every morsel mints a
fresh reader via the factory at `into_stream` time. For both built-in
factories (`DefaultParquetFileReaderFactory`,
`CachedParquetFileReaderFactory`) the "warm cache" benefit of reusing
a reader is negligible because the underlying `Arc<dyn ObjectStore>` /
`Arc<dyn FileMetadataCache>` is already shared across readers, so the
simplification is free.
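The argument above can be illustrated with a small sketch, assuming a factory that hands every reader a clone of the same `Arc`. `SketchStore`, `SketchReader`, and `SketchReaderFactory` are hypothetical names, not the DataFusion factory API.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the shared object store or metadata cache.
struct SketchStore {
    bytes: Vec<u8>,
}

// A reader is a thin handle: it owns no cache, only an Arc to the store.
struct SketchReader {
    store: Arc<SketchStore>,
}

impl SketchReader {
    fn read(&self, start: usize, end: usize) -> &[u8] {
        &self.store.bytes[start..end]
    }
}

struct SketchReaderFactory {
    store: Arc<SketchStore>,
}

impl SketchReaderFactory {
    // Minting a reader costs one Arc clone; the underlying state is shared.
    fn create_reader(&self) -> SketchReader {
        SketchReader { store: Arc::clone(&self.store) }
    }
}
```

Under that assumption, dropping the reader used for metadata loading and creating a new one per morsel loses nothing, because every reader points at the same shared state.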
Local ClickBench-partitioned, 10 iterations, pushdown=off (M-series):
| Query | main (ms) | eager, before (ms) | lazy, this commit (ms) |
|-------|----------:|-------------------:|-----------------------:|
| Q14 | 325 | 335 | 313 |
| Q15 | 309 | 358 | 302 |
| Q16 | 911 | 1049 | 786 |
| Q24 | 48 | 55 | 56 |
| Q26 | 41 | 45 | 45 |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ff805cf
1 file changed
Lines changed: 239 additions & 167 deletions