feat: split Parquet files into row-group-sized morsels by adriangb · Pull Request #59 · pydantic/datafusion

adriangb · 2026-04-20T23:43:43Z

Which issue does this PR close?

Closes #.

Rationale for this change

Follow-up to PR apache#21351 (Dynamic work scheduling in FileStream), which explicitly called out "Splitting files into smaller units (e.g. across row groups)" as a deferred next step. Today each Parquet file becomes exactly one morsel — a single ParquetPushDecoder over the entire pruned ParquetAccessPlan. That coarse granularity:

- keeps the follow-on "steal row-group work across sibling FileStreams" effort blocked, since there's no row-group-sized unit for the SharedWorkSource to hand out;
- prevents pipelining row-group decode with downstream operator work inside a single partition;
- makes LIMIT and dynamic-filter early-stop wait for whole-file granularity.

The Morselizer / MorselPlanner abstraction already supports returning Vec<Box<dyn Morsel>> from a single MorselPlan, and ScanState already drains multi-morsel plans FIFO. This PR takes advantage of that: RowGroupsPrunedParquetOpen::build_stream now emits N streams, one per row-group chunk.
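
To make the FIFO draining concrete, here is a minimal sketch. The `Morsel` trait, `RowGroupChunk`, and `drain_fifo` below are hypothetical stand-ins invented for illustration; they are not DataFusion's actual trait or the real ScanState logic, and they only show why draining a multi-morsel plan front to back preserves row order.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a unit of scan work (not DataFusion's `Morsel`).
trait Morsel {
    /// Consumes the morsel and returns the rows it produces.
    fn run(self: Box<Self>) -> Vec<u32>;
}

/// Hypothetical morsel covering one row-group chunk of a file.
struct RowGroupChunk {
    rows: Vec<u32>,
}

impl Morsel for RowGroupChunk {
    fn run(self: Box<Self>) -> Vec<u32> {
        self.rows
    }
}

/// Drains a multi-morsel plan first-in, first-out, so the concatenated
/// output follows the order in which the planner emitted the morsels.
fn drain_fifo(plan: Vec<Box<dyn Morsel>>) -> Vec<u32> {
    let mut queue: VecDeque<Box<dyn Morsel>> = plan.into();
    let mut out = Vec::new();
    while let Some(morsel) = queue.pop_front() {
        out.extend(morsel.run());
    }
    out
}
```

With two chunks holding rows 1..=3 and 4..=6, `drain_fifo` yields the six rows in file order, which is the property the split relies on.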

What changes are included in this PR?

ParquetAccessPlan::split_into_chunks (new, pub(crate)). Packs consecutive surviving row groups into chunks bounded by a row budget (default 100k rows) and a compressed-byte budget (default 64 MiB). A single oversized row group still becomes its own chunk; no sub-row-group splitting is introduced in this PR. Skip entries are carried by the currently open chunk without forcing a boundary.
RowGroupsPrunedParquetOpen::build_stream now returns Vec<BoxStream> instead of a single stream. Per chunk: prepare the access plan slice, build a fresh ParquetPushDecoder, mint a fresh AsyncFileReader via ParquetFileReaderFactory::create_reader (the first chunk reuses the reader that loaded metadata / page index / bloom filters so its warm cache state is preserved). Row filter is rebuilt per chunk because RowFilter is not Clone.
ParquetOpenState::Ready now holds Vec<BoxStream>. ParquetMorselPlanner::plan maps each stream into a ParquetStreamMorsel. Empty Ready (file fully pruned) terminates the planner via Ok(None) instead of emitting an empty morsel plan.
EarlyStoppingStream attaches to the first chunk only. FilePruner is !Clone and stateful; keeping it on chunk 0 preserves whole-file early-stop on dynamic-filter narrowing.
Row-group reversal for sort pushdown is applied per-chunk via the existing PreparedAccessPlan::reverse, and the chunks Vec is reversed so the first emitted morsel corresponds to what was originally the file's last row groups.
Chunk budgets are fields on ParquetMorselizer, defaulting to the new module-level constants DEFAULT_MORSEL_MAX_ROWS / DEFAULT_MORSEL_MAX_COMPRESSED_BYTES. Wiring through ParquetSource for user configuration is deliberately deferred to a follow-up — this PR keeps the public surface unchanged.
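
The packing rule in the first item above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `Access` and `pack_row_groups` are hypothetical names, the plan is reduced to row and compressed-byte counts, and the boundary test (close the open chunk when adding the next row group would exceed either budget) is one plausible reading of the description.

```rust
/// Hypothetical, simplified per-row-group plan entry.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Access {
    Skip,
    Scan { rows: u64, compressed_bytes: u64 },
}

/// Packs consecutive row groups into chunks and returns, per chunk, the
/// row-group indices it covers. `Skip` entries ride along in the open chunk.
fn pack_row_groups(plan: &[Access], max_rows: u64, max_bytes: u64) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let (mut rows, mut bytes) = (0u64, 0u64);
    let mut has_scan = false; // does the open chunk contain any scanned row group?

    for (idx, entry) in plan.iter().enumerate() {
        match *entry {
            // A skipped row group never forces a chunk boundary.
            Access::Skip => current.push(idx),
            Access::Scan { rows: r, compressed_bytes: b } => {
                let over = rows.saturating_add(r) > max_rows
                    || bytes.saturating_add(b) > max_bytes;
                // Close the open chunk only if it already holds work, so an
                // oversized row group still gets a chunk of its own.
                if has_scan && over {
                    chunks.push(std::mem::take(&mut current));
                    rows = 0;
                    bytes = 0;
                }
                current.push(idx);
                rows = rows.saturating_add(r);
                bytes = bytes.saturating_add(b);
                has_scan = true;
            }
        }
    }
    // A trailing chunk with only `Skip` entries produces no work.
    if has_scan {
        chunks.push(current);
    }
    chunks
}
```

Under these assumptions a fully skipped plan yields no chunks at all, which lines up with the fully pruned file case terminating the planner rather than emitting an empty morsel plan.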

Are these changes tested?

Yes.

Unit tests on ParquetAccessPlan::split_into_chunks cover: empty plan, all-skip, one-chunk-per-row-group when budget is tight, packing when budget allows, oversized single row group, byte-bounded splits, Skip preserved inside a chunk, Skip between chunks, Selection preserved verbatim in its chunk.
Integration tests in opener.rs:
- test_row_group_split_produces_multiple_morsels — 3 row groups × 3 rows with a row budget of 3 → 3 morsels; concatenated output matches the single-morsel reference.
- test_row_group_split_packs_within_budget — budget of 6 rows packs 2 row groups into the first morsel, leaving the third in its own.
- test_row_group_split_honors_user_skip — a user-supplied ParquetAccessPlan with Skip in the middle round-trips correctly.
- test_row_group_split_with_reverse — reverse_row_groups=true + split budget emits morsels with the originally-last row group first.
All 111 existing datasource-parquet library tests still pass (including the full test_reverse_scan_* suite). datafusion core lib (403) and core integration (931) also pass unchanged.
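
The expectations of the integration tests can be reproduced with a small model. The sketch below is not the test code from opener.rs; `pack_by_rows` and `reverse_chunks` are hypothetical helpers that track only row counts and row-group indices, assuming a chunk is closed when the next row group would exceed the row budget.

```rust
/// Hypothetical helper: packs row groups by row budget only and returns
/// the row-group indices in each chunk.
fn pack_by_rows(row_counts: &[u64], max_rows: u64) -> Vec<Vec<usize>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut rows = 0u64;
    for (idx, &r) in row_counts.iter().enumerate() {
        // Close the open chunk if this row group would exceed the budget.
        if !current.is_empty() && rows + r > max_rows {
            chunks.push(std::mem::take(&mut current));
            rows = 0;
        }
        current.push(idx);
        rows += r;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

/// Hypothetical helper: reverses the row groups inside each chunk and then
/// the chunk list, so the file's last row groups are emitted first.
fn reverse_chunks(mut chunks: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    for chunk in chunks.iter_mut() {
        chunk.reverse();
    }
    chunks.reverse();
    chunks
}
```

For three row groups of three rows each, a budget of 3 gives three single-row-group chunks, a budget of 6 gives `[0, 1]` and `[2]`, and reversing the latter gives `[2]` followed by `[1, 0]`.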

Are there any user-facing changes?

No observable semantic change — scans still produce the same rows in the same order. Runtime behavior changes: a multi-row-group Parquet scan now produces multiple morsels per file, each with its own ParquetPushDecoder + AsyncFileReader. For very small row groups this introduces per-morsel setup overhead that's amortized by the packing budget; for queries that hit LIMIT or dynamic-filter prunes mid-file, it generally lets the scan stop sooner.

🤖 Generated with Claude Code

adriangb · 2026-04-20T23:48:49Z

run benchmarks

…pache#21708)

…nputs (apache#21704)

## Which issue does this PR close?

- Closes apache#21702.

## Rationale for this change

`array_concat` hit an internal cast error when given a mix of `List` and `LargeList` (or `FixedSizeList` and `LargeList`) arguments:

```sql
> select array_concat(make_array(1, 2), arrow_cast([3, 4], 'LargeList(Int64)'));
DataFusion error: Internal error: could not cast array of type List(Int64) to arrow_array::array::list_array::GenericListArray<i64>.
```

`ArrayConcat::coerce_types` was coercing only the base element type, leaving the outer container alone. When the resolved return type is `LargeList`, `array_concat_inner` later tries to downcast each arg to `GenericListArray<i64>`, which fails for any `List` argument that slipped through.

## What changes are included in this PR?

In `ArrayConcat::coerce_types`, after coercing the base type, also promote each input's outermost `List` to `LargeList` when the return type is a `LargeList`. `FixedSizeList` inputs already go through `FixedSizedListToList` first and then get promoted too. Per-arg dimensionality is preserved, so nested cases keep working with `align_array_dimensions`.

## Are these changes tested?

Yes, added sqllogictests in `array_concat.slt` covering:

- `List` + `LargeList`
- `LargeList` + `List`
- `FixedSizeList` + `LargeList`
- Three-way mix `List`, `LargeList`, `List`

Each one also asserts `arrow_typeof(...) = LargeList(Int64)`.

## Are there any user-facing changes?

Queries that previously returned an internal cast error now return the concatenated `LargeList` as expected. No API changes.

…messages (apache#20387)

## Which issue does this PR close?

- Closes apache#20386.

## Rationale for this change

The `memory_limit` configuration (`RuntimeEnvBuilder::new().with_memory_limit()`) uses the `greedy` memory pool by default. However, if `memory_pool` (`RuntimeEnvBuilder::new().with_memory_pool()`) is set, it overrides that default with the configured pool, such as `fair`. If neither `memory_limit` nor `memory_pool` is set, the `unbounded` memory pool is used. It can therefore be useful to expose the pool that was ultimately selected as part of the `ResourcesExhausted` error message, so that end users are aware of it and can switch pools (`greedy`, `fair`, `unbounded`) if needed.

[This comparison table](lance-format/lance#3601 (comment)) is an example use case comparing the runtime behavior of the `greedy` and `fair` memory pools. Exposing the pool in use in native logs helps with this kind of comparison.

The following example use cases use `datafusion-cli`.

**Case1**: datafusion-cli result when `memory-limit` and `top-memory-consumers > 0` are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --command 'select * from generate_series(1,500000) as t1(v1) order by v1;' --top-memory-consumers 3
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'.
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as: ExternalSorterMerge[0]#2(can spill: false) consumed 10.0 MB, peak 10.0 MB, DataFusion-Cli#0(can spill: false) consumed 0.0 B, peak 0.0 B, ExternalSorter[0]#1(can spill: true) consumed 0.0 B, peak 0.0 B.
Error: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: greedy(used: 10.0 MB, pool_size: 10.0 MB)
```

**Case2**: datafusion-cli result when `memory-limit` and `top-memory-consumers = 0` (disabling top memory consumers logging) are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --command 'select * from generate_series(1,500000) as t1(v1) order by v1;' --top-memory-consumers 0
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'.
caused by
Resources exhausted: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: greedy(used: 10.0 MB, pool_size: 10.0 MB)
```

**Case3**: datafusion-cli result when `memory-limit`, `memory-pool` and `top-memory-consumers > 0` are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --mem-pool-type fair --top-memory-consumers 3 --command 'select * from generate_series(1,500000) as t1(v1) order by v1;'
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'.
caused by
Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as: ExternalSorterMerge[0]#2(can spill: false) consumed 10.0 MB, peak 10.0 MB, ExternalSorter[0]#1(can spill: true) consumed 0.0 B, peak 0.0 B, DataFusion-Cli#0(can spill: false) consumed 0.0 B, peak 0.0 B.
Error: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: fair(pool_size: 10.0 MB)
```

## What changes are included in this PR?

- Add a name property to MemoryPool instances.
- Expose the MemoryPool in use in Resources Exhausted error messages.

## Are these changes tested?

Yes, and existing test cases are updated.

## Are there any user-facing changes?

Yes: the Resources Exhausted error messages are updated.

…pache#21749)

## Which issue does this PR close?

- Closes apache#21751.

## Rationale for this change

Profiling the planner suggests that a surprising amount of time was being spent doing tree rewriting in the logical optimizer. One culprit is `TreeNodeContainer::map_elements()` for `Box<C>` and `Arc<C>`, which do the following:

* Fetch the inner `C` value from the `Box`/`Arc`
* Pass the inner value to the closure
* Wrap the return value of the closure in a newly allocated `Box` / `Arc`, respectively

This allocates a fresh `Box` or `Arc` for every node visited while walking an expression or logical plan, even if the tree rewrite we're doing didn't modify the expression/plan node.

Instead, we can reuse the current `Box<C>` or `Arc<C>`: use `std::mem::take()` to swap the inner value with `C::default()`, pass the inner value to the closure, and put the result back in the original container. Swapping the inner value with `C::default()` means the container always has a valid value, which is important if the closure panics. For `Arc<C>`, we need to use `Arc::make_mut()`, which only clones if the `Arc` is not unique.

This reduces the bytes allocated to plan TPC-H Q13 by ~22% (988 kB -> 765 kB), and reduces allocated blocks by 8.5% (210k -> 192k).

## What changes are included in this PR?

* Optimize `Box<C>::map_elements()` and `Arc<C>::map_elements()` as described above
* Change `map_children()` for `Expr::Alias` to use `map_elements()`, rather than invoking `f(*expr)` directly; this ensures that it can take advantage of this optimization
* Make `LogicalPlan::default()` use a shared `DFSchema`, rather than allocating a fresh `DFSchema` for every call. Because `default()` is now in the hot path for tree rewriting, it is important that it is cheap
* Add unit tests for new `map_elements()` behavior
* Add note to migration guide for breaking API change

## Are these changes tested?

Yes, plus new unit tests added.

## Are there any user-facing changes?

Yes: `TreeNodeContainer` impls for `Box<C>` and `Arc<C>` now require `C: Default`. This is a breaking API change for third-party code that implements `TreeNodeContainer` for a custom type. The fix is usually straightforward.
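
The container-reuse technique described in this commit message can be sketched with the standard library alone. The functions below are hypothetical, simplified stand-ins: the names `map_box_in_place` and `map_arc_in_place` are invented here, and the real `map_elements()` signature in DataFusion differs (this sketch uses a plain `Result` instead). They only illustrate the `std::mem::take` and `Arc::make_mut` pattern.

```rust
use std::sync::Arc;

/// Rewrites the value inside a `Box` without allocating a new `Box`.
/// `std::mem::take` leaves `C::default()` in the box while `f` runs, so
/// the box always holds a valid value even if `f` panics or fails.
fn map_box_in_place<C: Default, E>(
    mut boxed: Box<C>,
    f: impl FnOnce(C) -> Result<C, E>,
) -> Result<Box<C>, E> {
    let inner = std::mem::take(&mut *boxed);
    *boxed = f(inner)?;
    Ok(boxed)
}

/// Same idea for `Arc`: `Arc::make_mut` hands out `&mut C` directly when
/// the `Arc` is unique and clones the inner value only when it is shared.
fn map_arc_in_place<C: Default + Clone, E>(
    mut shared: Arc<C>,
    f: impl FnOnce(C) -> Result<C, E>,
) -> Result<Arc<C>, E> {
    let slot = Arc::make_mut(&mut shared);
    let inner = std::mem::take(slot);
    *slot = f(inner)?;
    Ok(shared)
}
```

In this sketch a uniquely owned `Box` or `Arc` keeps its heap address across the rewrite, while a shared `Arc` is cloned once so other holders still see the old value. The `C: Default` bound is what makes the `take` step possible, which matches the new requirement noted above.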

…nts (apache#20904)

## Which issue does this PR close?

Does not close an issue, but is part of apache#20766.

## Rationale for this change

Details are in apache#20766. The main idea is to use existing distinct-count information to optimize joins, similar to how Spark and Trino do.

## What changes are included in this PR?

This PR extends cardinality estimation for semi/anti joins using distinct counts.

## Are these changes tested?

I've added cases, but I'm not sure whether I should also have added benchmarks for this.

## Are there any user-facing changes?

No

---------

Co-authored-by: Alessandro Solimando <alessandro.solimando@gmail.com>

## Which issue does this PR close?

- Closes #.

## Rationale for this change

One test case in the `datafusion-cli` crate fails locally if you run all tests through `cargo nextest run`, but passes with `cargo test`:

```
FAIL [ 0.375s] datafusion-cli::cli_integration cli_explain_environment_overrides
```

The reason is that `nextest` triggers a different build graph, which enforces a feature flag in the `serde_json` dependency. This PR enforces this feature in the `dev-dependencies` of the `datafusion-cli` crate, so the test becomes deterministic under different test setups.

apache#21502 fixed a similar issue and also explains why the feature is not enabled in the global dependencies inside `Cargo.toml`.

…argo-deps group (apache#21760)

Bumps the all-other-cargo-deps group with 1 update: [aws-config](https://github.com/smithy-lang/smithy-rs).

Updates `aws-config` from 1.8.15 to 1.8.16

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…21758)

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.1 to 4.35.2.

Release notes, sourced from [github/codeql-action's releases](https://github.com/github/codeql-action/releases):

## v4.35.2

- The undocumented TRAP cache cleanup feature that could be enabled using the `CODEQL_ACTION_CLEANUP_TRAP_CACHES` environment variable is deprecated and will be removed in May 2026. If you are affected by this, we recommend disabling TRAP caching by passing the `trap-caching: false` input to the `init` Action. [#3795](https://redirect.github.com/github/codeql-action/pull/3795)
- The Git version 2.36.0 requirement for improved incremental analysis now only applies to repositories that contain submodules. [#3789](https://redirect.github.com/github/codeql-action/pull/3789)
- Python analysis on GHES no longer extracts the standard library, relying instead on models of the standard library. This should result in significantly faster extraction and analysis times, while the effect on alerts should be minimal. [#3794](https://redirect.github.com/github/codeql-action/pull/3794)
- Fixed a bug in the validation of OIDC configurations for private registries that was added in CodeQL Action 4.33.0 / 3.33.0. [#3807](https://redirect.github.com/github/codeql-action/pull/3807)
- Update default CodeQL bundle version to [2.25.2](https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.25.2). [#3823](https://redirect.github.com/github/codeql-action/pull/3823)

Additional commits viewable in [compare view](https://github.com/github/codeql-action/compare/c10b8064de6f491fea524254123dbe5e09572f13...95e58e9a2cdfd71adc6e0353d5c52f41a045d225).

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…che#21757)

Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.75.10 to 2.75.18.

Release notes, sourced from [taiki-e/install-action's releases](https://github.com/taiki-e/install-action/releases):

## 2.75.18

- Update `vacuum@latest` to 0.26.1.
- Update `wasm-tools@latest` to 1.247.0.
- Update `mise@latest` to 2026.4.16.
- Update `espup@latest` to 0.17.1.
- Update `trivy@latest` to 0.70.0.

## 2.75.17

- Update `tombi@latest` to 0.9.18.
- Update `mise@latest` to 2026.4.15.

## 2.75.16

- Update `uv@latest` to 0.11.7.
- Update `mise@latest` to 2026.4.14.
- Update `vacuum@latest` to 0.25.9.
- Update `cargo-machete@latest` to 0.9.2.
- Update `cargo-deny@latest` to 0.19.4.

## 2.75.15

- Update `cargo-nextest@latest` to 0.9.133.
- Update `biome@latest` to 2.4.12.

## 2.75.14

- Implement potential workaround for [windows-11-arm runner bug](https://redirect.github.com/actions/partner-runner-images/issues/169) which sometimes causes installation failure. The issue where this bug affected the startup of bash was addressed in 2.71.2, but we received a report that the [same problem seems to occur when starting other commands as well](https://redirect.github.com/taiki-e/install-action/pull/1657#issuecomment-4252717651).
- Update `cargo-deny@latest` to 0.19.2.

## 2.75.13

- Update `zizmor@latest` to 1.24.1.

## 2.75.12

- Update `typos@latest` to 1.45.1.
- Update `cargo-xwin@latest` to 0.21.5.
- Update `cargo-binstall@latest` to 1.18.1.

## 2.75.11

... (truncated)

Additional commits viewable in [compare view](https://github.com/taiki-e/install-action/compare/85b24a67ef0c632dfefad70b9d5ce8fddb040754...055f5df8c3f65ea01cd41e9dc855becd88953486).

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Each Parquet file previously produced a single morsel containing one `ParquetPushDecoder` over the full pruned `ParquetAccessPlan`. Morselize at row-group granularity instead: after all pruning work is done, pack surviving row groups into chunks bounded by a per-morsel row budget and compressed-byte budget (defaults: 100k rows, 64 MiB). Each chunk becomes its own stream so the executor can interleave row-group decode work with other operators and, in a follow-up, let sibling `FileStream`s steal row-group-sized units of work across partitions.

A single oversized row group still becomes its own morsel; no sub-row-group splitting is introduced.

`EarlyStoppingStream` (which is driven by the non-Clone `FilePruner`) is attached only to the first morsel's stream so the whole file can still short-circuit on dynamic-filter narrowing. Row-group reversal is applied per-chunk on the `PreparedAccessPlan` and the chunk list is reversed so reverse output order is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
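The packing rule described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `split_into_chunks`: the `RowGroupInfo` struct, the function signatures, and the use of plain row-group indices in place of a `ParquetAccessPlan` are all assumptions made for the example.

```rust
/// Hypothetical per-row-group summary; the real code works on a
/// `ParquetAccessPlan` plus Parquet metadata instead.
#[derive(Debug, Clone, Copy)]
struct RowGroupInfo {
    rows: u64,
    compressed_bytes: u64,
    /// `false` means the row group was pruned (a "skip" entry).
    selected: bool,
}

/// Greedily pack consecutive surviving row groups into chunks so that each
/// chunk stays within both budgets. A row group that exceeds a budget on its
/// own still gets a chunk to itself. Returns row-group indices per chunk.
fn split_into_chunks(
    groups: &[RowGroupInfo],
    max_rows: u64,
    max_bytes: u64,
) -> Vec<Vec<usize>> {
    let mut chunks = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let (mut rows, mut bytes) = (0u64, 0u64);

    for (idx, rg) in groups.iter().enumerate() {
        if !rg.selected {
            // Skipped row groups never force a chunk boundary.
            continue;
        }
        let over_budget = rows.saturating_add(rg.rows) > max_rows
            || bytes.saturating_add(rg.compressed_bytes) > max_bytes;
        if over_budget && !current.is_empty() {
            // Close the open chunk before starting a new one.
            chunks.push(std::mem::take(&mut current));
            rows = 0;
            bytes = 0;
        }
        current.push(idx);
        rows = rows.saturating_add(rg.rows);
        bytes = bytes.saturating_add(rg.compressed_bytes);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

/// For reverse scans: reverse the row groups inside each chunk and the
/// order of the chunks, so the overall output order is fully reversed.
fn reverse_chunks(mut chunks: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    for chunk in &mut chunks {
        chunk.reverse();
    }
    chunks.reverse();
    chunks
}
```

With a 100k-row budget, row groups of 60k, 60k, (pruned), 30k, 200k and 10k rows would be packed as `[0]`, `[1, 3]`, `[4]`, `[5]` under this sketch: the 200k-row group exceeds the budget but still forms a chunk on its own.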

The previous `build_stream` built every morsel's `RowFilter`, `ParquetPushDecoder`, `AsyncFileReader`, and `Projector` eagerly in a single loop inside the file planner, before any morsel was scheduled. That loop ran on the scheduler thread and was visible as a 10–15% regression vs. main on ClickBench-partitioned queries that have many row-group morsels per file (e.g. Q15, Q16 at pushdown=off).

Replace `ParquetStreamMorsel` (which held a pre-built `BoxStream`) with `ParquetLazyMorsel`, which holds only the per-chunk `ParquetAccessPlan` plus an `Arc<LazyMorselShared>` of the file-level state. The decoder and reader are constructed inside `Morsel::into_stream`, so each morsel pays its setup cost only when the scheduler actually picks it up, and the work is distributed across worker threads instead of serialised on the planner.

`FilePruner` is `!Clone` and drives whole-file early-stop via `EarlyStoppingStream`, so it still lives on chunk 0's morsel only.

The warm `async_file_reader` from metadata / page-index / bloom-filter load is dropped at the end of `build_stream`; every morsel mints a fresh reader via the factory at `into_stream` time. For both built-in factories (`DefaultParquetFileReaderFactory`, `CachedParquetFileReaderFactory`) the "warm cache" benefit of reusing a reader is negligible because the underlying `Arc<dyn ObjectStore>` / `Arc<dyn FileMetadataCache>` is already shared across readers, so the simplification is free.

Local ClickBench-partitioned, 10 iterations, pushdown=off (M-series):

| Query | main (ms) | eager, before (ms) | lazy, this commit (ms) |
|-------|----------:|-------------------:|-----------------------:|
| Q14   |       325 |                335 |                    313 |
| Q15   |       309 |                358 |                    302 |
| Q16   |       911 |               1049 |                    786 |
| Q24   |        48 |                 55 |                     56 |
| Q26   |        41 |                 45 |                     45 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
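The eager-versus-lazy distinction can be illustrated with a small standard-library-only sketch. Everything here is hypothetical: `SharedFileState`, `LazyMorsel`, and `plan_morsels` are stand-ins for `LazyMorselShared`, `ParquetLazyMorsel`, and the planner, and a plain iterator stands in for the async stream. The counter exists only to make visible when per-chunk setup work happens.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical file-level state shared by all morsels of one file.
struct SharedFileState {
    path: String,
    /// Counts how many per-chunk "decoders" have been built so far.
    decoders_built: AtomicUsize,
}

/// Hypothetical lazy morsel: holds only its chunk of row groups plus a
/// cheap handle to the shared state. No decoder exists yet.
struct LazyMorsel {
    row_groups: Vec<usize>,
    shared: Arc<SharedFileState>,
}

impl LazyMorsel {
    /// Setup cost is paid here, when a worker actually picks the morsel up,
    /// rather than in the planning loop.
    fn into_stream(self) -> impl Iterator<Item = String> {
        let LazyMorsel { row_groups, shared } = self;
        // Stand-in for building the reader, row filter and decoder.
        shared.decoders_built.fetch_add(1, Ordering::Relaxed);
        row_groups
            .into_iter()
            .map(move |rg| format!("{}#rg{}", shared.path, rg))
    }
}

/// Planning only clones the `Arc` and moves the chunk; it does no
/// per-chunk setup work.
fn plan_morsels(
    shared: &Arc<SharedFileState>,
    chunks: Vec<Vec<usize>>,
) -> Vec<LazyMorsel> {
    chunks
        .into_iter()
        .map(|row_groups| LazyMorsel {
            row_groups,
            shared: Arc::clone(shared),
        })
        .collect()
}
```

In this sketch, calling `plan_morsels` leaves the counter at zero; it only increments once per morsel when `into_stream` is called, which mirrors the idea of moving setup off the planner and onto whichever worker runs the morsel.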

github-actions Bot added the datasource label Apr 20, 2026

adriangb mentioned this pull request Apr 20, 2026: Adaptive filter scheduling + row-group morsel split (adriangb/datafusion#9, open, 5 tasks)

comphead and others added 10 commits April 21, 2026 00:27

adriangb force-pushed the row-group-morsel-split branch from 311a854 to 5b0a69a Compare April 21, 2026 13:01

github-actions Bot added documentation Improvements or additions to documentation sqllogictest common physical-plan logical-expr development-process execution functions labels Apr 21, 2026

adriangb force-pushed the row-group-morsel-split branch from 5b0a69a to ff805cf Compare April 21, 2026 14:19

adriangb wants to merge 12 commits into main from row-group-morsel-split

7 participants
