feat: split Parquet files into row-group-sized morsels#59
Draft
run benchmarks |
5 tasks
…pache#21708)

## Which issue does this PR close?

- Closes #.

## Rationale for this change

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
…nputs (apache#21704)

## Which issue does this PR close?

- Closes apache#21702.

## Rationale for this change

`array_concat` hit an internal cast error when given a mix of `List` and `LargeList` (or `FixedSizeList` and `LargeList`) arguments:

```sql
> select array_concat(make_array(1, 2), arrow_cast([3, 4], 'LargeList(Int64)'));
DataFusion error: Internal error: could not cast array of type List(Int64) to arrow_array::array::list_array::GenericListArray<i64>.
```

`ArrayConcat::coerce_types` was coercing only the base element type, leaving the outer container alone. When the resolved return type is `LargeList`, `array_concat_inner` later tries to downcast each arg to `GenericListArray<i64>`, which fails for any `List` argument that slipped through.

## What changes are included in this PR?

In `ArrayConcat::coerce_types`, after coercing the base type, also promote each input's outermost `List` to `LargeList` when the return type is a `LargeList`. `FixedSizeList` inputs already go through `FixedSizedListToList` first and then get promoted too. Per-arg dimensionality is preserved, so nested cases keep working with `align_array_dimensions`.

## Are these changes tested?

Yes, added sqllogictests in `array_concat.slt` covering:

- `List` + `LargeList`
- `LargeList` + `List`
- `FixedSizeList` + `LargeList`
- Three-way mix `List`, `LargeList`, `List`

Each one also asserts `arrow_typeof(...) = LargeList(Int64)`.

## Are there any user-facing changes?

Queries that previously returned an internal cast error now return the concatenated `LargeList` as expected. No API changes.
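The promotion rule described above can be illustrated with a minimal, self-contained sketch. It does not use arrow-rs or DataFusion: `ToyType`, `promote_outer`, and `coerce_args` are hypothetical names invented for illustration, and the real `ArrayConcat::coerce_types` also handles element-type coercion and `FixedSizeList`, which this sketch omits.

```rust
// Toy stand-in for Arrow's `DataType`; NOT the real arrow-rs enum.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    List(Box<ToyType>),
    LargeList(Box<ToyType>),
}

// Promote only the outermost `List`; nested levels are left untouched,
// so each argument keeps its dimensionality.
fn promote_outer(t: ToyType) -> ToyType {
    match t {
        ToyType::List(inner) => ToyType::LargeList(inner),
        other => other,
    }
}

// Hypothetical coercion step: promote every argument when the resolved
// return type is a `LargeList`, otherwise return the arguments unchanged.
fn coerce_args(args: Vec<ToyType>, return_type: &ToyType) -> Vec<ToyType> {
    if matches!(return_type, ToyType::LargeList(_)) {
        args.into_iter().map(promote_outer).collect()
    } else {
        args
    }
}

fn main() {
    let list = ToyType::List(Box::new(ToyType::Int64));
    let large = ToyType::LargeList(Box::new(ToyType::Int64));
    // A `List` + `LargeList` mix is coerced so both arguments are `LargeList`.
    let coerced = coerce_args(vec![list, large.clone()], &large);
    assert!(coerced.iter().all(|t| *t == large));
}
```

Once every argument shares the same outer container, a later downcast to a single list array type no longer has a mismatched input to fail on.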
…messages (apache#20387)

## Which issue does this PR close?

- Closes apache#20386.

## Rationale for this change

The `memory_limit` configuration (`RuntimeEnvBuilder::new().with_memory_limit()`) uses the `greedy` memory pool by default. However, if `memory_pool` (`RuntimeEnvBuilder::new().with_memory_pool()`) is set, it overrides that default with the configured pool, such as `fair`. If neither `memory_limit` nor `memory_pool` is set, the `unbounded` memory pool is used. It is therefore useful to expose the pool that was ultimately selected as part of the `ResourcesExhausted` error message, so that end users know which pool is in use and can switch to another one (`greedy`, `fair`, `unbounded`) if needed.

- Also, [this comparison table](lance-format/lance#3601 (comment)) is an example use case comparing the runtime behavior of the `greedy` and `fair` memory pools. Exposing the pool in use as part of the native logs helps with this kind of comparison.

The following example use cases were produced with `datafusion-cli`:

**Case1**: datafusion-cli result when `memory-limit` and `top-memory-consumers > 0` are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --command 'select * from generate_series(1,500000) as t1(v1) order by v1;' --top-memory-consumers 3
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'. caused by Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as: ExternalSorterMerge[0]#2(can spill: false) consumed 10.0 MB, peak 10.0 MB, DataFusion-Cli#0(can spill: false) consumed 0.0 B, peak 0.0 B, ExternalSorter[0]#1(can spill: true) consumed 0.0 B, peak 0.0 B.
Error: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: greedy(used: 10.0 MB, pool_size: 10.0 MB)
```

**Case2**: datafusion-cli result when `memory-limit` and `top-memory-consumers = 0` (disabling top memory consumers logging) are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --command 'select * from generate_series(1,500000) as t1(v1) order by v1;' --top-memory-consumers 0
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'. caused by Resources exhausted: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: greedy(used: 10.0 MB, pool_size: 10.0 MB)
```

**Case3**: datafusion-cli result when `memory-limit`, `memory-pool` and `top-memory-consumers > 0` are set:

```
eren.avsarogullari@AWGNPWVK961 debug % ./datafusion-cli --memory-limit 10M --mem-pool-type fair --top-memory-consumers 3 --command 'select * from generate_series(1,500000) as t1(v1) order by v1;'
DataFusion CLI v53.0.0
Error: Not enough memory to continue external sort. Consider increasing the memory limit config: 'datafusion.runtime.memory_limit', or decreasing the config: 'datafusion.execution.sort_spill_reservation_bytes'. caused by Resources exhausted: Additional allocation failed for ExternalSorter[0] with top memory consumers (across reservations) as: ExternalSorterMerge[0]#2(can spill: false) consumed 10.0 MB, peak 10.0 MB, ExternalSorter[0]#1(can spill: true) consumed 0.0 B, peak 0.0 B, DataFusion-Cli#0(can spill: false) consumed 0.0 B, peak 0.0 B.
Error: Failed to allocate additional 128.0 KB for ExternalSorter[0] with 0.0 B already allocated for this reservation - 0.0 B remain available for the total memory pool: fair(pool_size: 10.0 MB)
```

## What changes are included in this PR?

- Add a name property to MemoryPool instances
- Expose the MemoryPool in use in Resources Exhausted error messages

## Are these changes tested?

Yes, and existing test cases are updated.

## Are there any user-facing changes?

Yes, the Resources Exhausted error messages are updated.
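As a rough illustration of the idea (a pool that knows its own name and includes it in the exhaustion message), here is a standalone sketch. It is not DataFusion's `MemoryPool` trait: `ToyPool`, `ToyGreedyPool`, and the exact message wording are assumptions made up for this example, and sizes are plain byte counts rather than the human-readable units shown in the logs above.

```rust
// Hypothetical stand-in for a memory pool; NOT DataFusion's `MemoryPool` trait.
trait ToyPool {
    // Short identifier of the pool implementation, e.g. "greedy".
    fn name(&self) -> &str;
    // Try to reserve `bytes`; on failure the error names the pool in use.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String>;
}

struct ToyGreedyPool {
    pool_size: usize,
    used: usize,
}

impl ToyPool for ToyGreedyPool {
    fn name(&self) -> &str {
        "greedy"
    }

    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let available = self.pool_size.saturating_sub(self.used);
        if bytes > available {
            // Include the pool name and its state so the reader of the
            // error knows which pool rejected the allocation.
            return Err(format!(
                "Failed to allocate additional {} B - {} B remain available \
                 for the total memory pool: {}(used: {} B, pool_size: {} B)",
                bytes,
                available,
                self.name(),
                self.used,
                self.pool_size
            ));
        }
        self.used += bytes;
        Ok(())
    }
}

fn main() {
    let mut pool = ToyGreedyPool { pool_size: 10, used: 0 };
    pool.try_grow(10).expect("fits exactly");
    let err = pool.try_grow(1).unwrap_err();
    println!("{err}");
}
```

Keeping the name on the pool itself, rather than on the caller, means any consumer that hits the limit reports the same pool identity without extra plumbing.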
…pache#21749)

## Which issue does this PR close?

- Closes apache#21751.

## Rationale for this change

Profiling the planner suggests that a surprising amount of time is spent on tree rewriting in the logical optimizer. One culprit is `TreeNodeContainer::map_elements()` for `Box<C>` and `Arc<C>`, which does the following:

* Fetch the inner `C` value from the `Box`/`Arc`
* Pass the inner value to the closure
* Wrap the return value of the closure in a newly allocated `Box` / `Arc`, respectively

This allocates a fresh `Box` or `Arc` for every node visited while walking an expression or logical plan, even if the tree rewrite didn't modify the expression/plan node.

Instead, we can reuse the current `Box<C>` or `Arc<C>`: use `std::mem::take()` to swap the inner value with `C::default()`, pass the inner value to the closure, and put the result back in the original container. Swapping the inner value with `C::default()` means the container always has a valid value, which is important if the closure panics. For `Arc<C>`, we need to use `Arc::make_mut()`, which only clones if the `Arc` is not unique.

This reduces the bytes allocated to plan TPC-H Q13 by ~22% (988 kB -> 765 kB), and reduces allocated blocks by 8.5% (210k -> 192k).

## What changes are included in this PR?

* Optimize `Box<C>::map_elements()` and `Arc<C>::map_elements()` as described above
* Change `map_children()` for `Expr::Alias` to use `map_elements()`, rather than invoking `f(*expr)` directly; this ensures that it can take advantage of this optimization
* Make `LogicalPlan::default()` use a shared `DFSchema`, rather than allocating a fresh `DFSchema` for every call. Because `default()` is now in the hot path for tree rewriting, it is important that it is cheap
* Add unit tests for new `map_elements()` behavior
* Add note to migration guide for breaking API change

## Are these changes tested?

Yes, by existing tests plus newly added unit tests.

## Are there any user-facing changes?

Yes: `TreeNodeContainer` impls for `Box<C>` and `Arc<C>` now require `C: Default`. This is a breaking API change for third-party code that implements `TreeNodeContainer` for a custom type. The fix is usually straightforward.
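The in-place technique described in the rationale can be sketched with the standard library alone. The helper names `map_box_in_place` and `map_arc_in_place` are hypothetical, and the closure here returns `C` directly, whereas the real `map_elements()` works with DataFusion's result and transformation types; treat this as an illustration of `std::mem::take()` and `Arc::make_mut()`, not as the PR's code.

```rust
use std::sync::Arc;

// Map the value inside a `Box` while reusing the existing heap allocation.
fn map_box_in_place<C: Default, F: FnOnce(C) -> C>(mut b: Box<C>, f: F) -> Box<C> {
    // `take` moves the value out and leaves `C::default()` behind, so the
    // box holds a valid value even if `f` panics.
    let inner = std::mem::take(&mut *b);
    *b = f(inner);
    b
}

// Same idea for `Arc`: `make_mut` clones only if the `Arc` is shared.
fn map_arc_in_place<C: Default + Clone, F: FnOnce(C) -> C>(mut a: Arc<C>, f: F) -> Arc<C> {
    let slot = Arc::make_mut(&mut a);
    let inner = std::mem::take(slot);
    *slot = f(inner);
    a
}

fn main() {
    // Box: the heap slot is the same before and after the mapping.
    let b = Box::new(String::from("a"));
    let before: *const String = &*b;
    let b = map_box_in_place(b, |s| s + "b");
    let after: *const String = &*b;
    assert_eq!(b.as_str(), "ab");
    assert_eq!(before, after);

    // Arc: a uniquely owned value is updated in place.
    let unique = map_arc_in_place(Arc::new(vec![1]), |mut v| {
        v.push(2);
        v
    });
    assert_eq!(*unique, vec![1, 2]);

    // Arc: a shared value is cloned first, so other handles are unaffected.
    let shared = Arc::new(1i32);
    let other = Arc::clone(&shared);
    let mapped = map_arc_in_place(shared, |x| x + 1);
    assert_eq!((*mapped, *other), (2, 1));
}
```

The `C: Default` bound in both helpers mirrors the new requirement mentioned under user-facing changes: `std::mem::take()` needs a placeholder value to leave in the container.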
…nts (apache#20904)

## Which issue does this PR close?

Does not close, but is part of apache#20766.

## Rationale for this change

Details are in apache#20766. The main idea is to use existing distinct count information to optimize joins, similar to how Spark and Trino do.

## What changes are included in this PR?

This PR extends cardinality estimation for semi/anti joins using distinct counts.

## Are these changes tested?

I've added test cases, but I'm not sure whether I should also have added benchmarks for this.

## Are there any user-facing changes?

No

---------

Co-authored-by: Alessandro Solimando <alessandro.solimando@gmail.com>
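The PR description does not state the formula it uses, so the following is only a generic sketch of how distinct counts can feed a semi/anti join estimate. It assumes uniformly distributed keys and that the smaller key set is contained in the larger one; the function names, parameters, and fallback behavior are all hypothetical and are not taken from DataFusion.

```rust
// Hypothetical estimator: the fraction of outer rows that find a match is
// approximated by min(inner_ndv, outer_ndv) / outer_ndv.
fn estimate_semi_join_rows(outer_rows: u64, outer_ndv: u64, inner_ndv: u64) -> u64 {
    if outer_ndv == 0 {
        // No usable distinct-count statistic: fall back to the upper bound.
        return outer_rows;
    }
    let matched = inner_ndv.min(outer_ndv) as f64 / outer_ndv as f64;
    (outer_rows as f64 * matched).round() as u64
}

// An anti join keeps the outer rows that the semi join would drop.
fn estimate_anti_join_rows(outer_rows: u64, outer_ndv: u64, inner_ndv: u64) -> u64 {
    outer_rows.saturating_sub(estimate_semi_join_rows(outer_rows, outer_ndv, inner_ndv))
}

fn main() {
    // 1000 outer rows over 100 distinct keys, 25 distinct keys on the inner side.
    let semi = estimate_semi_join_rows(1000, 100, 25);
    let anti = estimate_anti_join_rows(1000, 100, 25);
    println!("semi: {semi}, anti: {anti}");
}
```

Without distinct counts, a planner can only bound a semi join by the outer row count; the ratio of distinct keys is what lets the estimate drop below that bound.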
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 8.0.0 to 8.1.0. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's releases</a>.</em></p> <blockquote> <h2>v8.1.0 🌈 New input <code>no-project</code></h2> <h2>Changes</h2> <p>This add the a new boolean input <code>no-project</code>. It only makes sense to use in combination with <code>activate-environment: true</code> and will append <code>--no project</code> to the <code>uv venv</code> call. This is for example useful <a href="https://redirect.github.com/astral-sh/setup-uv/issues/854">if you have a pyproject.toml file with parts unparseable by uv</a></p> <h2>🚀 Enhancements</h2> <ul> <li>Add input no-project in combination with activate-environment <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/856">#856</a>)</li> </ul> <h2>🧰 Maintenance</h2> <ul> <li>fix: grant contents:write to validate-release job <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/860">#860</a>)</li> <li>Add a release-gate step to the release workflow <a href="https://github.com/zanieb"><code>@zanieb</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/859">#859</a>)</li> <li>Draft commitish releases <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/858">#858</a>)</li> <li>Add action-types.yml to instructions <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/857">#857</a>)</li> <li>chore: update known checksums for 0.11.7 @<a href="https://github.com/apps/github-actions">github-actions[bot]</a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/853">#853</a>)</li> <li>Refactor version 
resolving <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/852">#852</a>)</li> <li>chore: update known checksums for 0.11.6 @<a href="https://github.com/apps/github-actions">github-actions[bot]</a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/850">#850</a>)</li> <li>chore: update known checksums for 0.11.5 @<a href="https://github.com/apps/github-actions">github-actions[bot]</a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/845">#845</a>)</li> <li>chore: update known checksums for 0.11.4 @<a href="https://github.com/apps/github-actions">github-actions[bot]</a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/843">#843</a>)</li> <li>Add a release workflow <a href="https://github.com/zanieb"><code>@zanieb</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/839">#839</a>)</li> <li>chore: update known checksums for 0.11.3 @<a href="https://github.com/apps/github-actions">github-actions[bot]</a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/836">#836</a>)</li> </ul> <h2>📚 Documentation</h2> <ul> <li>Update ignore-nothing-to-cache documentation <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/833">#833</a>)</li> <li>Pin setup-uv docs to v8 <a href="https://github.com/eifinger"><code>@eifinger</code></a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/829">#829</a>)</li> </ul> <h2>⬆️ Dependency updates</h2> <ul> <li>chore(deps): bump release-drafter/release-drafter from 7.1.1 to 7.2.0 @<a href="https://github.com/apps/dependabot">dependabot[bot]</a> (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/855">#855</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a 
href="https://github.com/astral-sh/setup-uv/commit/08807647e7069bb48b6ef5acd8ec9567f424441b"><code>0880764</code></a> fix: grant contents:write to validate-release job (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/860">#860</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/717d6aba0f15312f509f5c4999e34d71ecbab8a9"><code>717d6ab</code></a> Add a release-gate step to the release workflow (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/859">#859</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/5a911eb3a3983b5e650f2dad95c1ce698ca94378"><code>5a911eb</code></a> Draft commitish releases (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/858">#858</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/080c31e04cd7155b0ca676d08c7bc260a4476a23"><code>080c31e</code></a> Add action-types.yml to instructions (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/857">#857</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/b3e97d2ba1a1eed7e9d1f8456dd06c3b725bc3a6"><code>b3e97d2</code></a> Add input no-project in combination with activate-environment (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/856">#856</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/7dd591db9557f680290587fcc578372813b9ff64"><code>7dd591d</code></a> chore(deps): bump release-drafter/release-drafter from 7.1.1 to 7.2.0 (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/855">#855</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/1541b7762698877904805605192ecd63d0e4787a"><code>1541b77</code></a> chore: update known checksums for 0.11.7 (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/853">#853</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/cdfb2ee6dde255817c739680168ad81e184c4bfb"><code>cdfb2ee</code></a> Refactor version resolving (<a 
href="https://redirect.github.com/astral-sh/setup-uv/issues/852">#852</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/cb84d12dc6a0d495b82fcae14fa4559b90698660"><code>cb84d12</code></a> chore: update known checksums for 0.11.6 (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/850">#850</a>)</li> <li><a href="https://github.com/astral-sh/setup-uv/commit/1912cc65f2e839707d7a16f2372f30b57d35fd80"><code>1912cc6</code></a> chore: update known checksums for 0.11.5 (<a href="https://redirect.github.com/astral-sh/setup-uv/issues/845">#845</a>)</li> <li>Additional commits viewable in <a href="https://github.com/astral-sh/setup-uv/compare/cec208311dfd045dd5311c1add060b2062131d57...08807647e7069bb48b6ef5acd8ec9567f424441b">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
[//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Which issue does this PR close?

- Closes #.

## Rationale for this change

One test case in the `datafusion-cli` crate fails locally if you run all tests through `cargo nextest run`, but passes under `cargo test`:

```
FAIL [ 0.375s] datafusion-cli::cli_integration cli_explain_environment_overrides
```

The reason is that `nextest` triggers a different build graph, which enforces a feature flag in the `serde_json` dependency. This PR enforces this feature in the `dev-dependencies` of the `datafusion-cli` crate, so the test becomes deterministic across different test setups.

apache#21502 fixed a similar issue, and also explains why the feature is not enabled in the global dependencies inside `Cargo.toml`.

## What changes are included in this PR?

## Are these changes tested?

## Are there any user-facing changes?
…argo-deps group (apache#21760) Bumps the all-other-cargo-deps group with 1 update: [aws-config](https://github.com/smithy-lang/smithy-rs). Updates `aws-config` from 1.8.15 to 1.8.16 <details> <summary>Commits</summary> <ul> <li>See full diff in <a href="https://github.com/smithy-lang/smithy-rs/commits">compare view</a></li> </ul> </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…21758) Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.1 to 4.35.2. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/github/codeql-action/releases">github/codeql-action's releases</a>.</em></p> <blockquote> <h2>v4.35.2</h2> <ul> <li>The undocumented TRAP cache cleanup feature that could be enabled using the <code>CODEQL_ACTION_CLEANUP_TRAP_CACHES</code> environment variable is deprecated and will be removed in May 2026. If you are affected by this, we recommend disabling TRAP caching by passing the <code>trap-caching: false</code> input to the <code>init</code> Action. <a href="https://redirect.github.com/github/codeql-action/pull/3795">#3795</a></li> <li>The Git version 2.36.0 requirement for improved incremental analysis now only applies to repositories that contain submodules. <a href="https://redirect.github.com/github/codeql-action/pull/3789">#3789</a></li> <li>Python analysis on GHES no longer extracts the standard library, relying instead on models of the standard library. This should result in significantly faster extraction and analysis times, while the effect on alerts should be minimal. <a href="https://redirect.github.com/github/codeql-action/pull/3794">#3794</a></li> <li>Fixed a bug in the validation of OIDC configurations for private registries that was added in CodeQL Action 4.33.0 / 3.33.0. <a href="https://redirect.github.com/github/codeql-action/pull/3807">#3807</a></li> <li>Update default CodeQL bundle version to <a href="https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.25.2">2.25.2</a>. 
<a href="https://redirect.github.com/github/codeql-action/pull/3823">#3823</a></li> </ul> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/github/codeql-action/blob/main/CHANGELOG.md">github/codeql-action's changelog</a>.</em></p> <blockquote> <h1>CodeQL Action Changelog</h1> <p>See the <a href="https://github.com/github/codeql-action/releases">releases page</a> for the relevant changes to the CodeQL CLI and language packs.</p> <h2>[UNRELEASED]</h2> <p>No user facing changes.</p> <h2>4.35.2 - 15 Apr 2026</h2> <ul> <li>The undocumented TRAP cache cleanup feature that could be enabled using the <code>CODEQL_ACTION_CLEANUP_TRAP_CACHES</code> environment variable is deprecated and will be removed in May 2026. If you are affected by this, we recommend disabling TRAP caching by passing the <code>trap-caching: false</code> input to the <code>init</code> Action. <a href="https://redirect.github.com/github/codeql-action/pull/3795">#3795</a></li> <li>The Git version 2.36.0 requirement for improved incremental analysis now only applies to repositories that contain submodules. <a href="https://redirect.github.com/github/codeql-action/pull/3789">#3789</a></li> <li>Python analysis on GHES no longer extracts the standard library, relying instead on models of the standard library. This should result in significantly faster extraction and analysis times, while the effect on alerts should be minimal. <a href="https://redirect.github.com/github/codeql-action/pull/3794">#3794</a></li> <li>Fixed a bug in the validation of OIDC configurations for private registries that was added in CodeQL Action 4.33.0 / 3.33.0. <a href="https://redirect.github.com/github/codeql-action/pull/3807">#3807</a></li> <li>Update default CodeQL bundle version to <a href="https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.25.2">2.25.2</a>. 
<a href="https://redirect.github.com/github/codeql-action/pull/3823">#3823</a></li> </ul> <h2>4.35.1 - 27 Mar 2026</h2> <ul> <li>Fix incorrect minimum required Git version for <a href="https://redirect.github.com/github/roadmap/issues/1158">improved incremental analysis</a>: it should have been 2.36.0, not 2.11.0. <a href="https://redirect.github.com/github/codeql-action/pull/3781">#3781</a></li> </ul> <h2>4.35.0 - 27 Mar 2026</h2> <ul> <li>Reduced the minimum Git version required for <a href="https://redirect.github.com/github/roadmap/issues/1158">improved incremental analysis</a> from 2.38.0 to 2.11.0. <a href="https://redirect.github.com/github/codeql-action/pull/3767">#3767</a></li> <li>Update default CodeQL bundle version to <a href="https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.25.1">2.25.1</a>. <a href="https://redirect.github.com/github/codeql-action/pull/3773">#3773</a></li> </ul> <h2>4.34.1 - 20 Mar 2026</h2> <ul> <li>Downgrade default CodeQL bundle version to <a href="https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.24.3">2.24.3</a> due to issues with a small percentage of Actions and JavaScript analyses. <a href="https://redirect.github.com/github/codeql-action/pull/3762">#3762</a></li> </ul> <h2>4.34.0 - 20 Mar 2026</h2> <ul> <li>Added an experimental change which disables TRAP caching when <a href="https://redirect.github.com/github/roadmap/issues/1158">improved incremental analysis</a> is enabled, since improved incremental analysis supersedes TRAP caching. This will improve performance and reduce Actions cache usage. We expect to roll this change out to everyone in March. <a href="https://redirect.github.com/github/codeql-action/pull/3569">#3569</a></li> <li>We are rolling out improved incremental analysis to C/C++ analyses that use build mode <code>none</code>. We expect this rollout to be complete by the end of April 2026. 
<a href="https://redirect.github.com/github/codeql-action/pull/3584">#3584</a></li> <li>Update default CodeQL bundle version to <a href="https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.25.0">2.25.0</a>. <a href="https://redirect.github.com/github/codeql-action/pull/3585">#3585</a></li> </ul> <h2>4.33.0 - 16 Mar 2026</h2> <ul> <li> <p>Upcoming change: Starting April 2026, the CodeQL Action will skip collecting file coverage information on pull requests to improve analysis performance. File coverage information will still be computed on non-PR analyses. Pull request analyses will log a warning about this upcoming change. <a href="https://redirect.github.com/github/codeql-action/pull/3562">#3562</a></p> <p>To opt out of this change:</p> <ul> <li><strong>Repositories owned by an organization:</strong> Create a custom repository property with the name <code>github-codeql-file-coverage-on-prs</code> and the type "True/false", then set this property to <code>true</code> in the repository's settings. For more information, see <a href="https://docs.github.com/en/organizations/managing-organization-settings/managing-custom-properties-for-repositories-in-your-organization">Managing custom properties for repositories in your organization</a>. 
Alternatively, if you are using an advanced setup workflow, you can set the <code>CODEQL_ACTION_FILE_COVERAGE_ON_PRS</code> environment variable to <code>true</code> in your workflow.</li> <li><strong>User-owned repositories using default setup:</strong> Switch to an advanced setup workflow and set the <code>CODEQL_ACTION_FILE_COVERAGE_ON_PRS</code> environment variable to <code>true</code> in your workflow.</li> <li><strong>User-owned repositories using advanced setup:</strong> Set the <code>CODEQL_ACTION_FILE_COVERAGE_ON_PRS</code> environment variable to <code>true</code> in your workflow.</li> </ul> </li> <li> <p>Fixed <a href="https://redirect.github.com/github/codeql-action/issues/3555">a bug</a> which caused the CodeQL Action to fail loading repository properties if a "Multi select" repository property was configured for the repository. <a href="https://redirect.github.com/github/codeql-action/pull/3557">#3557</a></p> </li> <li> <p>The CodeQL Action now loads <a href="https://docs.github.com/en/organizations/managing-organization-settings/managing-custom-properties-for-repositories-in-your-organization">custom repository properties</a> on GitHub Enterprise Server, enabling the customization of features such as <code>github-codeql-disable-overlay</code> that was previously only available on GitHub.com. <a href="https://redirect.github.com/github/codeql-action/pull/3559">#3559</a></p> </li> <li> <p>Once <a href="https://docs.github.com/en/code-security/how-tos/secure-at-scale/configure-organization-security/manage-usage-and-access/giving-org-access-private-registries">private package registries</a> can be configured with OIDC-based authentication for organizations, the CodeQL Action will now be able to accept such configurations. <a href="https://redirect.github.com/github/codeql-action/pull/3563">#3563</a></p> </li> <li> <p>Fixed the retry mechanism for database uploads. 
Previously this would fail with the error "Response body object should not be disturbed or locked". <a href="https://redirect.github.com/github/codeql-action/pull/3564">#3564</a></p> </li> <li> <p>A warning is now emitted if the CodeQL Action detects a repository property whose name suggests that it relates to the CodeQL Action, but which is not one of the properties recognised by the current version of the CodeQL Action. <a href="https://redirect.github.com/github/codeql-action/pull/3570">#3570</a></p> </li> </ul> <h2>4.32.6 - 05 Mar 2026</h2> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/github/codeql-action/commit/95e58e9a2cdfd71adc6e0353d5c52f41a045d225"><code>95e58e9</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3824">#3824</a> from github/update-v4.35.2-d2e135a73</li> <li><a href="https://github.com/github/codeql-action/commit/6f31bfe060e817d81e938dbec767969d20031e25"><code>6f31bfe</code></a> Update changelog for v4.35.2</li> <li><a href="https://github.com/github/codeql-action/commit/d2e135a73a39154e3a231aeb49163c4661c5b8b1"><code>d2e135a</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3823">#3823</a> from github/update-bundle/codeql-bundle-v2.25.2</li> <li><a href="https://github.com/github/codeql-action/commit/60abb65df09fcf213c398e064c8a80db1f15cdaf"><code>60abb65</code></a> Add changelog note</li> <li><a href="https://github.com/github/codeql-action/commit/5a0a562209255e956ad8aafcee303294e64eefa2"><code>5a0a562</code></a> Update default bundle to codeql-bundle-v2.25.2</li> <li><a href="https://github.com/github/codeql-action/commit/65216971a11ded447a6b76263d5a144519e5eee1"><code>6521697</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3820">#3820</a> from github/dependabot/github_actions/dot-github/wor...</li> 
<li><a href="https://github.com/github/codeql-action/commit/3c45af2dd258e1623af1898da5c86545b514e028"><code>3c45af2</code></a> Merge pull request <a href="https://redirect.github.com/github/codeql-action/issues/3821">#3821</a> from github/dependabot/npm_and_yarn/npm-minor-345b93...</li> <li><a href="https://github.com/github/codeql-action/commit/f1c339364c12f922998186ed897e45e3b4ae8874"><code>f1c3393</code></a> Rebuild</li> <li><a href="https://github.com/github/codeql-action/commit/1024fc496c87e944a93e98d8cf2c09e2c7602a30"><code>1024fc4</code></a> Rebuild</li> <li><a href="https://github.com/github/codeql-action/commit/9dd4cfed96030ccdfe1af4daf7a7964322704fed"><code>9dd4cfe</code></a> Bump the npm-minor group across 1 directory with 6 updates</li> <li>Additional commits viewable in <a href="https://github.com/github/codeql-action/compare/c10b8064de6f491fea524254123dbe5e09572f13...95e58e9a2cdfd71adc6e0353d5c52f41a045d225">compare view</a></li> </ul> </details> <br /> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. 
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…che#21757) Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.75.10 to 2.75.18. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/taiki-e/install-action/releases">taiki-e/install-action's releases</a>.</em></p> <blockquote> <h2>2.75.18</h2> <ul> <li> <p>Update <code>vacuum@latest</code> to 0.26.1.</p> </li> <li> <p>Update <code>wasm-tools@latest</code> to 1.247.0.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.16.</p> </li> <li> <p>Update <code>espup@latest</code> to 0.17.1.</p> </li> <li> <p>Update <code>trivy@latest</code> to 0.70.0.</p> </li> </ul> <h2>2.75.17</h2> <ul> <li> <p>Update <code>tombi@latest</code> to 0.9.18.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.15.</p> </li> </ul> <h2>2.75.16</h2> <ul> <li> <p>Update <code>uv@latest</code> to 0.11.7.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.14.</p> </li> <li> <p>Update <code>vacuum@latest</code> to 0.25.9.</p> </li> <li> <p>Update <code>cargo-machete@latest</code> to 0.9.2.</p> </li> <li> <p>Update <code>cargo-deny@latest</code> to 0.19.4.</p> </li> </ul> <h2>2.75.15</h2> <ul> <li> <p>Update <code>cargo-nextest@latest</code> to 0.9.133.</p> </li> <li> <p>Update <code>biome@latest</code> to 2.4.12.</p> </li> </ul> <h2>2.75.14</h2> <ul> <li> <p>Implement potential workaround for <a href="https://redirect.github.com/actions/partner-runner-images/issues/169">windows-11-arm runner bug</a> which sometimes causes installation failure.</p> <p>The issue where this bug affected the startup of bash was addressed in 2.71.2, but we received a report that the <a href="https://redirect.github.com/taiki-e/install-action/pull/1657#issuecomment-4252717651">same problem seems to occur when starting other commands as well</a>.</p> </li> <li> <p>Update <code>cargo-deny@latest</code> to 0.19.2.</p> </li> </ul> <h2>2.75.13</h2> <ul> <li>Update <code>zizmor@latest</code> to 1.24.1.</li> </ul> <h2>2.75.12</h2> <ul> 
<li> <p>Update <code>typos@latest</code> to 1.45.1.</p> </li> <li> <p>Update <code>cargo-xwin@latest</code> to 0.21.5.</p> </li> <li> <p>Update <code>cargo-binstall@latest</code> to 1.18.1.</p> </li> </ul> <h2>2.75.11</h2> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md">taiki-e/install-action's changelog</a>.</em></p> <blockquote> <h1>Changelog</h1> <p>All notable changes to this project will be documented in this file.</p> <p>This project adheres to <a href="https://semver.org">Semantic Versioning</a>.</p> <!-- raw HTML omitted --> <h2>[Unreleased]</h2> <ul> <li> <p>Update <code>tombi@latest</code> to 0.9.20.</p> </li> <li> <p>Update <code>martin@latest</code> to 1.6.0.</p> </li> <li> <p>Update <code>just@latest</code> to 1.50.0.</p> </li> <li> <p>Update <code>tombi@latest</code> to 0.9.19.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.18.</p> </li> <li> <p>Update <code>rclone@latest</code> to 1.73.5.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.17.</p> </li> </ul> <h2>[2.75.18] - 2026-04-19</h2> <ul> <li> <p>Update <code>vacuum@latest</code> to 0.26.1.</p> </li> <li> <p>Update <code>wasm-tools@latest</code> to 1.247.0.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.16.</p> </li> <li> <p>Update <code>espup@latest</code> to 0.17.1.</p> </li> <li> <p>Update <code>trivy@latest</code> to 0.70.0.</p> </li> </ul> <h2>[2.75.17] - 2026-04-17</h2> <ul> <li> <p>Update <code>tombi@latest</code> to 0.9.18.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.15.</p> </li> </ul> <h2>[2.75.16] - 2026-04-17</h2> <ul> <li> <p>Update <code>uv@latest</code> to 0.11.7.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.14.</p> </li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/taiki-e/install-action/commit/055f5df8c3f65ea01cd41e9dc855becd88953486"><code>055f5df</code></a> Release 2.75.18</li> <li><a href="https://github.com/taiki-e/install-action/commit/eabf60349346950549ed65f6beb018b4680f7968"><code>eabf603</code></a> Add note about unset</li> <li><a href="https://github.com/taiki-e/install-action/commit/4637b48a5ac188fd1395ec47093a2f53f6e1a2b3"><code>4637b48</code></a> Early handle inputs</li> <li><a href="https://github.com/taiki-e/install-action/commit/7a6306ece23f52d1c9356f8fe0d0dd0f791c7825"><code>7a6306e</code></a> Update <code>vacuum@latest</code> to 0.26.1</li> <li><a href="https://github.com/taiki-e/install-action/commit/cb13f5ef5263e03d2a7c5675b24ba8374dab72b4"><code>cb13f5e</code></a> Update mise manifest</li> <li><a href="https://github.com/taiki-e/install-action/commit/18cc1a4fb7bd8a9c7c6fc69fda6c5b6b6c477b3c"><code>18cc1a4</code></a> Update <code>wasm-tools@latest</code> to 1.247.0</li> <li><a href="https://github.com/taiki-e/install-action/commit/c7b05077fec4d0c69ebf2b84456491ae0e31295d"><code>c7b0507</code></a> Update <code>mise@latest</code> to 2026.4.16</li> <li><a href="https://github.com/taiki-e/install-action/commit/0ef4e7650f60cd0dce197648e865d433e0a15151"><code>0ef4e76</code></a> Update <code>espup@latest</code> to 0.17.1</li> <li><a href="https://github.com/taiki-e/install-action/commit/56ec35f1c0ea059ed79d67351a8376410b7a3c87"><code>56ec35f</code></a> Update <code>trivy@latest</code> to 0.70.0</li> <li><a href="https://github.com/taiki-e/install-action/commit/6874db14a159fb7865d830a7d60c4414d45c4031"><code>6874db1</code></a> Update vacuum manifest</li> <li>Additional commits viewable in <a href="https://github.com/taiki-e/install-action/compare/85b24a67ef0c632dfefad70b9d5ce8fddb040754...055f5df8c3f65ea01cd41e9dc855becd88953486">compare view</a></li> </ul> </details> <br /> 
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Force-pushed from 311a854 to 5b0a69a.
Each Parquet file previously produced a single morsel containing one `ParquetPushDecoder` over the full pruned `ParquetAccessPlan`. Morselize at row-group granularity instead: after all pruning work is done, pack surviving row groups into chunks bounded by a per-morsel row budget and compressed-byte budget (defaults: 100k rows, 64 MiB). Each chunk becomes its own stream so the executor can interleave row-group decode work with other operators and, in a follow-up, let sibling `FileStream`s steal row-group-sized units of work across partitions.

A single oversized row group still becomes its own morsel; no sub-row-group splitting is introduced.

`EarlyStoppingStream` (which is driven by the non-Clone `FilePruner`) is attached only to the first morsel's stream so the whole file can still short-circuit on dynamic-filter narrowing.

Row-group reversal is applied per-chunk on the `PreparedAccessPlan` and the chunk list is reversed so reverse output order is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
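The packing rule in this commit message can be illustrated with a minimal, std-only sketch. It is not the PR's `ParquetAccessPlan::split_into_chunks`: `RowGroupInfo`, `pack_row_groups`, and the exact boundary rule (close the open chunk when adding the next row group would exceed either budget) are assumptions made for illustration, and skipped row groups are simply omitted from the output rather than carried inside a chunk.

```rust
/// Hypothetical stand-in for the per-row-group metadata the packer needs.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupInfo {
    pub num_rows: usize,
    pub compressed_bytes: usize,
    /// `false` models a row group that pruning already marked as skipped.
    pub scan: bool,
}

/// Greedily pack consecutive surviving row groups into chunks.
/// Returns the row-group indices that belong to each chunk.
pub fn pack_row_groups(
    groups: &[RowGroupInfo],
    max_rows: usize,
    max_bytes: usize,
) -> Vec<Vec<usize>> {
    let mut chunks = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let (mut rows, mut bytes) = (0usize, 0usize);

    for (idx, rg) in groups.iter().enumerate() {
        if !rg.scan {
            // Skipped row groups never force a chunk boundary.
            continue;
        }
        let over_budget = rows + rg.num_rows > max_rows
            || bytes + rg.compressed_bytes > max_bytes;
        // Close the open chunk only if it already holds something, so a
        // single oversized row group still ends up as its own chunk.
        if over_budget && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            rows = 0;
            bytes = 0;
        }
        current.push(idx);
        rows += rg.num_rows;
        bytes += rg.compressed_bytes;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Under this assumed rule, three 3-row row groups with a row budget of 3 yield three chunks, and a budget of 6 packs the first two together, which matches the behavior the PR's tests describe.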
Force-pushed from 5b0a69a to ff805cf.
The previous `build_stream` built every morsel's `RowFilter`, `ParquetPushDecoder`, `AsyncFileReader`, and `Projector` eagerly in a single loop inside the file planner, before any morsel was scheduled. That loop ran on the scheduler thread and was visible as a 10–15% regression vs. main on ClickBench-partitioned queries that have many row-group morsels per file (e.g. Q15, Q16 at pushdown=off).

Replace `ParquetStreamMorsel` (which held a pre-built `BoxStream`) with `ParquetLazyMorsel`, which holds only the per-chunk `ParquetAccessPlan` plus an `Arc<LazyMorselShared>` of the file-level state. The decoder and reader are constructed inside `Morsel::into_stream`, so each morsel pays its setup cost only when the scheduler actually picks it up, and the work is distributed across worker threads instead of serialised on the planner.

`FilePruner` is `!Clone` and drives whole-file early-stop via `EarlyStoppingStream`, so it still lives on chunk 0's morsel only.

The warm `async_file_reader` from metadata / page-index / bloom-filter load is dropped at the end of `build_stream`; every morsel mints a fresh reader via the factory at `into_stream` time. For both built-in factories (`DefaultParquetFileReaderFactory`, `CachedParquetFileReaderFactory`) the "warm cache" benefit of reusing a reader is negligible because the underlying `Arc<dyn ObjectStore>` / `Arc<dyn FileMetadataCache>` is already shared across readers, so the simplification is free.

Local ClickBench-partitioned, 10 iterations, pushdown=off (M-series), times in ms:

| Query | main | eager (before) | lazy (this commit) |
|-------|-----:|---------------:|-------------------:|
| Q14   |  325 |            335 |                313 |
| Q15   |  309 |            358 |                302 |
| Q16   |  911 |           1049 |                786 |
| Q24   |   48 |             55 |                 56 |
| Q26   |   41 |             45 |                 45 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
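The eager-versus-lazy distinction in this commit message can be sketched with plain standard-library types. Everything here is a hypothetical stand-in: `SharedFileState` plays the role of `LazyMorselShared`, `Decoder` stands for the expensive per-morsel setup, and the `decoders_built` counter exists only to make the deferral observable. It does not reproduce DataFusion's `Morsel` trait.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// File-level state shared by every morsel of one file (hypothetical).
pub struct SharedFileState {
    pub path: String,
    /// Counts decoder constructions so the laziness can be observed.
    pub decoders_built: AtomicUsize,
}

/// Stand-in for an expensive per-morsel decoder.
pub struct Decoder {
    pub row_groups: Vec<usize>,
}

/// Holds only the chunk's row groups plus a handle to the shared state.
pub struct LazyMorsel {
    row_groups: Vec<usize>,
    shared: Arc<SharedFileState>,
}

impl LazyMorsel {
    /// Cheap: nothing expensive is built at planning time.
    pub fn new(row_groups: Vec<usize>, shared: Arc<SharedFileState>) -> Self {
        Self { row_groups, shared }
    }

    /// The setup cost is paid here, by whichever worker picks the morsel up.
    pub fn into_stream(self) -> impl Iterator<Item = usize> {
        self.shared.decoders_built.fetch_add(1, Ordering::Relaxed);
        let decoder = Decoder { row_groups: self.row_groups };
        decoder.row_groups.into_iter()
    }
}

/// Planning only wraps each chunk; it never touches a decoder.
pub fn plan(chunks: Vec<Vec<usize>>, shared: &Arc<SharedFileState>) -> Vec<LazyMorsel> {
    chunks
        .into_iter()
        .map(|chunk| LazyMorsel::new(chunk, Arc::clone(shared)))
        .collect()
}
```

In this sketch, calling `plan` leaves `decoders_built` at zero; the counter only advances when a morsel's `into_stream` is called, which is the property the commit relies on to move setup work off the planner.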
## Which issue does this PR close?

## Rationale for this change

Follow-up to PR apache#21351 (Dynamic work scheduling in FileStream), which explicitly called out "Splitting files into smaller units (e.g. across row groups)" as a deferred next step. Today each Parquet file becomes exactly one morsel: a single `ParquetPushDecoder` over the entire pruned `ParquetAccessPlan`. That coarse granularity:

- […] `FileStream`s" effort blocked: there's no row-group-sized unit for the `SharedWorkSource` to hand out;
- […] `LIMIT` and dynamic-filter early-stop wait for whole-file granularity.

The `Morselizer` / `MorselPlanner` abstraction already supports returning `Vec<Box<dyn Morsel>>` from a single `MorselPlan`, and `ScanState` already drains multi-morsel plans FIFO. This PR takes advantage of that: `RowGroupsPrunedParquetOpen::build_stream` now emits N streams, one per row-group chunk.

## What changes are included in this PR?
- `ParquetAccessPlan::split_into_chunks` (new, `pub(crate)`). Packs consecutive surviving row groups into chunks bounded by a row budget (default 100k rows) and a compressed-byte budget (default 64 MiB). A single oversized row group still becomes its own chunk; no sub-row-group splitting is introduced in this PR. `Skip` entries are carried by the currently open chunk without forcing a boundary.
- `RowGroupsPrunedParquetOpen::build_stream` now returns `Vec<BoxStream>` instead of a single stream. Per chunk: prepare the access plan slice, build a fresh `ParquetPushDecoder`, mint a fresh `AsyncFileReader` via `ParquetFileReaderFactory::create_reader` (the first chunk reuses the reader that loaded metadata / page index / bloom filters so its warm cache state is preserved). Row filter is rebuilt per chunk because `RowFilter` is not `Clone`.
- `ParquetOpenState::Ready` now holds `Vec<BoxStream>`. `ParquetMorselPlanner::plan` maps each stream into a `ParquetStreamMorsel`. Empty `Ready` (file fully pruned) terminates the planner via `Ok(None)` instead of emitting an empty morsel plan.
- `EarlyStoppingStream` attaches to the first chunk only. `FilePruner` is `!Clone` and stateful; keeping it on chunk 0 preserves whole-file early-stop on dynamic-filter narrowing.
- Row-group reversal is applied per chunk via `PreparedAccessPlan::reverse`, and the chunks `Vec` is reversed so the first emitted morsel corresponds to what was originally the file's last row groups.
- […] `ParquetMorselizer`, defaulting to the new module-level constants `DEFAULT_MORSEL_MAX_ROWS` / `DEFAULT_MORSEL_MAX_COMPRESSED_BYTES`. Wiring through `ParquetSource` for user configuration is deliberately deferred to a follow-up; this PR keeps the public surface unchanged.

## Are these changes tested?
Yes.

- Tests for `ParquetAccessPlan::split_into_chunks` cover: empty plan, all-skip, one-chunk-per-row-group when budget is tight, packing when budget allows, oversized single row group, byte-bounded splits, `Skip` preserved inside a chunk, `Skip` between chunks, `Selection` preserved verbatim in its chunk.
- Tests in `opener.rs`:
  - `test_row_group_split_produces_multiple_morsels`: 3 row groups × 3 rows with a row budget of 3 → 3 morsels; concatenated output matches the single-morsel reference.
  - `test_row_group_split_packs_within_budget`: budget of 6 rows packs 2 row groups into the first morsel, leaving the third in its own.
  - `test_row_group_split_honors_user_skip`: a user-supplied `ParquetAccessPlan` with `Skip` in the middle round-trips correctly.
  - `test_row_group_split_with_reverse`: `reverse_row_groups=true` + split budget emits morsels with the originally-last row group first.
- The `datasource-parquet` library tests still pass (including the full `test_reverse_scan_*` suite).
- `datafusion` core lib (403) and core integration (931) also pass unchanged.

## Are there any user-facing changes?
No observable semantic change: scans still produce the same rows in the same order. Runtime behavior changes: a multi-row-group Parquet scan now produces multiple morsels per file, each with its own `ParquetPushDecoder` + `AsyncFileReader`. For very small row groups this introduces per-morsel setup overhead that's amortized by the packing budget; for queries that hit LIMIT or dynamic-filter prunes mid-file, it generally lets the scan stop sooner.

🤖 Generated with Claude Code
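The reverse-scan handling in the change list (reverse inside each chunk, then reverse the chunk list) can be checked with a small sketch. `reverse_chunks` is a hypothetical helper operating on plain row-group indices, not the PR's `PreparedAccessPlan::reverse`.

```rust
/// Reverse the scan order of chunked row groups: first reverse the row
/// groups inside each chunk, then reverse the order of the chunks.
pub fn reverse_chunks(mut chunks: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    for chunk in &mut chunks {
        chunk.reverse();
    }
    chunks.reverse();
    chunks
}
```

Both steps are needed: flattening the result of `reverse_chunks(vec![vec![0, 1], vec![2]])` gives row groups 2, 1, 0, whereas reversing only the chunk list would give 2, 0, 1.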