Adaptive filter scheduling + row-group morsel split #9
run benchmarks

    baseline:
      ref: main
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false
    changed:
      ref: HEAD
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
🤖 Benchmark running (GKE): comparing HEAD (3c51143) to main using clickbench_partitioned
🤖 Benchmark running (GKE): comparing HEAD (3c51143) to main using tpch
🤖 Benchmark running (GKE): comparing HEAD (3c51143) to main using tpcds
🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs branch
🤖 Benchmark completed (GKE): tpcds — base (merge-base) vs branch
The remaining benchmark for this request hit the 7200s job deadline before finishing.
Each Parquet file previously produced a single morsel containing one `ParquetPushDecoder` over the full pruned `ParquetAccessPlan`. Morselize at row-group granularity instead: after all pruning work is done, pack surviving row groups into chunks bounded by a per-morsel row budget and a compressed-byte budget (defaults: 100k rows, 64 MiB). Each chunk becomes its own stream, so the executor can interleave row-group decode work with other operators and — in a follow-up — let sibling `FileStream`s steal row-group-sized units of work across partitions.

A single oversized row group still becomes its own morsel; no sub-row-group splitting is introduced. `EarlyStoppingStream` (which is driven by the non-`Clone` `FilePruner`) is attached only to the first morsel's stream, so the whole file can still short-circuit on dynamic-filter narrowing. Row-group reversal is applied per-chunk on the `PreparedAccessPlan`, and the chunk list is reversed so reverse output order is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
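The packing rule described above can be sketched as follows — a minimal, hypothetical model, not the actual DataFusion code (`RowGroupMeta` and `split_into_chunks` are illustrative names; the real planner works on `ParquetAccessPlan`):

```rust
/// Illustrative row-group metadata; the real planner reads these from the
/// Parquet file metadata after pruning.
#[derive(Debug)]
struct RowGroupMeta {
    index: usize, // row-group ordinal within the file
    rows: u64,
    compressed_bytes: u64,
}

/// Pack surviving row groups, in order, into chunks bounded by a row budget
/// (default 100k) and a compressed-byte budget (default 64 MiB). An
/// oversized row group becomes its own chunk; no sub-row-group splitting.
fn split_into_chunks(
    row_groups: &[RowGroupMeta],
    row_budget: u64,
    byte_budget: u64,
) -> Vec<Vec<usize>> {
    let mut chunks = Vec::new();
    let (mut cur, mut rows, mut bytes) = (Vec::new(), 0u64, 0u64);
    for rg in row_groups {
        // Close the current chunk if adding this row group would overshoot
        // either budget (but never emit an empty chunk).
        if !cur.is_empty()
            && (rows + rg.rows > row_budget || bytes + rg.compressed_bytes > byte_budget)
        {
            chunks.push(std::mem::take(&mut cur));
            rows = 0;
            bytes = 0;
        }
        cur.push(rg.index);
        rows += rg.rows;
        bytes += rg.compressed_bytes;
    }
    if !cur.is_empty() {
        chunks.push(cur);
    }
    chunks
}
```

Under this sketch, the reverse-scan rule maps onto two reversals: reverse the row groups inside each chunk, then reverse the chunk list, so the concatenated output order is the exact reverse of the forward order.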
run benchmarks

    baseline:
      ref: main
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false
    changed:
      ref: HEAD
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
🤖 Benchmark running (GKE): comparing HEAD (b64f2d9) to main using tpch
🤖 Benchmark running (GKE): comparing HEAD (b64f2d9) to main using clickbench_partitioned
🤖 Benchmark running (GKE): comparing HEAD (b64f2d9) to main using tpcds
🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs branch
🤖 Benchmark completed (GKE): tpcds — base (merge-base) vs branch
run benchmark clickbench_partitioned

    baseline:
      ref: main
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: true
    changed:
      ref: HEAD
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
🤖 Benchmark running (GKE): comparing HEAD (28ebd52) to main using clickbench_partitioned
🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs branch
**Regression diagnosis: it's the adaptive tracker itself, not the placement**

TL;DR: On the main regressed queries, forcing every filter to a static placement (either all-PostScan or all-RowFilter) is 20–50 ms faster than the adaptive path. The tracker's per-morsel

**Local A/B with four placement strategies (5 iterations, M-series)**

Knobs used on the branch binary +

For Q10, Q14, Q40 — all-postscan is faster than apache/main with pushdown=off. The branch's "worst case" placement is already better than the no-pushdown baseline; the adaptive path is just wasting CPU on top of that.

**Where the overhead is**

Two sources, both on the hot path:

**What does this tell us**

Adaptive currently beats pure static in the one spot it was designed for — hash-join dynamic filters like Q23 (3.3 s → 298 ms, still 11× faster than main+off). But on ClickBench user-written filters, static PostScan is a very hard baseline to beat, and our per-morsel/per-batch bookkeeping consistently loses to it.

**Possible fixes, smallest → largest**

Happy to prototype one of these — (2) looks cheapest and most impactful, since it removes the per-batch locks without changing any placement behavior. (4) would structurally recover the old code path for the common case, at the cost of "no adaptation unless we know adaptation is needed".

**On the morsel split itself**

No-filter queries (Q15/Q16/Q17) also regress ~5–10% at pushdown=off, which is independent of everything above — it's the morsel-split fan-out cost (multiple decoders + readers + projectors per file). Setting

🤖 Generated with Claude Code
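One hypothetical shape for fix (2) — removing the per-batch locks — is replacing a mutex-guarded stats struct with relaxed atomic counters, so the per-batch hot path is a few uncontended `fetch_add`s. This is a sketch of the idea, not the tracker's actual layout (field and type names are invented here):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical lock-free per-filter accumulator: worker threads record
/// per-batch stats with relaxed atomics on the hot path; the placement
/// decision reads a snapshot once per morsel.
#[derive(Default)]
struct FilterStats {
    rows_in: AtomicU64,
    rows_out: AtomicU64,
    eval_nanos: AtomicU64,
}

impl FilterStats {
    #[inline]
    fn record_batch(&self, rows_in: u64, rows_out: u64, eval_nanos: u64) {
        // Relaxed suffices: these are monotone counters read only for heuristics.
        self.rows_in.fetch_add(rows_in, Ordering::Relaxed);
        self.rows_out.fetch_add(rows_out, Ordering::Relaxed);
        self.eval_nanos.fetch_add(eval_nanos, Ordering::Relaxed);
    }

    /// Fraction of rows the filter kept so far (1.0 when nothing recorded).
    fn selectivity(&self) -> f64 {
        let rows_in = self.rows_in.load(Ordering::Relaxed);
        if rows_in == 0 {
            return 1.0;
        }
        self.rows_out.load(Ordering::Relaxed) as f64 / rows_in as f64
    }
}
```

The placement behavior is untouched by a change like this; only the synchronization cost of the bookkeeping moves.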
Benchmark for this request hit the 7200s job deadline before finishing.
**Isolation experiment: where does the overhead actually live?**

Built the three strata separately, then ran a 10-iteration, 4-way comparison on ClickBench-partitioned (local M-series):

pushdown=off
pushdown=on

**Findings**

**What changed in**
run benchmark clickbench_partitioned

    baseline:
      ref: main
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: true
    changed:
      ref: HEAD
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
run benchmark clickbench_partitioned

    baseline:
      ref: main
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: true
    changed:
      ref: HEAD
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
The previous `build_stream` built every morsel's `RowFilter`, `ParquetPushDecoder`, `AsyncFileReader`, and `Projector` eagerly in a single loop inside the file planner — before any morsel was scheduled. That loop ran on the scheduler thread and was visible as a 10–15% regression vs. main on ClickBench-partitioned queries that have many row-group morsels per file (e.g. Q15, Q16 at pushdown=off).

Replace `ParquetStreamMorsel` (which held a pre-built `BoxStream`) with `ParquetLazyMorsel`, which holds only the per-chunk `ParquetAccessPlan` plus an `Arc<LazyMorselShared>` of the file-level state. The decoder and reader are constructed inside `Morsel::into_stream`, so each morsel pays its setup cost only when the scheduler actually picks it up, and the work is distributed across worker threads instead of serialised on the planner.

`FilePruner` is `!Clone` and drives whole-file early-stop via `EarlyStoppingStream`, so it still lives on chunk 0's morsel only. The warm `async_file_reader` from metadata / page-index / bloom-filter load is dropped at the end of `build_stream` — every morsel mints a fresh reader via the factory at `into_stream` time. For both built-in factories (`DefaultParquetFileReaderFactory`, `CachedParquetFileReaderFactory`) the "warm cache" benefit of reusing a reader is negligible because the underlying `Arc<dyn ObjectStore>` / `Arc<dyn FileMetadataCache>` is already shared across readers, so the simplification is free.

Local ClickBench-partitioned, 10 iterations, pushdown=off (M-series):

| Query | main (ms) | eager, before (ms) | lazy, this commit (ms) |
|-------|----------:|-------------------:|-----------------------:|
| Q14   | 325       | 335                | 313                    |
| Q15   | 309       | 358                | 302                    |
| Q16   | 911       | 1049               | 786                    |
| Q24   | 48        | 55                 | 56                     |
| Q26   | 41        | 45                 | 45                     |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
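The eager → lazy change can be modelled in miniature: the planner used to pay the construction cost once per morsel in its loop, and now each morsel carries only the shared file state and builds its decoder when picked up. All names below are stand-ins (counting decoder constructions with an atomic), not the real DataFusion types:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Stand-in for `LazyMorselShared`: file-level state every morsel shares.
struct FileShared {
    decoders_built: AtomicUsize, // proxy for metadata, schema, reader factory, ...
}

/// Stand-in for `ParquetLazyMorsel`: per-chunk plan + shared file state,
/// with no decoder, reader, or stream constructed yet.
struct LazyMorsel {
    chunk_row_groups: Vec<usize>,
    shared: Arc<FileShared>,
}

impl LazyMorsel {
    /// Pays the setup cost here, on the worker that picks the morsel up,
    /// instead of on the scheduler thread at plan time.
    fn into_stream(self) -> Vec<usize> {
        self.shared.decoders_built.fetch_add(1, Ordering::Relaxed);
        self.chunk_row_groups // stand-in for the actual record-batch stream
    }
}
```

The observable property is that planning N morsels performs zero decoder setups; each setup happens only when (and if) that morsel's stream is actually requested.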
🤖 Benchmark running (GKE): comparing HEAD (af2a26f) to main using clickbench_partitioned
🤖 Benchmark running (GKE): comparing HEAD (af2a26f) to main using clickbench_partitioned
🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs branch
🤖 Benchmark completed (GKE): clickbench_partitioned — base (merge-base) vs branch
**Local vs GKE are disagreeing — asking for an isolation run to nail it down**

The headline from this run:

My 10-iter local run on M-series showed the lazy-morsel branch (without any adaptive code) matching or beating apache/main on every query I tested, including Q15/Q16/Q17/Q24 at pushdown=off. GKE's 16-core Neoverse-V2 is disagreeing with that. Looking at which queries regress here:

The no-filter regressions implicate either the morsel-split fan-out or the lazy-morsel wrapper itself, not the adaptive tracker — there's no predicate to adapt on those. That contradicts my local result, which means either (a) GKE's cache/NUMA/allocator behaviour exposes an overhead my M-series hides, or (b) the local run had less variance than I gave it credit for and GKE is closer to the truth.

To isolate, could you kick off a clickbench run on PR #10 with the same config? PR #10 is now exactly

Trigger:

    run benchmark clickbench_partitioned
    baseline:
      ref: main
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false
    changed:
      ref: HEAD
      env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false

(posted on PR #10 directly, no adaptive env to set). Once I know which layer owns the no-filter regression on GKE, the follow-up is clear:

🤖 Generated with Claude Code
**Local full 43-query ClickBench, 5 iterations, 4 configs (M-series, default morsel budgets)**

Aggregate: branch is 12 % slower (pushdown=off) / 16 % slower (pushdown=on) than apache/main.

**Surprise: Q23 is not the win the GKE report suggested**

The recent GKE "11× faster on Q23" was comparing baseline=main+off (3480 ms) to branch+on (316 ms). Apples-to-apples (br+on vs main+on), Q23 is 2.28× slower than static pushdown on my laptop:

Static RowFilter does 119 ms; our adaptive does 272 ms. So yes, the adaptive system gives most of the arrow-rs speedup — but it leaves ~150 ms on the table that static pushdown doesn't. The GKE baseline config (

**Biggest regressions at pushdown=on (br+on / main+on)**

**Biggest wins at pushdown=on**

**What this tells us**

**Next steps**

I think we should separate the two concerns:

Want me to proceed with (1) first? It's the larger absolute regression, and its cause is narrower (no adaptive involved).

🤖 Generated with Claude Code
**PR #10 isolation — my morsel/lazy attribution was wrong**

Ran full ClickBench-partitioned plus a 10-iter drill-down comparing main, PR #10 (morsel-split + lazy, no adaptive), and this PR (adaptive). Headline:

**Aggregate (full 43 queries, 5 iter)**

PR #10 is essentially at parity with main — 2 % faster at pushdown=off, 3 % slower at pushdown=on. So the lazy-morsel refactor is clean on this hardware.

**10-iter drill-down on the queries I flagged yesterday**

Three things this tells us:

**Where to look next**

Given that PR #10 is at parity with main, the clean recommendation is: merge PR #10 as-is, and pursue the adaptive layer on this PR as a separate effort that has to prove itself against main+on, not against main+off. On the adaptive layer, the specific investigation is: on a query with

Happy to profile

🤖 Generated with Claude Code
**Investigation summary + a rebase question**

**What the profile showed**

Instrumented

The multi-thread numbers were the big hint: at

**Where the contention is coming from**

Not verified conclusively, but the strong suspicion is this combination:

Ruled out:

**What helped so far**

Commit 5856cca99b gates the file-level

Q3/Q26 closed fully. Q33/Q34 went from ~33 %/10 % slower to ~22 %/18 %. Q23 is still 2× slower on filter queries.

**Full 43-query aggregate (5 iter, pushdown=on)**

So this branch is now at 1.12× main aggregate (was 1.16× before). PR #10 is at parity.

**What would help next**

I want to keep going, but the next step is structural and probably needs its own PR:

**On the rebase**

I attempted to rebase this branch onto PR #10 cleanly, and the diff between PR #10 (

🤖 Generated with Claude Code
Adds a runtime filter-placement layer on top of the row-group-morsel split introduced by the parent commit. Each Parquet predicate is assigned a `FilterId` and flows through a state machine (`SelectivityTracker`) that moves it between three placements:

- `RowFilter` — evaluated inside the arrow-rs decoder as an `ArrowPredicate`, enabling late-materialization savings when the filter columns are a small fraction of the projection.
- `PostScan` — evaluated against the decoded wide batch before the projector strips it; used when late materialization has little to save or when the filter is CPU-heavy.
- `Dropped` — optional filters (hash-join dynamic filters wrapped in `OptionalFilterPhysicalExpr`) are skipped mid-stream when the CI upper bound on their bytes-saved-per-second falls below a minimum.

Initial placement uses a cheap byte-ratio heuristic (`filter_compressed_bytes / projection_compressed_bytes`); subsequent placements refine it using Welford online stats reported from both the row-filter path (`DatafusionArrowPredicate::evaluate`) and the post-scan path (`apply_post_scan_filters_with_stats`). Placement is re-evaluated per morsel, so stats from the prior morsel's scan feed into the next morsel's decision.

Config knobs on `TableParquetOptions.execution.parquet`:

- `filter_pushdown_min_bytes_per_sec` (default 100 MB/s)
- `filter_collecting_byte_ratio_threshold` (default 0.20)
- `filter_confidence_z` (default 2.0 ≈ 97.5 % one-sided CI)

The `reorder_filters` option is removed; the adaptive tracker subsumes its role.

Notable trade-offs documented in the PR discussion:

- The adaptive layer adds ~10 % aggregate ClickBench overhead vs the pure morsel-split base (PR #10). Most of it lives in `ParquetLazyMorsel::build_stream_now` under parallel load; single-thread shows no regression. A candidate fix is splitting adaptive state out of `LazyMorselShared` so non-adaptive queries get the same `Arc` allocation shape as PR #10.
- The `OptionalFilterPhysicalExpr` wrapper changes plan display output (`DynamicFilter [...]` → `Optional(DynamicFilter [...])`); several sqllogictest expected outputs and snapshot tests were updated accordingly.
- A selectivity-tracker microbench was added under `benches/selectivity_tracker.rs` so future iterations on the tracker can be measured independently of full ClickBench.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
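The Welford/CI mechanics behind the `Dropped` placement can be sketched as follows — a hypothetical miniature of the drop decision using the commit's defaults (z = 2.0, 100 MB/s floor). The real `SelectivityTracker` has more placements and inputs; `OnlineStats` and `should_drop` are illustrative names:

```rust
/// Welford online mean/variance over per-batch bytes-saved-per-second
/// samples for one optional filter.
#[derive(Default)]
struct OnlineStats {
    n: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations
}

impl OnlineStats {
    fn update(&mut self, x: f64) {
        self.n += 1;
        let d = x - self.mean;
        self.mean += d / self.n as f64;
        self.m2 += d * (x - self.mean);
    }

    /// One-sided upper confidence bound: mean + z * stddev / sqrt(n).
    fn upper_bound(&self, z: f64) -> f64 {
        if self.n < 2 {
            return f64::INFINITY; // not enough evidence to drop anything
        }
        let var = self.m2 / (self.n - 1) as f64;
        self.mean + z * (var / self.n as f64).sqrt()
    }
}

/// Skip the optional filter mid-stream once even the optimistic estimate of
/// its bytes-saved-per-second falls below the floor (default 100 MB/s).
fn should_drop(stats: &OnlineStats, min_bytes_per_sec: f64, z: f64) -> bool {
    stats.upper_bound(z) < min_bytes_per_sec
}
```

Using the upper bound rather than the mean makes the drop conservative: a filter is only abandoned when, with ~97.5 % one-sided confidence at z = 2.0, it is saving less than the configured floor.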
**Rebase done — and it uncovered the root cause**

Squashed the adaptive delta as a single commit on top of PR #10's

**But the interesting thing is what rebasing did to the numbers**

Full 43-query ClickBench, pushdown=on, 5 iter:

The rebased branch is 4 % faster than apache/main in aggregate — and faster than PR #10 too. Contrast that with the un-rebased branch, which was 1.12× slower than main on the last run.

Q33 verification across 3 trials (the query that was the stickiest regression on the un-rebased PR):

Rebased Q33 is at parity with PR #10 (1443 ms), ~24 % faster than the un-rebased branch.

**Per-query vs main on the rebased branch (top 10 each, pushdown=on)**

Wins:

Regressions:

Q23 still regresses ~1.8× — this is the "filter ⊆ projection with a selective LIKE" query I diagnosed earlier. Adaptive's byte-ratio heuristic starts it as PostScan (byte_ratio ≈ 1), and with

**Where this leaves us**

🤖 Generated with Claude Code
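For reference, the byte-ratio heuristic that puts Q23-shaped queries in PostScan can be modelled like this — a hedged sketch using the `filter_collecting_byte_ratio_threshold` default of 0.20; the enum and function names are illustrative, not the tracker's API:

```rust
#[derive(Debug, PartialEq)]
enum Placement {
    RowFilter, // evaluate inside the decoder; late materialization pays off
    PostScan,  // evaluate on the decoded wide batch
}

/// Initial placement from compressed sizes only: push the filter into the
/// decoder when its columns are a small fraction of the projection bytes.
fn initial_placement(
    filter_compressed_bytes: u64,
    projection_compressed_bytes: u64,
    threshold: f64,
) -> Placement {
    let ratio = filter_compressed_bytes as f64 / projection_compressed_bytes.max(1) as f64;
    if ratio <= threshold {
        Placement::RowFilter
    } else {
        Placement::PostScan
    }
}
```

On a query where the filter column is essentially the whole projection (ratio ≈ 1), this starts the filter as PostScan even when a highly selective predicate would make RowFilter cheaper — which is the Q23 behaviour described above.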
Force-pushed from 5856cca to dbcf5ac
**Summary**

Mashup of two in-flight PRs, branched off `adriangb/filter-pushdown-dynamic-bytes-morsels` with `pydantic#59` cherry-picked on top:

- (`SelectivityTracker` that moves filters between `RowFilter` and post-scan based on measured effectiveness, plus optional-filter mid-stream skip).
- (`ParquetAccessPlan::split_into_chunks`, per-chunk `ParquetPushDecoder` + `AsyncFileReader`, `EarlyStoppingStream` attached to chunk 0 only).
The two PRs both touched
datafusion/datasource-parquet/src/opener.rs::build_stream. The merge keeps:partition_filtersbucket split → row-filter vs post-scan,build_row_filterinvoked once to drainunbuildableback into post-scan (wrong-result guard),post_scan_other_bytes_per_rowprecompute, read-plan projection mask over the union of projection + post-scan columns, rebased projection/post-scan exprs againststream_schema.RowFilterfrom the (stable)row_filter_conjunctslist —RowFilteris notClone; build the decoder withprojection_mask.clone(); mint a freshAsyncFileReader(chunk 0 reuses the warm one); clonepost_scan_filtersandpost_scan_other_bytes_per_rowinto eachPushDecoderStreamState; decoder-level.with_limit()still only applied whenpost_scan_filters.is_empty();EarlyStoppingStreamwraps chunk 0 only.datafusion/datasource-parquet/src/access_plan.rsanddatafusion/datasource-parquet/src/source.rsapplied clean.Test plan
cargo check -p datafusion-datasource-parquetcargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warningscargo test -p datafusion-datasource-parquet --lib— 156 passed, includingtest_row_group_split_*and fulltest_reverse_scan_*suitecargo fmt --allclippy::mutable_key_typeerror indatafusion-exprthat fails workspace clippy.🤖 Generated with Claude Code