Adaptive filter scheduling + row-group morsel split #9

Open
adriangb wants to merge 3 commits into main from filter-pushdown-with-row-group-morsels

@adriangb
Owner

Summary

Mashup of two in-flight PRs, branched off adriangb/filter-pushdown-dynamic-bytes-morsels with pydantic#59 cherry-picked on top:

Merge resolution

The two PRs both touched datafusion/datasource-parquet/src/opener.rs::build_stream. The merge keeps:

  • File-level (once per open): partition_filters bucket split into row-filter vs post-scan; build_row_filter invoked once to drain unbuildable conjuncts back into post-scan (wrong-result guard); post_scan_other_bytes_per_row precomputed; read-plan projection mask built over the union of projection + post-scan columns; projection/post-scan exprs rebased against stream_schema.
  • Per chunk (in the morsel loop): rebuild RowFilter from the (stable) row_filter_conjuncts list, since RowFilter is not Clone; build the decoder with projection_mask.clone(); mint a fresh AsyncFileReader (chunk 0 reuses the warm one); clone post_scan_filters and post_scan_other_bytes_per_row into each PushDecoderStreamState; decoder-level .with_limit() is still applied only when post_scan_filters.is_empty(); EarlyStoppingStream wraps chunk 0 only.

datafusion/datasource-parquet/src/access_plan.rs and datafusion/datasource-parquet/src/source.rs applied cleanly.
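
For readers unfamiliar with the pattern, the file-level versus per-chunk split can be illustrated with a toy sketch. Everything here is hypothetical: Conjunct, ChunkFilter, FileLevelPlan and scan_chunks are stand-ins invented for illustration and are not the types or functions in opener.rs. The sketch only shows the shape of the idea, assuming a filter object that cannot be cloned and a conjunct list that can be shared cheaply.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for one predicate conjunct; cheap to share via `Arc`.
type Conjunct = Arc<dyn Fn(i64) -> bool + Send + Sync>;

/// Hypothetical stand-in for a filter that owns boxed closures and is
/// therefore not `Clone` (mirroring the "RowFilter is not Clone" constraint).
struct ChunkFilter {
    predicates: Vec<Box<dyn FnMut(i64) -> bool + Send>>,
}

impl ChunkFilter {
    /// Build a fresh filter from the stable conjunct list.
    fn from_conjuncts(conjuncts: &[Conjunct]) -> Self {
        let predicates = conjuncts
            .iter()
            .map(|c| {
                let c = Arc::clone(c);
                Box::new(move |v: i64| (*c)(v)) as Box<dyn FnMut(i64) -> bool + Send>
            })
            .collect();
        Self { predicates }
    }

    /// A value is kept only if every predicate accepts it.
    fn keep(&mut self, value: i64) -> bool {
        self.predicates.iter_mut().all(|p| (*p)(value))
    }
}

/// Hypothetical file-level state: computed once per open, shared by all chunks.
struct FileLevelPlan {
    row_filter_conjuncts: Vec<Conjunct>,
    post_scan_filters: Vec<Conjunct>,
}

/// Per-chunk work: rebuild the non-`Clone` filter from the shared conjuncts,
/// clone the post-scan filters, then apply both stages to the chunk's values.
fn scan_chunks(plan: &FileLevelPlan, chunks: &[Vec<i64>]) -> Vec<Vec<i64>> {
    chunks
        .iter()
        .map(|chunk| {
            let mut row_filter = ChunkFilter::from_conjuncts(&plan.row_filter_conjuncts);
            let post_scan = plan.post_scan_filters.clone();
            chunk
                .iter()
                .copied()
                .filter(|v| row_filter.keep(*v))
                .filter(|v| post_scan.iter().all(|f| (**f)(*v)))
                .collect()
        })
        .collect()
}
```

The design point is that only the conjunct list needs to be stable across chunks; the filter object itself is treated as disposable and rebuilt for each morsel.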

Test plan

  • cargo check -p datafusion-datasource-parquet
  • cargo clippy -p datafusion-datasource-parquet --all-targets --all-features -- -D warnings
  • cargo test -p datafusion-datasource-parquet --lib — 156 passed, including test_row_group_split_* and full test_reverse_scan_* suite
  • cargo fmt --all
  • Broader workspace tests — not run here; apache-main already has an unrelated clippy::mutable_key_type error in datafusion-expr that fails workspace clippy.

🤖 Generated with Claude Code

@adriangb
Owner Author

run benchmarks

baseline:
    ref: main
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
       DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false
changed:
    ref: HEAD
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4285036605-1634-ldphc 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing HEAD (3c51143) to main diff using: clickbench_partitioned
Results will be posted here when complete


@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4285036605-1636-wfks9 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing HEAD (3c51143) to main diff using: tpch
Results will be posted here when complete


@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4285036605-1635-r26rw 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing HEAD (3c51143) to main diff using: tpcds
Results will be posted here when complete


@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃ filter-pushdown-with-row-group-morsels ┃         Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.16 / 4.34 ±6.27 / 16.87 ms │           1.16 / 4.34 ±6.28 / 16.89 ms │      no change │
│ QQuery 1  │        12.89 / 13.25 ±0.20 / 13.44 ms │         14.38 / 15.00 ±0.44 / 15.61 ms │   1.13x slower │
│ QQuery 2  │        38.06 / 38.27 ±0.16 / 38.51 ms │         39.27 / 39.41 ±0.09 / 39.55 ms │      no change │
│ QQuery 3  │        31.66 / 32.23 ±0.74 / 33.62 ms │         32.81 / 33.26 ±0.42 / 33.88 ms │      no change │
│ QQuery 4  │     254.82 / 258.49 ±5.52 / 269.38 ms │      247.40 / 253.48 ±4.06 / 260.12 ms │      no change │
│ QQuery 5  │     294.66 / 299.04 ±3.55 / 305.38 ms │      290.90 / 300.21 ±6.68 / 307.83 ms │      no change │
│ QQuery 6  │          6.26 / 9.72 ±3.21 / 14.38 ms │            5.20 / 5.79 ±0.70 / 7.14 ms │  +1.68x faster │
│ QQuery 7  │        14.84 / 15.04 ±0.18 / 15.32 ms │         16.25 / 16.40 ±0.11 / 16.58 ms │   1.09x slower │
│ QQuery 8  │     362.00 / 374.97 ±9.07 / 389.24 ms │     354.80 / 368.40 ±12.84 / 391.10 ms │      no change │
│ QQuery 9  │     510.85 / 524.45 ±9.60 / 538.31 ms │     497.05 / 518.01 ±13.05 / 534.54 ms │      no change │
│ QQuery 10 │        76.41 / 77.70 ±1.49 / 80.61 ms │       99.91 / 101.81 ±3.31 / 108.41 ms │   1.31x slower │
│ QQuery 11 │        87.09 / 87.47 ±0.24 / 87.83 ms │      109.21 / 110.19 ±0.53 / 110.67 ms │   1.26x slower │
│ QQuery 12 │     287.55 / 297.79 ±7.73 / 305.40 ms │      285.61 / 295.83 ±7.23 / 304.13 ms │      no change │
│ QQuery 13 │    413.96 / 431.06 ±15.90 / 459.65 ms │     452.22 / 474.10 ±19.11 / 509.43 ms │   1.10x slower │
│ QQuery 14 │     304.25 / 306.25 ±2.09 / 309.07 ms │      338.84 / 342.41 ±3.63 / 346.87 ms │   1.12x slower │
│ QQuery 15 │    317.56 / 332.41 ±13.54 / 354.95 ms │     324.14 / 339.95 ±14.73 / 363.97 ms │      no change │
│ QQuery 16 │     666.74 / 672.92 ±5.47 / 682.80 ms │      680.91 / 689.64 ±5.54 / 696.00 ms │      no change │
│ QQuery 17 │     672.05 / 674.81 ±3.04 / 680.04 ms │      673.79 / 677.59 ±4.82 / 686.00 ms │      no change │
│ QQuery 18 │ 1369.96 / 1428.18 ±33.05 / 1467.17 ms │  1440.43 / 1484.93 ±33.04 / 1519.74 ms │      no change │
│ QQuery 19 │        30.96 / 35.01 ±6.96 / 48.90 ms │         30.91 / 33.26 ±1.83 / 36.27 ms │      no change │
│ QQuery 20 │    516.20 / 528.43 ±11.48 / 548.22 ms │      518.88 / 525.65 ±8.26 / 540.97 ms │      no change │
│ QQuery 21 │     589.79 / 593.88 ±2.39 / 596.59 ms │      577.68 / 582.63 ±4.90 / 588.96 ms │      no change │
│ QQuery 22 │  1049.80 / 1053.57 ±2.74 / 1058.17 ms │      799.51 / 812.32 ±6.84 / 818.88 ms │  +1.30x faster │
│ QQuery 23 │ 3261.35 / 3290.93 ±22.71 / 3313.94 ms │      283.51 / 292.34 ±5.90 / 301.87 ms │ +11.26x faster │
│ QQuery 24 │        44.03 / 47.87 ±2.54 / 51.89 ms │         39.45 / 41.77 ±1.86 / 44.05 ms │  +1.15x faster │
│ QQuery 25 │     115.34 / 116.06 ±0.79 / 117.03 ms │      120.26 / 122.08 ±1.34 / 124.24 ms │   1.05x slower │
│ QQuery 26 │        42.87 / 44.92 ±1.41 / 47.03 ms │         57.27 / 59.36 ±1.69 / 61.12 ms │   1.32x slower │
│ QQuery 27 │     659.34 / 664.12 ±4.11 / 671.69 ms │      644.05 / 648.14 ±3.78 / 652.92 ms │      no change │
│ QQuery 28 │ 2973.12 / 2997.72 ±15.45 / 3013.48 ms │  2982.64 / 3013.74 ±17.43 / 3034.69 ms │      no change │
│ QQuery 29 │        44.41 / 48.41 ±3.79 / 54.84 ms │         45.26 / 49.96 ±4.71 / 57.81 ms │      no change │
│ QQuery 30 │     328.34 / 335.32 ±7.33 / 348.39 ms │      332.72 / 336.99 ±5.17 / 346.86 ms │      no change │
│ QQuery 31 │     335.27 / 340.42 ±3.57 / 345.43 ms │      328.54 / 333.86 ±3.41 / 338.41 ms │      no change │
│ QQuery 32 │ 1165.27 / 1214.51 ±39.67 / 1260.08 ms │  1184.85 / 1246.93 ±36.11 / 1288.69 ms │      no change │
│ QQuery 33 │ 1521.91 / 1586.67 ±38.36 / 1632.12 ms │  1431.19 / 1540.04 ±64.00 / 1595.49 ms │      no change │
│ QQuery 34 │  1483.22 / 1496.57 ±7.53 / 1505.61 ms │  1460.79 / 1512.56 ±59.65 / 1625.64 ms │      no change │
│ QQuery 35 │     311.29 / 317.20 ±5.15 / 326.63 ms │     308.66 / 315.93 ±10.47 / 336.56 ms │      no change │
│ QQuery 36 │        66.26 / 68.66 ±2.94 / 73.77 ms │         61.89 / 64.74 ±2.06 / 67.77 ms │  +1.06x faster │
│ QQuery 37 │        36.88 / 38.26 ±1.17 / 40.42 ms │         33.75 / 35.04 ±0.68 / 35.66 ms │  +1.09x faster │
│ QQuery 38 │        40.28 / 42.55 ±1.94 / 45.85 ms │         36.48 / 38.10 ±1.31 / 39.68 ms │  +1.12x faster │
│ QQuery 39 │     124.48 / 129.35 ±4.58 / 137.06 ms │      115.11 / 118.85 ±4.97 / 128.47 ms │  +1.09x faster │
│ QQuery 40 │        17.02 / 19.09 ±1.45 / 21.23 ms │         18.09 / 20.15 ±1.25 / 21.76 ms │   1.06x slower │
│ QQuery 41 │        14.87 / 15.28 ±0.29 / 15.74 ms │         15.86 / 17.61 ±1.29 / 19.82 ms │   1.15x slower │
│ QQuery 42 │        13.71 / 14.25 ±0.32 / 14.67 ms │         15.25 / 15.66 ±0.28 / 16.00 ms │   1.10x slower │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 20917.44ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 17848.43ms │
│ Average Time (HEAD)                                   │   486.45ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   415.08ms │
│ Queries Faster                                        │          8 │
│ Queries Slower                                        │         11 │
│ Queries with No Change                                │         24 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘
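
To put the summary in context, the totals can be compared with the single largest per-query change. The sketch below is a small, hypothetical helper (speedup and share_of_saving are names invented here, not part of the benchmark runner), and it assumes the Total Time rows are sums of the per-query mean times reported in the table.

```rust
/// Ratio of baseline time to candidate time; above 1.0 means the candidate is faster.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

/// Fraction of the overall time saving that a single query accounts for.
/// Assumes the totals are sums of per-query mean times.
fn share_of_saving(
    total_baseline_ms: f64,
    total_candidate_ms: f64,
    query_baseline_ms: f64,
    query_candidate_ms: f64,
) -> f64 {
    (query_baseline_ms - query_candidate_ms) / (total_baseline_ms - total_candidate_ms)
}
```

With the figures above, speedup(20917.44, 17848.43) comes to about 1.17, and share_of_saving(20917.44, 17848.43, 3290.93, 292.34) comes to roughly 0.98. Under the stated assumption, nearly all of the total improvement on this benchmark comes from Query 23, which is worth keeping in mind next to the 11 queries reported as slower.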

Resource Usage

clickbench_partitioned — base (merge-base)

Metric         Value
Wall time      110.0s
Peak memory    36.0 GiB
Avg memory     27.2 GiB
CPU user       1073.4s
CPU sys        98.8s
Peak spill     0 B

clickbench_partitioned — branch

Metric         Value
Wall time      95.0s
Peak memory    36.7 GiB
Avg memory     27.8 GiB
CPU user       901.3s
CPU sys        93.3s
Peak spill     0 B

@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                     HEAD ┃   filter-pushdown-with-row-group-morsels ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │              6.43 / 6.87 ±0.66 / 8.19 ms │              6.18 / 6.60 ±0.72 / 8.03 ms │     no change │
│ QQuery 2  │        146.11 / 147.96 ±1.48 / 150.04 ms │        108.52 / 109.26 ±0.68 / 110.19 ms │ +1.35x faster │
│ QQuery 3  │        114.10 / 115.16 ±0.97 / 116.61 ms │        125.29 / 125.98 ±0.66 / 127.24 ms │  1.09x slower │
│ QQuery 4  │    1296.51 / 1329.06 ±18.92 / 1348.79 ms │    1018.98 / 1040.66 ±11.59 / 1051.48 ms │ +1.28x faster │
│ QQuery 5  │        172.51 / 174.22 ±1.13 / 175.68 ms │        172.10 / 175.81 ±2.95 / 180.32 ms │     no change │
│ QQuery 6  │       819.04 / 841.40 ±18.56 / 870.64 ms │        203.20 / 210.17 ±6.88 / 219.28 ms │ +4.00x faster │
│ QQuery 7  │        338.47 / 342.49 ±3.13 / 347.74 ms │        324.54 / 333.00 ±5.34 / 339.36 ms │     no change │
│ QQuery 8  │        116.75 / 117.80 ±0.66 / 118.59 ms │        139.40 / 143.24 ±2.70 / 147.27 ms │  1.22x slower │
│ QQuery 9  │        100.48 / 103.18 ±2.04 / 105.62 ms │         94.92 / 102.26 ±4.35 / 108.50 ms │     no change │
│ QQuery 10 │        104.46 / 106.69 ±1.46 / 108.22 ms │        134.79 / 143.83 ±5.80 / 152.84 ms │  1.35x slower │
│ QQuery 11 │       867.92 / 879.60 ±10.19 / 896.96 ms │       668.96 / 685.49 ±10.99 / 700.33 ms │ +1.28x faster │
│ QQuery 12 │           44.01 / 45.73 ±1.42 / 47.67 ms │           36.25 / 38.09 ±1.07 / 39.29 ms │ +1.20x faster │
│ QQuery 13 │        396.32 / 399.19 ±2.25 / 402.43 ms │        551.31 / 562.55 ±8.67 / 575.18 ms │  1.41x slower │
│ QQuery 14 │     1004.16 / 1008.17 ±4.54 / 1015.95 ms │        872.97 / 884.03 ±7.33 / 892.02 ms │ +1.14x faster │
│ QQuery 15 │           15.39 / 16.50 ±1.10 / 18.54 ms │           18.20 / 19.72 ±1.74 / 22.82 ms │  1.19x slower │
│ QQuery 16 │              7.29 / 7.66 ±0.28 / 7.97 ms │              6.91 / 7.44 ±0.75 / 8.88 ms │     no change │
│ QQuery 17 │        227.71 / 230.46 ±2.26 / 234.46 ms │        177.89 / 179.95 ±2.01 / 183.73 ms │ +1.28x faster │
│ QQuery 18 │        127.72 / 129.08 ±0.87 / 130.04 ms │        177.99 / 187.00 ±5.21 / 192.94 ms │  1.45x slower │
│ QQuery 19 │        153.64 / 155.03 ±1.17 / 157.05 ms │        141.44 / 144.53 ±2.00 / 146.91 ms │ +1.07x faster │
│ QQuery 20 │           13.35 / 14.40 ±0.61 / 14.99 ms │           15.19 / 16.23 ±0.96 / 17.73 ms │  1.13x slower │
│ QQuery 21 │           19.27 / 19.77 ±0.44 / 20.58 ms │           20.63 / 21.35 ±0.38 / 21.71 ms │  1.08x slower │
│ QQuery 22 │        482.88 / 488.66 ±3.92 / 495.06 ms │        491.01 / 492.51 ±1.10 / 493.73 ms │     no change │
│ QQuery 23 │       871.84 / 886.19 ±10.37 / 898.48 ms │       836.36 / 861.87 ±15.27 / 883.92 ms │     no change │
│ QQuery 24 │        383.59 / 388.47 ±4.88 / 395.66 ms │        124.03 / 127.27 ±1.92 / 129.14 ms │ +3.05x faster │
│ QQuery 25 │        342.74 / 344.75 ±1.61 / 346.91 ms │        284.99 / 288.19 ±1.90 / 290.92 ms │ +1.20x faster │
│ QQuery 26 │           80.63 / 81.90 ±0.79 / 82.84 ms │        142.24 / 148.05 ±7.63 / 163.02 ms │  1.81x slower │
│ QQuery 27 │              7.15 / 7.75 ±0.37 / 8.28 ms │              6.58 / 7.08 ±0.42 / 7.71 ms │ +1.09x faster │
│ QQuery 28 │        149.52 / 151.26 ±1.74 / 154.52 ms │        149.90 / 152.88 ±2.30 / 155.91 ms │     no change │
│ QQuery 29 │        283.31 / 284.75 ±1.33 / 287.00 ms │        222.70 / 225.44 ±3.85 / 233.03 ms │ +1.26x faster │
│ QQuery 30 │           43.75 / 45.57 ±2.15 / 49.66 ms │           53.63 / 55.63 ±1.96 / 59.24 ms │  1.22x slower │
│ QQuery 31 │        170.54 / 172.35 ±1.35 / 173.64 ms │        169.48 / 170.63 ±0.78 / 171.77 ms │     no change │
│ QQuery 32 │           13.58 / 13.95 ±0.30 / 14.30 ms │           13.88 / 14.40 ±0.54 / 15.42 ms │     no change │
│ QQuery 33 │        139.25 / 140.75 ±1.29 / 142.91 ms │        127.82 / 132.75 ±2.63 / 135.61 ms │ +1.06x faster │
│ QQuery 34 │              6.79 / 7.10 ±0.23 / 7.48 ms │              6.86 / 7.08 ±0.28 / 7.62 ms │     no change │
│ QQuery 35 │        107.16 / 107.59 ±0.33 / 108.07 ms │        107.47 / 109.69 ±1.93 / 112.30 ms │     no change │
│ QQuery 36 │              6.55 / 6.68 ±0.19 / 7.04 ms │              6.28 / 6.45 ±0.21 / 6.83 ms │     no change │
│ QQuery 37 │              8.09 / 8.64 ±0.35 / 9.02 ms │              5.02 / 5.24 ±0.25 / 5.71 ms │ +1.65x faster │
│ QQuery 38 │           84.72 / 88.62 ±3.02 / 92.73 ms │           89.32 / 93.83 ±3.11 / 98.57 ms │  1.06x slower │
│ QQuery 39 │        123.90 / 125.66 ±1.27 / 127.35 ms │        132.10 / 133.71 ±1.31 / 135.41 ms │  1.06x slower │
│ QQuery 40 │        111.13 / 115.07 ±3.96 / 121.82 ms │        118.38 / 123.58 ±5.87 / 134.77 ms │  1.07x slower │
│ QQuery 41 │           14.06 / 14.70 ±0.70 / 15.82 ms │           16.14 / 17.46 ±1.01 / 18.71 ms │  1.19x slower │
│ QQuery 42 │        106.83 / 108.66 ±1.26 / 110.34 ms │        109.07 / 110.68 ±1.54 / 113.10 ms │     no change │
│ QQuery 43 │              5.56 / 5.68 ±0.13 / 5.94 ms │              5.33 / 5.60 ±0.17 / 5.75 ms │     no change │
│ QQuery 44 │           11.28 / 11.74 ±0.37 / 12.16 ms │           10.90 / 11.50 ±0.37 / 12.01 ms │     no change │
│ QQuery 45 │           50.71 / 52.15 ±0.88 / 53.20 ms │           42.64 / 44.49 ±1.49 / 46.48 ms │ +1.17x faster │
│ QQuery 46 │              8.14 / 8.45 ±0.32 / 8.98 ms │              8.00 / 8.31 ±0.25 / 8.65 ms │     no change │
│ QQuery 47 │        684.22 / 689.23 ±4.20 / 694.83 ms │       684.50 / 705.23 ±14.95 / 720.53 ms │     no change │
│ QQuery 48 │        290.17 / 291.66 ±1.21 / 293.64 ms │        360.18 / 362.30 ±1.49 / 363.74 ms │  1.24x slower │
│ QQuery 49 │        251.79 / 254.53 ±2.11 / 257.40 ms │        238.32 / 241.96 ±3.16 / 247.53 ms │     no change │
│ QQuery 50 │        223.88 / 230.01 ±3.69 / 233.33 ms │        240.18 / 245.23 ±3.76 / 249.82 ms │  1.07x slower │
│ QQuery 51 │        181.79 / 185.32 ±3.79 / 192.18 ms │        211.07 / 213.21 ±2.23 / 217.01 ms │  1.15x slower │
│ QQuery 52 │        107.79 / 108.30 ±0.73 / 109.74 ms │        104.66 / 109.35 ±3.41 / 114.13 ms │     no change │
│ QQuery 53 │        102.23 / 102.79 ±0.71 / 104.01 ms │        143.19 / 147.23 ±2.83 / 152.02 ms │  1.43x slower │
│ QQuery 54 │        144.13 / 147.78 ±2.19 / 150.62 ms │        125.23 / 128.89 ±2.20 / 131.64 ms │ +1.15x faster │
│ QQuery 55 │        106.35 / 106.91 ±0.59 / 107.93 ms │        107.27 / 111.08 ±2.08 / 113.17 ms │     no change │
│ QQuery 56 │        140.93 / 143.24 ±1.38 / 145.22 ms │        129.86 / 132.09 ±1.51 / 133.79 ms │ +1.08x faster │
│ QQuery 57 │        171.55 / 173.73 ±1.76 / 176.51 ms │        180.81 / 182.84 ±1.59 / 185.55 ms │  1.05x slower │
│ QQuery 58 │        273.91 / 275.23 ±1.22 / 277.14 ms │        229.47 / 232.58 ±2.36 / 236.00 ms │ +1.18x faster │
│ QQuery 59 │        197.44 / 200.48 ±1.91 / 203.03 ms │        229.35 / 230.97 ±1.10 / 232.80 ms │  1.15x slower │
│ QQuery 60 │        141.98 / 143.45 ±1.46 / 145.32 ms │        134.61 / 141.10 ±8.93 / 158.77 ms │     no change │
│ QQuery 61 │           12.74 / 12.94 ±0.26 / 13.45 ms │           12.00 / 12.52 ±0.59 / 13.64 ms │     no change │
│ QQuery 62 │       889.61 / 901.88 ±11.68 / 923.49 ms │       866.65 / 927.24 ±33.31 / 955.31 ms │     no change │
│ QQuery 63 │        105.30 / 107.23 ±1.51 / 109.78 ms │        141.75 / 146.48 ±3.52 / 150.51 ms │  1.37x slower │
│ QQuery 64 │        683.30 / 687.86 ±4.14 / 695.21 ms │        719.11 / 732.53 ±7.47 / 741.44 ms │  1.06x slower │
│ QQuery 65 │        248.99 / 253.22 ±2.35 / 255.55 ms │        321.52 / 327.97 ±4.77 / 335.56 ms │  1.30x slower │
│ QQuery 66 │       242.27 / 257.79 ±10.96 / 271.11 ms │        171.83 / 187.42 ±8.19 / 196.07 ms │ +1.38x faster │
│ QQuery 67 │        316.42 / 319.54 ±2.70 / 324.50 ms │        472.48 / 482.74 ±8.86 / 493.10 ms │  1.51x slower │
│ QQuery 68 │             8.50 / 9.76 ±0.81 / 10.66 ms │            9.04 / 10.81 ±1.23 / 12.33 ms │  1.11x slower │
│ QQuery 69 │        101.70 / 103.98 ±1.40 / 105.93 ms │        134.56 / 141.93 ±4.02 / 146.56 ms │  1.36x slower │
│ QQuery 70 │       336.50 / 352.53 ±13.50 / 372.27 ms │        371.03 / 382.30 ±8.92 / 398.50 ms │  1.08x slower │
│ QQuery 71 │        137.38 / 139.60 ±3.30 / 146.06 ms │        127.51 / 132.18 ±2.38 / 134.18 ms │ +1.06x faster │
│ QQuery 72 │        618.04 / 627.34 ±5.87 / 633.71 ms │        482.70 / 490.97 ±5.15 / 498.01 ms │ +1.28x faster │
│ QQuery 73 │              6.61 / 7.56 ±0.69 / 8.74 ms │              6.52 / 7.32 ±0.62 / 8.41 ms │     no change │
│ QQuery 74 │       560.60 / 583.87 ±17.13 / 612.19 ms │        478.29 / 490.31 ±6.22 / 495.16 ms │ +1.19x faster │
│ QQuery 75 │        276.85 / 277.89 ±0.98 / 279.40 ms │        271.12 / 275.03 ±3.58 / 280.67 ms │     no change │
│ QQuery 76 │        131.17 / 133.53 ±2.50 / 138.27 ms │        153.92 / 156.75 ±1.84 / 159.51 ms │  1.17x slower │
│ QQuery 77 │        187.62 / 189.61 ±1.82 / 192.85 ms │        204.61 / 206.55 ±1.68 / 208.89 ms │  1.09x slower │
│ QQuery 78 │        336.84 / 341.57 ±2.75 / 345.26 ms │        301.59 / 308.85 ±4.60 / 313.46 ms │ +1.11x faster │
│ QQuery 79 │        232.91 / 234.83 ±1.98 / 238.37 ms │        255.97 / 261.87 ±3.64 / 265.95 ms │  1.12x slower │
│ QQuery 80 │        323.44 / 327.34 ±2.40 / 330.75 ms │        256.94 / 261.65 ±2.45 / 263.59 ms │ +1.25x faster │
│ QQuery 81 │           27.62 / 29.18 ±1.16 / 30.68 ms │           31.25 / 32.77 ±1.20 / 34.61 ms │  1.12x slower │
│ QQuery 82 │           40.22 / 41.78 ±0.87 / 42.71 ms │           44.75 / 46.01 ±1.26 / 48.34 ms │  1.10x slower │
│ QQuery 83 │           37.95 / 38.95 ±0.91 / 40.53 ms │           41.97 / 42.71 ±0.53 / 43.61 ms │  1.10x slower │
│ QQuery 84 │           47.96 / 49.38 ±0.81 / 50.34 ms │           51.71 / 52.27 ±0.42 / 52.99 ms │  1.06x slower │
│ QQuery 85 │        147.29 / 148.94 ±1.72 / 151.98 ms │        213.38 / 216.94 ±3.36 / 222.29 ms │  1.46x slower │
│ QQuery 86 │           38.75 / 39.64 ±0.58 / 40.34 ms │           42.37 / 43.81 ±0.77 / 44.49 ms │  1.11x slower │
│ QQuery 87 │           85.76 / 88.29 ±2.38 / 92.79 ms │           87.38 / 92.28 ±3.97 / 99.10 ms │     no change │
│ QQuery 88 │         98.96 / 100.22 ±0.86 / 101.49 ms │        120.51 / 121.40 ±1.02 / 123.30 ms │  1.21x slower │
│ QQuery 89 │        119.32 / 120.44 ±0.89 / 121.46 ms │        142.91 / 152.99 ±5.34 / 158.52 ms │  1.27x slower │
│ QQuery 90 │           22.62 / 23.32 ±0.42 / 23.78 ms │           23.29 / 24.07 ±0.92 / 25.87 ms │     no change │
│ QQuery 91 │           63.09 / 63.46 ±0.23 / 63.81 ms │        106.76 / 109.15 ±2.15 / 112.40 ms │  1.72x slower │
│ QQuery 92 │           58.10 / 58.71 ±0.65 / 59.96 ms │           57.41 / 58.86 ±1.42 / 61.49 ms │     no change │
│ QQuery 93 │        186.51 / 189.01 ±2.27 / 192.58 ms │        188.51 / 189.96 ±1.66 / 193.12 ms │     no change │
│ QQuery 94 │           61.30 / 62.03 ±0.47 / 62.74 ms │           69.34 / 70.05 ±0.47 / 70.49 ms │  1.13x slower │
│ QQuery 95 │        129.09 / 130.55 ±1.57 / 133.40 ms │        132.46 / 134.58 ±2.31 / 138.78 ms │     no change │
│ QQuery 96 │           72.21 / 73.41 ±1.09 / 74.77 ms │           85.65 / 91.29 ±2.94 / 94.20 ms │  1.24x slower │
│ QQuery 97 │        125.67 / 127.11 ±1.62 / 130.19 ms │        152.45 / 154.62 ±1.61 / 157.33 ms │  1.22x slower │
│ QQuery 98 │        152.02 / 155.78 ±2.15 / 158.19 ms │        114.88 / 118.48 ±2.24 / 121.66 ms │ +1.31x faster │
│ QQuery 99 │ 10766.32 / 10793.80 ±16.81 / 10816.77 ms │ 10764.64 / 10783.69 ±12.28 / 10803.08 ms │     no change │
└───────────┴──────────────────────────────────────────┴──────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 31114.06ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 30154.88ms │
│ Average Time (HEAD)                                   │   314.28ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   304.59ms │
│ Queries Faster                                        │         25 │
│ Queries Slower                                        │         42 │
│ Queries with No Change                                │         32 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘

Resource Usage

tpcds — base (merge-base)

| Metric | Value |
|---|---|
| Wall time | 160.0s |
| Peak memory | 6.3 GiB |
| Avg memory | 5.3 GiB |
| CPU user | 258.6s |
| CPU sys | 17.2s |
| Peak spill | 0 B |

tpcds — branch

| Metric | Value |
|---|---|
| Wall time | 155.0s |
| Peak memory | 6.6 GiB |
| Avg memory | 5.5 GiB |
| CPU user | 211.3s |
| CPU sys | 22.0s |
| Peak spill | 0 B |

File an issue against this benchmark runner

@adriangbot

Benchmark for this request hit the 7200s job deadline before finishing.

Benchmarks requested: tpch

Kubernetes message
Job was active longer than specified deadline

Each Parquet file previously produced a single morsel containing one
`ParquetPushDecoder` over the full pruned `ParquetAccessPlan`. Morselize
at row-group granularity instead: after all pruning work is done, pack
surviving row groups into chunks bounded by a per-morsel row budget and
compressed-byte budget (defaults: 100k rows, 64 MiB). Each chunk becomes
its own stream so the executor can interleave row-group decode work with
other operators and — in a follow-up — let sibling `FileStream`s steal
row-group-sized units of work across partitions.

A single oversized row group still becomes its own morsel; no
sub-row-group splitting is introduced.
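The packing step described above can be sketched as a greedy, in-order pass. This is a minimal illustration, not the code in this PR: `RowGroupInfo`, `pack_row_groups`, and the rule that a chunk is closed just before it would exceed either budget are assumptions made for the example.

```rust
/// Hypothetical per-row-group summary; the field names are illustrative only.
#[derive(Debug, Clone, Copy)]
struct RowGroupInfo {
    index: usize,
    num_rows: u64,
    compressed_bytes: u64,
}

/// Greedily pack row groups, in file order, into chunks of row-group indices.
///
/// Assumption: a chunk is closed as soon as adding the next row group would
/// push it past either budget. A row group that exceeds a budget on its own
/// still gets a chunk to itself; it is never split.
fn pack_row_groups(
    groups: &[RowGroupInfo],
    max_rows: u64,
    max_bytes: u64,
) -> Vec<Vec<usize>> {
    let mut chunks: Vec<Vec<usize>> = Vec::new();
    let mut current: Vec<usize> = Vec::new();
    let mut rows: u64 = 0;
    let mut bytes: u64 = 0;

    for g in groups {
        let would_overflow = rows.saturating_add(g.num_rows) > max_rows
            || bytes.saturating_add(g.compressed_bytes) > max_bytes;

        // Only close a chunk that already holds something; an empty chunk
        // always accepts the next row group, even an oversized one.
        if would_overflow && !current.is_empty() {
            chunks.push(std::mem::take(&mut current));
            rows = 0;
            bytes = 0;
        }

        current.push(g.index);
        rows = rows.saturating_add(g.num_rows);
        bytes = bytes.saturating_add(g.compressed_bytes);
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

With the stated defaults (100k rows, 64 MiB), row groups of 60k, 30k, 20k, 500k, and 10k rows would pack into `[[0, 1], [2], [3], [4]]` under this sketch: the 500k-row group exceeds the row budget by itself and so forms its own chunk.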

`EarlyStoppingStream` (which is driven by the non-Clone `FilePruner`) is
attached only to the first morsel's stream so the whole file can still
short-circuit on dynamic-filter narrowing. Row-group reversal is applied
per-chunk on the `PreparedAccessPlan` and the chunk list is reversed so
reverse output order is preserved.
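To see why both reversals are needed, here is a small sketch that models chunks as plain lists of row-group indices. The helper `reverse_chunks` is hypothetical; the PR applies the per-chunk reversal on `PreparedAccessPlan`, which is not modelled here.

```rust
/// Reverse a chunked row-group plan: reverse the row groups inside each
/// chunk, then reverse the order of the chunks themselves.
/// (Illustrative stand-in operating on bare indices.)
fn reverse_chunks(mut chunks: Vec<Vec<usize>>) -> Vec<Vec<usize>> {
    for chunk in chunks.iter_mut() {
        chunk.reverse();
    }
    chunks.reverse();
    chunks
}

/// Flatten chunks into the order in which row groups would be visited.
fn visit_order(chunks: &[Vec<usize>]) -> Vec<usize> {
    chunks.iter().flatten().copied().collect()
}
```

For chunks `[[0, 1], [2], [3, 4]]` this yields `[[4, 3], [2], [1, 0]]`, so row groups are visited as 4, 3, 2, 1, 0. Reversing only the chunk list would give 3, 4, 2, 0, 1, which is not the reverse of the original order.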

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb
Owner Author

run benchmarks

baseline:
    ref: main
    env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
        DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false
changed:
    ref: HEAD
    env:
        DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4289505736-1683-v7545 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing HEAD (b64f2d9) to main diff using: tpch
Results will be posted here when complete


@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4289505736-1681-7qrkf 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing HEAD (b64f2d9) to main diff using: clickbench_partitioned
Results will be posted here when complete


@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4289505736-1682-6cjpn 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing HEAD (b64f2d9) to main diff using: tpcds
Results will be posted here when complete


@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃ filter-pushdown-with-row-group-morsels ┃         Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.16 / 4.65 ±6.81 / 18.26 ms │           1.18 / 4.41 ±6.37 / 17.16 ms │  +1.05x faster │
│ QQuery 1  │        12.99 / 13.21 ±0.14 / 13.41 ms │         13.80 / 14.53 ±0.37 / 14.81 ms │   1.10x slower │
│ QQuery 2  │        38.02 / 38.28 ±0.29 / 38.81 ms │         38.16 / 38.24 ±0.07 / 38.36 ms │      no change │
│ QQuery 3  │        31.37 / 31.88 ±0.79 / 33.45 ms │         32.48 / 32.85 ±0.27 / 33.28 ms │      no change │
│ QQuery 4  │     240.67 / 248.77 ±4.75 / 253.72 ms │      255.51 / 259.32 ±3.49 / 264.79 ms │      no change │
│ QQuery 5  │     287.74 / 290.05 ±1.94 / 293.27 ms │      294.65 / 299.27 ±2.74 / 303.03 ms │      no change │
│ QQuery 6  │          6.54 / 8.05 ±2.19 / 12.38 ms │            5.20 / 6.37 ±0.88 / 7.46 ms │  +1.26x faster │
│ QQuery 7  │        14.12 / 14.24 ±0.19 / 14.62 ms │         15.53 / 15.74 ±0.14 / 15.96 ms │   1.11x slower │
│ QQuery 8  │     334.80 / 336.89 ±1.78 / 339.66 ms │      355.96 / 363.08 ±7.45 / 376.96 ms │   1.08x slower │
│ QQuery 9  │     510.42 / 522.60 ±9.98 / 537.52 ms │     513.64 / 520.80 ±10.52 / 541.68 ms │      no change │
│ QQuery 10 │        73.72 / 74.96 ±0.77 / 75.94 ms │       99.31 / 101.70 ±3.02 / 107.50 ms │   1.36x slower │
│ QQuery 11 │        85.60 / 86.10 ±0.26 / 86.32 ms │      109.35 / 110.13 ±0.60 / 110.78 ms │   1.28x slower │
│ QQuery 12 │     276.73 / 282.84 ±4.09 / 288.69 ms │      286.15 / 292.24 ±5.74 / 299.18 ms │      no change │
│ QQuery 13 │     401.49 / 409.51 ±6.27 / 419.23 ms │     456.29 / 467.17 ±15.60 / 497.57 ms │   1.14x slower │
│ QQuery 14 │     290.93 / 292.21 ±1.25 / 294.51 ms │      337.34 / 340.35 ±2.01 / 343.48 ms │   1.16x slower │
│ QQuery 15 │     293.68 / 298.35 ±3.50 / 303.27 ms │     316.23 / 335.43 ±15.93 / 358.38 ms │   1.12x slower │
│ QQuery 16 │     630.94 / 636.39 ±5.90 / 647.60 ms │      671.71 / 679.24 ±5.59 / 684.50 ms │   1.07x slower │
│ QQuery 17 │     630.32 / 638.13 ±5.76 / 646.13 ms │      666.18 / 674.89 ±5.36 / 680.67 ms │   1.06x slower │
│ QQuery 18 │  1266.68 / 1279.37 ±7.78 / 1290.36 ms │  1321.71 / 1343.32 ±15.48 / 1359.59 ms │      no change │
│ QQuery 19 │        28.68 / 30.16 ±2.37 / 34.87 ms │         30.03 / 31.33 ±1.28 / 33.76 ms │      no change │
│ QQuery 20 │     519.42 / 524.77 ±6.00 / 535.76 ms │      509.93 / 513.40 ±2.70 / 517.62 ms │      no change │
│ QQuery 21 │     591.65 / 596.23 ±3.92 / 602.73 ms │      577.65 / 583.32 ±4.70 / 588.69 ms │      no change │
│ QQuery 22 │ 1051.93 / 1068.46 ±12.08 / 1082.92 ms │      776.25 / 783.98 ±7.23 / 794.83 ms │  +1.36x faster │
│ QQuery 23 │ 3296.34 / 3330.05 ±24.49 / 3367.80 ms │     273.33 / 298.40 ±21.62 / 337.17 ms │ +11.16x faster │
│ QQuery 24 │        41.72 / 42.13 ±0.36 / 42.76 ms │         35.55 / 36.77 ±0.83 / 38.04 ms │  +1.15x faster │
│ QQuery 25 │     113.33 / 116.14 ±3.28 / 122.32 ms │      120.01 / 121.53 ±1.80 / 124.95 ms │      no change │
│ QQuery 26 │        42.05 / 43.95 ±1.96 / 47.36 ms │         60.82 / 63.04 ±1.80 / 65.14 ms │   1.43x slower │
│ QQuery 27 │    665.41 / 680.68 ±10.17 / 696.34 ms │      635.53 / 641.61 ±5.49 / 651.30 ms │  +1.06x faster │
│ QQuery 28 │  2992.83 / 3008.67 ±8.76 / 3018.18 ms │  2967.51 / 2985.99 ±12.09 / 3005.28 ms │      no change │
│ QQuery 29 │       42.38 / 48.16 ±11.28 / 70.72 ms │         44.46 / 48.56 ±3.22 / 54.08 ms │      no change │
│ QQuery 30 │    307.92 / 323.38 ±25.66 / 374.41 ms │      330.84 / 335.77 ±4.55 / 344.23 ms │      no change │
│ QQuery 31 │     301.47 / 308.46 ±3.65 / 312.17 ms │      333.77 / 343.10 ±7.92 / 353.68 ms │   1.11x slower │
│ QQuery 32 │  1001.79 / 1007.37 ±4.79 / 1015.93 ms │  1010.06 / 1021.84 ±11.17 / 1041.16 ms │      no change │
│ QQuery 33 │ 1415.20 / 1438.33 ±12.88 / 1452.08 ms │  1415.35 / 1433.93 ±12.52 / 1453.00 ms │      no change │
│ QQuery 34 │ 1444.49 / 1460.75 ±14.69 / 1484.94 ms │  1424.63 / 1449.54 ±15.08 / 1462.18 ms │      no change │
│ QQuery 35 │     298.06 / 307.11 ±6.91 / 318.62 ms │      307.38 / 313.73 ±7.46 / 328.11 ms │      no change │
│ QQuery 36 │        63.92 / 71.35 ±5.41 / 80.55 ms │         61.62 / 62.42 ±0.60 / 63.37 ms │  +1.14x faster │
│ QQuery 37 │        36.34 / 36.97 ±0.57 / 37.68 ms │         36.29 / 37.58 ±0.98 / 38.81 ms │      no change │
│ QQuery 38 │        44.00 / 47.91 ±3.40 / 52.55 ms │         36.81 / 39.25 ±1.82 / 41.39 ms │  +1.22x faster │
│ QQuery 39 │     131.64 / 138.40 ±4.61 / 143.91 ms │      119.71 / 121.87 ±3.00 / 127.77 ms │  +1.14x faster │
│ QQuery 40 │        14.88 / 15.13 ±0.14 / 15.29 ms │         20.14 / 21.13 ±1.16 / 22.75 ms │   1.40x slower │
│ QQuery 41 │        14.04 / 16.01 ±3.68 / 23.37 ms │         16.50 / 18.25 ±1.24 / 19.75 ms │   1.14x slower │
│ QQuery 42 │        13.61 / 13.93 ±0.33 / 14.58 ms │         14.16 / 15.00 ±0.45 / 15.51 ms │   1.08x slower │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 20180.99ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 17220.42ms │
│ Average Time (HEAD)                                   │   469.33ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   400.47ms │
│ Queries Faster                                        │          9 │
│ Queries Slower                                        │         15 │
│ Queries with No Change                                │         19 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘

Resource Usage

clickbench_partitioned — base (merge-base)

| Metric | Value |
|---|---|
| Wall time | 105.0s |
| Peak memory | 30.3 GiB |
| Avg memory | 23.0 GiB |
| CPU user | 1074.5s |
| CPU sys | 59.7s |
| Peak spill | 0 B |

clickbench_partitioned — branch

| Metric | Value |
|---|---|
| Wall time | 90.0s |
| Peak memory | 39.0 GiB |
| Avg memory | 32.5 GiB |
| CPU user | 898.6s |
| CPU sys | 72.7s |
| Peak spill | 0 B |

@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                     HEAD ┃   filter-pushdown-with-row-group-morsels ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │              7.23 / 7.80 ±0.78 / 9.33 ms │              6.62 / 7.00 ±0.65 / 8.30 ms │ +1.11x faster │
│ QQuery 2  │        150.28 / 150.99 ±0.52 / 151.82 ms │        108.69 / 110.40 ±1.61 / 112.79 ms │ +1.37x faster │
│ QQuery 3  │        115.19 / 115.80 ±0.57 / 116.57 ms │        125.20 / 126.92 ±1.93 / 130.45 ms │  1.10x slower │
│ QQuery 4  │    1348.95 / 1386.14 ±34.60 / 1438.22 ms │    1083.14 / 1120.31 ±20.90 / 1143.38 ms │ +1.24x faster │
│ QQuery 5  │        173.85 / 175.50 ±1.70 / 178.77 ms │        176.48 / 178.76 ±1.26 / 180.22 ms │     no change │
│ QQuery 6  │       850.53 / 876.98 ±16.47 / 901.02 ms │        229.64 / 240.47 ±9.01 / 255.00 ms │ +3.65x faster │
│ QQuery 7  │        336.28 / 343.97 ±4.20 / 348.51 ms │        327.97 / 335.87 ±4.01 / 339.08 ms │     no change │
│ QQuery 8  │        112.28 / 113.34 ±1.05 / 115.21 ms │        134.40 / 140.09 ±3.80 / 144.09 ms │  1.24x slower │
│ QQuery 9  │        101.36 / 104.23 ±3.43 / 110.95 ms │        100.25 / 102.31 ±2.32 / 105.28 ms │     no change │
│ QQuery 10 │        101.89 / 102.06 ±0.14 / 102.32 ms │        136.27 / 143.57 ±5.75 / 152.10 ms │  1.41x slower │
│ QQuery 11 │        930.34 / 945.36 ±9.96 / 959.49 ms │        709.57 / 712.67 ±2.71 / 717.07 ms │ +1.33x faster │
│ QQuery 12 │           43.58 / 44.18 ±0.34 / 44.64 ms │           35.05 / 37.12 ±1.45 / 39.51 ms │ +1.19x faster │
│ QQuery 13 │        385.27 / 395.62 ±7.54 / 406.74 ms │        572.74 / 579.90 ±6.32 / 589.57 ms │  1.47x slower │
│ QQuery 14 │     1005.94 / 1012.48 ±4.67 / 1018.29 ms │        868.25 / 877.52 ±9.36 / 892.04 ms │ +1.15x faster │
│ QQuery 15 │           14.82 / 15.05 ±0.30 / 15.64 ms │           18.91 / 20.77 ±1.96 / 23.49 ms │  1.38x slower │
│ QQuery 16 │              7.26 / 7.42 ±0.17 / 7.76 ms │              7.36 / 8.24 ±0.77 / 9.65 ms │  1.11x slower │
│ QQuery 17 │        226.12 / 230.74 ±3.43 / 234.07 ms │        173.54 / 179.83 ±3.91 / 185.69 ms │ +1.28x faster │
│ QQuery 18 │        125.38 / 126.36 ±0.71 / 127.23 ms │        183.38 / 192.49 ±5.71 / 198.73 ms │  1.52x slower │
│ QQuery 19 │        160.06 / 162.11 ±2.33 / 166.46 ms │        142.21 / 144.92 ±1.64 / 146.91 ms │ +1.12x faster │
│ QQuery 20 │           13.47 / 13.68 ±0.26 / 14.15 ms │           16.26 / 17.05 ±0.76 / 18.15 ms │  1.25x slower │
│ QQuery 21 │           19.25 / 19.47 ±0.13 / 19.64 ms │           22.94 / 23.52 ±0.43 / 24.04 ms │  1.21x slower │
│ QQuery 22 │        474.08 / 486.23 ±9.43 / 502.66 ms │       493.20 / 510.69 ±14.84 / 533.97 ms │  1.05x slower │
│ QQuery 23 │       827.16 / 884.82 ±36.63 / 936.30 ms │        868.78 / 880.22 ±7.16 / 889.22 ms │     no change │
│ QQuery 24 │        399.46 / 402.67 ±4.12 / 410.74 ms │        121.32 / 125.19 ±3.66 / 132.10 ms │ +3.22x faster │
│ QQuery 25 │        346.83 / 348.37 ±1.33 / 350.19 ms │        282.62 / 291.36 ±4.44 / 294.92 ms │ +1.20x faster │
│ QQuery 26 │           79.13 / 79.53 ±0.33 / 80.10 ms │        149.38 / 154.49 ±6.14 / 166.16 ms │  1.94x slower │
│ QQuery 27 │              7.40 / 7.45 ±0.06 / 7.53 ms │              7.04 / 7.84 ±0.85 / 9.06 ms │  1.05x slower │
│ QQuery 28 │        154.37 / 155.01 ±0.99 / 156.98 ms │        150.68 / 153.69 ±2.27 / 156.40 ms │     no change │
│ QQuery 29 │        285.67 / 289.87 ±2.68 / 293.81 ms │        217.55 / 222.57 ±3.77 / 226.47 ms │ +1.30x faster │
│ QQuery 30 │           42.89 / 43.83 ±1.15 / 46.07 ms │           49.45 / 54.18 ±4.12 / 61.12 ms │  1.24x slower │
│ QQuery 31 │        170.45 / 171.86 ±1.61 / 174.65 ms │        168.49 / 173.17 ±2.98 / 177.16 ms │     no change │
│ QQuery 32 │           14.67 / 15.88 ±2.13 / 20.14 ms │           13.67 / 14.90 ±1.15 / 16.93 ms │ +1.07x faster │
│ QQuery 33 │        139.74 / 142.14 ±2.03 / 144.46 ms │        131.00 / 134.96 ±2.90 / 138.86 ms │ +1.05x faster │
│ QQuery 34 │              7.31 / 7.45 ±0.14 / 7.69 ms │              7.23 / 7.74 ±0.70 / 9.12 ms │     no change │
│ QQuery 35 │        102.55 / 104.40 ±1.18 / 105.63 ms │        108.70 / 110.03 ±1.10 / 111.93 ms │  1.05x slower │
│ QQuery 36 │              7.09 / 7.23 ±0.11 / 7.43 ms │              6.21 / 6.55 ±0.25 / 6.93 ms │ +1.10x faster │
│ QQuery 37 │              8.76 / 8.83 ±0.06 / 8.94 ms │              4.97 / 5.23 ±0.19 / 5.51 ms │ +1.69x faster │
│ QQuery 38 │           90.47 / 92.51 ±1.72 / 94.46 ms │          90.35 / 97.85 ±4.92 / 105.70 ms │  1.06x slower │
│ QQuery 39 │        119.23 / 129.19 ±6.18 / 138.69 ms │        133.58 / 137.97 ±5.01 / 147.48 ms │  1.07x slower │
│ QQuery 40 │        103.41 / 110.78 ±6.91 / 121.55 ms │       117.20 / 123.99 ±10.84 / 145.44 ms │  1.12x slower │
│ QQuery 41 │           13.99 / 14.24 ±0.21 / 14.59 ms │           15.34 / 16.49 ±0.72 / 17.42 ms │  1.16x slower │
│ QQuery 42 │        108.17 / 109.67 ±1.12 / 111.12 ms │        110.20 / 113.81 ±2.69 / 117.44 ms │     no change │
│ QQuery 43 │              5.62 / 5.73 ±0.13 / 5.98 ms │              5.32 / 5.50 ±0.20 / 5.90 ms │     no change │
│ QQuery 44 │           11.74 / 12.72 ±1.71 / 16.13 ms │           11.41 / 11.64 ±0.16 / 11.87 ms │ +1.09x faster │
│ QQuery 45 │           48.90 / 49.14 ±0.14 / 49.34 ms │           43.64 / 44.98 ±1.10 / 46.73 ms │ +1.09x faster │
│ QQuery 46 │             8.63 / 9.11 ±0.70 / 10.49 ms │              8.09 / 8.60 ±0.66 / 9.88 ms │ +1.06x faster │
│ QQuery 47 │       702.02 / 743.97 ±23.21 / 766.00 ms │       689.69 / 740.90 ±27.00 / 764.27 ms │     no change │
│ QQuery 48 │        274.61 / 281.02 ±5.17 / 290.36 ms │        349.67 / 365.00 ±9.82 / 379.76 ms │  1.30x slower │
│ QQuery 49 │        249.66 / 251.79 ±1.15 / 253.05 ms │        241.63 / 244.58 ±1.68 / 246.47 ms │     no change │
│ QQuery 50 │        203.89 / 214.39 ±7.04 / 225.58 ms │        245.15 / 257.87 ±7.87 / 267.07 ms │  1.20x slower │
│ QQuery 51 │        176.97 / 182.99 ±4.12 / 187.90 ms │        207.98 / 211.46 ±2.52 / 215.42 ms │  1.16x slower │
│ QQuery 52 │        106.08 / 107.81 ±1.65 / 110.26 ms │        107.64 / 110.52 ±2.44 / 114.03 ms │     no change │
│ QQuery 53 │        102.53 / 104.57 ±1.92 / 108.17 ms │        141.61 / 145.47 ±2.81 / 149.06 ms │  1.39x slower │
│ QQuery 54 │        144.43 / 148.02 ±3.00 / 152.03 ms │        130.32 / 134.01 ±3.24 / 138.98 ms │ +1.10x faster │
│ QQuery 55 │        107.79 / 108.30 ±0.47 / 108.88 ms │        108.73 / 111.90 ±2.76 / 116.60 ms │     no change │
│ QQuery 56 │        142.05 / 143.48 ±1.23 / 145.28 ms │        133.98 / 135.87 ±1.53 / 137.38 ms │ +1.06x faster │
│ QQuery 57 │        167.93 / 171.00 ±1.80 / 173.54 ms │        186.52 / 189.55 ±1.76 / 191.74 ms │  1.11x slower │
│ QQuery 58 │        317.87 / 318.63 ±0.80 / 320.03 ms │        227.26 / 232.41 ±3.01 / 236.19 ms │ +1.37x faster │
│ QQuery 59 │        199.16 / 205.50 ±5.82 / 212.60 ms │        234.87 / 241.42 ±4.32 / 247.26 ms │  1.17x slower │
│ QQuery 60 │        140.99 / 142.64 ±2.32 / 147.14 ms │        140.24 / 140.65 ±0.59 / 141.82 ms │     no change │
│ QQuery 61 │           13.21 / 13.51 ±0.27 / 13.85 ms │           12.65 / 13.12 ±0.76 / 14.62 ms │     no change │
│ QQuery 62 │       874.78 / 886.86 ±11.40 / 908.55 ms │     970.45 / 1013.12 ±39.69 / 1074.88 ms │  1.14x slower │
│ QQuery 63 │        102.61 / 104.04 ±2.15 / 108.30 ms │        146.94 / 149.82 ±2.48 / 154.17 ms │  1.44x slower │
│ QQuery 64 │        667.32 / 680.72 ±7.69 / 690.62 ms │       733.08 / 759.94 ±18.00 / 781.73 ms │  1.12x slower │
│ QQuery 65 │       250.33 / 263.24 ±10.14 / 275.06 ms │        331.60 / 340.23 ±7.94 / 349.79 ms │  1.29x slower │
│ QQuery 66 │       213.41 / 232.05 ±14.12 / 247.68 ms │        180.04 / 190.77 ±7.08 / 198.55 ms │ +1.22x faster │
│ QQuery 67 │       292.70 / 304.76 ±18.81 / 342.13 ms │       473.79 / 498.59 ±18.68 / 522.88 ms │  1.64x slower │
│ QQuery 68 │              8.46 / 8.68 ±0.21 / 9.06 ms │           10.66 / 12.46 ±1.93 / 15.90 ms │  1.44x slower │
│ QQuery 69 │         98.09 / 102.07 ±4.34 / 110.33 ms │        134.04 / 142.69 ±4.65 / 147.89 ms │  1.40x slower │
│ QQuery 70 │       308.74 / 320.65 ±12.04 / 342.93 ms │        388.81 / 397.61 ±4.88 / 403.20 ms │  1.24x slower │
│ QQuery 71 │        133.44 / 138.01 ±5.41 / 148.28 ms │        126.51 / 130.31 ±2.23 / 132.83 ms │ +1.06x faster │
│ QQuery 72 │       591.54 / 612.13 ±12.42 / 626.58 ms │        487.00 / 488.13 ±1.59 / 491.24 ms │ +1.25x faster │
│ QQuery 73 │              6.65 / 6.79 ±0.21 / 7.21 ms │              7.04 / 7.35 ±0.20 / 7.64 ms │  1.08x slower │
│ QQuery 74 │       559.42 / 600.42 ±33.53 / 646.88 ms │        505.62 / 521.53 ±8.66 / 531.75 ms │ +1.15x faster │
│ QQuery 75 │        267.05 / 271.29 ±3.25 / 275.78 ms │        275.56 / 281.33 ±4.85 / 287.18 ms │     no change │
│ QQuery 76 │        130.20 / 134.11 ±5.45 / 144.78 ms │        158.57 / 160.54 ±1.27 / 162.29 ms │  1.20x slower │
│ QQuery 77 │        189.10 / 191.19 ±1.86 / 194.41 ms │        204.70 / 207.74 ±2.70 / 212.55 ms │  1.09x slower │
│ QQuery 78 │        338.33 / 344.53 ±4.13 / 349.16 ms │        307.80 / 313.71 ±7.19 / 327.86 ms │ +1.10x faster │
│ QQuery 79 │        228.00 / 230.61 ±2.67 / 235.69 ms │        258.23 / 266.56 ±7.58 / 279.87 ms │  1.16x slower │
│ QQuery 80 │        321.36 / 323.53 ±1.59 / 325.09 ms │       258.99 / 278.46 ±10.68 / 290.02 ms │ +1.16x faster │
│ QQuery 81 │           25.27 / 25.53 ±0.21 / 25.78 ms │           29.94 / 32.70 ±1.63 / 34.66 ms │  1.28x slower │
│ QQuery 82 │           39.24 / 39.86 ±0.36 / 40.28 ms │           45.27 / 46.66 ±0.70 / 47.06 ms │  1.17x slower │
│ QQuery 83 │           36.56 / 36.93 ±0.26 / 37.31 ms │           42.64 / 44.37 ±1.06 / 45.35 ms │  1.20x slower │
│ QQuery 84 │           45.87 / 46.29 ±0.40 / 46.93 ms │           50.79 / 52.93 ±1.10 / 53.93 ms │  1.14x slower │
│ QQuery 85 │        140.16 / 140.96 ±0.76 / 142.21 ms │        210.04 / 218.89 ±5.73 / 228.10 ms │  1.55x slower │
│ QQuery 86 │           38.63 / 39.29 ±0.58 / 40.27 ms │           42.68 / 43.71 ±0.79 / 45.10 ms │  1.11x slower │
│ QQuery 87 │              3.61 / 3.72 ±0.13 / 3.97 ms │          92.44 / 98.42 ±5.48 / 107.55 ms │ 26.44x slower │
│ QQuery 88 │        102.07 / 104.34 ±2.53 / 108.74 ms │        118.50 / 119.65 ±0.69 / 120.65 ms │  1.15x slower │
│ QQuery 89 │        118.45 / 119.94 ±1.27 / 122.07 ms │        143.39 / 151.57 ±6.05 / 160.76 ms │  1.26x slower │
│ QQuery 90 │           23.28 / 23.62 ±0.19 / 23.85 ms │           24.02 / 25.57 ±0.93 / 26.86 ms │  1.08x slower │
│ QQuery 91 │           60.40 / 62.32 ±2.50 / 67.26 ms │        103.81 / 108.69 ±4.87 / 117.73 ms │  1.74x slower │
│ QQuery 92 │           58.08 / 58.59 ±0.53 / 59.42 ms │           61.67 / 62.15 ±0.65 / 63.41 ms │  1.06x slower │
│ QQuery 93 │        189.94 / 192.39 ±2.03 / 195.76 ms │        184.55 / 193.36 ±6.34 / 200.59 ms │     no change │
│ QQuery 94 │           62.71 / 63.09 ±0.28 / 63.46 ms │           68.89 / 70.68 ±1.40 / 73.08 ms │  1.12x slower │
│ QQuery 95 │        129.62 / 130.76 ±0.95 / 132.47 ms │        121.65 / 126.33 ±5.37 / 136.59 ms │     no change │
│ QQuery 96 │           68.80 / 70.72 ±1.19 / 71.83 ms │           84.51 / 88.27 ±2.76 / 92.97 ms │  1.25x slower │
│ QQuery 97 │        122.21 / 123.63 ±0.93 / 124.58 ms │        150.39 / 156.05 ±4.11 / 161.52 ms │  1.26x slower │
│ QQuery 98 │        156.88 / 160.39 ±1.98 / 162.97 ms │        114.09 / 118.18 ±2.39 / 120.94 ms │ +1.36x faster │
│ QQuery 99 │ 10909.19 / 10996.41 ±70.05 / 11093.29 ms │ 10909.25 / 10973.54 ±53.38 / 11031.61 ms │     no change │
└───────────┴──────────────────────────────────────────┴──────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 31370.06ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 30880.68ms │
│ Average Time (HEAD)                                   │   316.87ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   311.93ms │
│ Queries Faster                                        │         29 │
│ Queries Slower                                        │         51 │
│ Queries with No Change                                │         19 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘

Resource Usage

tpcds — base (merge-base)

| Metric | Value |
|--------|-------|
| Wall time | 160.0s |
| Peak memory | 6.3 GiB |
| Avg memory | 5.6 GiB |
| CPU user | 263.9s |
| CPU sys | 8.7s |
| Peak spill | 0 B |

tpcds — branch

| Metric | Value |
|--------|-------|
| Wall time | 155.0s |
| Peak memory | 6.5 GiB |
| Avg memory | 5.4 GiB |
| CPU user | 215.4s |
| CPU sys | 23.0s |
| Peak spill | 0 B |

File an issue against this benchmark runner

@adriangb
Owner Author

run benchmark clickbench_partitioned

baseline:
    ref: main
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
       DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: true
changed:
    ref: HEAD
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true

@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4289709612-1685-jb5wr 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing HEAD (28ebd52) to main diff using: clickbench_partitioned
Results will be posted here when complete


@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃ filter-pushdown-with-row-group-morsels ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.17 / 4.70 ±6.90 / 18.51 ms │           1.18 / 4.47 ±6.43 / 17.32 ms │     no change │
│ QQuery 1  │        12.71 / 13.03 ±0.21 / 13.28 ms │         15.42 / 16.32 ±0.53 / 16.86 ms │  1.25x slower │
│ QQuery 2  │        37.47 / 38.41 ±0.87 / 40.04 ms │         41.82 / 42.20 ±0.45 / 43.05 ms │  1.10x slower │
│ QQuery 3  │        32.12 / 33.95 ±1.49 / 35.85 ms │         34.15 / 34.28 ±0.18 / 34.63 ms │     no change │
│ QQuery 4  │     248.60 / 250.92 ±2.04 / 253.88 ms │      260.51 / 268.08 ±8.05 / 282.65 ms │  1.07x slower │
│ QQuery 5  │     288.64 / 291.35 ±2.57 / 295.39 ms │      305.57 / 310.71 ±5.56 / 319.49 ms │  1.07x slower │
│ QQuery 6  │           5.96 / 6.30 ±0.31 / 6.76 ms │           5.83 / 8.96 ±4.48 / 17.71 ms │  1.42x slower │
│ QQuery 7  │        16.37 / 16.70 ±0.19 / 16.94 ms │         17.46 / 17.61 ±0.16 / 17.89 ms │  1.05x slower │
│ QQuery 8  │     338.72 / 340.12 ±1.45 / 342.43 ms │      368.50 / 372.14 ±4.40 / 380.34 ms │  1.09x slower │
│ QQuery 9  │     507.84 / 519.56 ±9.30 / 529.41 ms │      524.28 / 533.67 ±5.14 / 538.77 ms │     no change │
│ QQuery 10 │       99.39 / 99.86 ±0.58 / 100.94 ms │      101.84 / 104.98 ±3.12 / 109.74 ms │  1.05x slower │
│ QQuery 11 │     108.66 / 109.61 ±0.67 / 110.31 ms │      113.19 / 113.92 ±0.77 / 115.24 ms │     no change │
│ QQuery 12 │     316.41 / 319.26 ±2.18 / 322.27 ms │     294.70 / 305.41 ±10.63 / 324.24 ms │     no change │
│ QQuery 13 │    441.74 / 461.62 ±16.00 / 488.41 ms │     465.02 / 489.68 ±19.02 / 513.22 ms │  1.06x slower │
│ QQuery 14 │     328.66 / 335.32 ±6.95 / 344.50 ms │      340.35 / 345.12 ±3.11 / 349.13 ms │     no change │
│ QQuery 15 │     295.27 / 298.01 ±3.16 / 304.03 ms │     328.98 / 347.52 ±16.27 / 377.75 ms │  1.17x slower │
│ QQuery 16 │     635.23 / 640.71 ±3.23 / 644.30 ms │      688.20 / 691.79 ±3.31 / 698.05 ms │  1.08x slower │
│ QQuery 17 │     635.40 / 652.16 ±9.09 / 659.71 ms │      689.89 / 692.70 ±2.41 / 696.48 ms │  1.06x slower │
│ QQuery 18 │ 1297.48 / 1313.87 ±13.50 / 1329.99 ms │  1366.82 / 1427.81 ±44.46 / 1479.28 ms │  1.09x slower │
│ QQuery 19 │        30.97 / 31.30 ±0.49 / 32.27 ms │        32.69 / 46.65 ±17.08 / 75.16 ms │  1.49x slower │
│ QQuery 20 │    519.75 / 531.80 ±11.20 / 551.88 ms │     527.47 / 534.24 ±10.38 / 554.75 ms │     no change │
│ QQuery 21 │     573.45 / 582.17 ±7.71 / 594.49 ms │      584.27 / 590.08 ±3.00 / 592.46 ms │     no change │
│ QQuery 22 │    934.30 / 952.36 ±17.52 / 975.32 ms │      783.86 / 795.84 ±7.71 / 807.33 ms │ +1.20x faster │
│ QQuery 23 │     114.85 / 125.23 ±7.01 / 134.99 ms │     290.27 / 312.90 ±19.22 / 347.25 ms │  2.50x slower │
│ QQuery 24 │        42.50 / 43.35 ±0.55 / 44.13 ms │         36.24 / 38.79 ±1.81 / 41.85 ms │ +1.12x faster │
│ QQuery 25 │     149.41 / 151.92 ±2.29 / 154.99 ms │      121.50 / 122.41 ±0.91 / 123.87 ms │ +1.24x faster │
│ QQuery 26 │        62.48 / 63.41 ±0.88 / 64.99 ms │         63.12 / 64.85 ±1.37 / 66.94 ms │     no change │
│ QQuery 27 │    718.46 / 730.30 ±10.06 / 744.30 ms │      645.44 / 648.76 ±3.97 / 655.29 ms │ +1.13x faster │
│ QQuery 28 │  3045.70 / 3057.58 ±7.97 / 3068.48 ms │   3036.86 / 3043.54 ±5.73 / 3051.01 ms │     no change │
│ QQuery 29 │      43.28 / 59.64 ±23.51 / 103.87 ms │         47.33 / 51.68 ±2.93 / 55.48 ms │ +1.15x faster │
│ QQuery 30 │     320.83 / 325.80 ±4.91 / 334.48 ms │      334.00 / 341.20 ±8.26 / 357.17 ms │     no change │
│ QQuery 31 │     319.28 / 324.84 ±6.06 / 336.47 ms │      336.22 / 343.05 ±5.58 / 351.05 ms │  1.06x slower │
│ QQuery 32 │ 1023.84 / 1036.55 ±14.61 / 1064.57 ms │  1044.54 / 1057.65 ±19.91 / 1097.01 ms │     no change │
│ QQuery 33 │ 1469.46 / 1497.79 ±19.30 / 1525.88 ms │  1449.80 / 1473.57 ±19.33 / 1496.72 ms │     no change │
│ QQuery 34 │ 1468.21 / 1502.01 ±23.21 / 1536.74 ms │  1514.52 / 1527.73 ±10.58 / 1546.67 ms │     no change │
│ QQuery 35 │    296.59 / 318.29 ±24.90 / 352.69 ms │      330.62 / 335.43 ±5.12 / 345.33 ms │  1.05x slower │
│ QQuery 36 │        62.78 / 73.21 ±7.05 / 83.44 ms │         63.08 / 66.25 ±3.18 / 72.30 ms │ +1.10x faster │
│ QQuery 37 │        41.28 / 42.42 ±1.53 / 45.27 ms │         36.50 / 37.95 ±1.43 / 40.62 ms │ +1.12x faster │
│ QQuery 38 │        35.70 / 39.18 ±2.06 / 41.69 ms │         36.96 / 40.45 ±2.19 / 43.51 ms │     no change │
│ QQuery 39 │     119.80 / 131.21 ±6.54 / 138.35 ms │      120.18 / 123.14 ±2.55 / 126.71 ms │ +1.07x faster │
│ QQuery 40 │        18.95 / 19.90 ±1.28 / 22.44 ms │         20.73 / 23.24 ±2.11 / 26.94 ms │  1.17x slower │
│ QQuery 41 │        17.64 / 19.08 ±1.54 / 21.95 ms │         16.99 / 18.38 ±0.80 / 19.25 ms │     no change │
│ QQuery 42 │        14.92 / 15.10 ±0.15 / 15.37 ms │         16.17 / 16.35 ±0.21 / 16.72 ms │  1.08x slower │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 17419.92ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 17781.51ms │
│ Average Time (HEAD)                                   │   405.11ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   413.52ms │
│ Queries Faster                                        │          8 │
│ Queries Slower                                        │         19 │
│ Queries with No Change                                │         16 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘

Resource Usage

clickbench_partitioned — base (merge-base)

| Metric | Value |
|--------|-------|
| Wall time | 90.0s |
| Peak memory | 30.2 GiB |
| Avg memory | 22.9 GiB |
| CPU user | 924.1s |
| CPU sys | 55.4s |
| Peak spill | 0 B |

clickbench_partitioned — branch

| Metric | Value |
|--------|-------|
| Wall time | 90.0s |
| Peak memory | 38.1 GiB |
| Avg memory | 28.2 GiB |
| CPU user | 921.8s |
| CPU sys | 80.7s |
| Peak spill | 0 B |

@adriangb
Owner Author

Regression diagnosis: it's the adaptive tracker itself, not the placement

TL;DR: On the main regressed queries, forcing every filter to a static placement (either all-PostScan or all-RowFilter) is 20–50 ms faster than the adaptive path. The tracker's per-morsel partition_filters call and per-batch SelectivityStats::update are adding more overhead than the adaptive decision is saving.

Local A/B with four placement strategies (5 iterations, M-series)

Knobs used on the branch binary + --pushdown:

  • adaptive: defaults (filter_collecting_byte_ratio_threshold=0.20, filter_pushdown_min_bytes_per_sec=104857600)
  • all-postscan: byte_ratio_threshold=0.0 + min_bytes_per_sec=1e18 (every filter stays PostScan forever)
  • all-rowfilter: byte_ratio_threshold=10.0 + min_bytes_per_sec=1e18 (every filter stays RowFilter forever)
  • main+off: reference, apache/main with pushdown=off

| Query | adaptive (ms) | all-postscan (ms) | all-rowfilter (ms) | main+off (ms) |
|-------|--------------:|------------------:|-------------------:|--------------:|
| Q10   |         97.05 |             70.20 |              70.73 |         78.92 |
| Q11   |        102.28 |             85.03 |              79.36 |         80.26 |
| Q14   |        365.03 |            312.54 |             313.04 |        318.98 |
| Q26   |         57.57 |             51.39 |              54.92 |         41.68 |
| Q40   |         13.40 |             10.74 |              14.18 |         11.19 |

For Q10, Q14, and Q40, all-postscan is faster than apache/main with pushdown=off: on those queries the branch's "worst case" placement already beats the no-pushdown baseline, and the adaptive path is just spending extra CPU on top of that.

Where the overhead is

Two sources, both on the hot path:

  1. SelectivityTracker::partition_filters fires per morsel. Each call takes the tracker's inner mutex, reads filter_stats under a parking_lot RwLock, iterates all conjuncts, computes byte_ratio (walking row groups), and updates filter_states. At ~50-100µs per call × ~300 morsels (100 files × 2-3 chunks each) = 15-30ms on the critical path, contested across partitions. This is the cost we paid to "let stats flow between morsels within a file" and it's real — Q14 also benefits from it (it would be 700+ ms without per-morsel placement).

  2. apply_post_scan_filters_with_stats::tracker.update runs per batch. Each batch locks filter_stats (RwLock read) then the per-filter Mutex, does Welford math, and checks the skip-flag gate. For a ~1M-row file at 8k-row batches, that's ~125 lock pairs per file × 100 files = ~12,500 lock-pair acquisitions per query.
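
To make the per-batch cost in (2) concrete, here is a minimal sketch of the "lock pair + Welford" shape described above. `FilterStats`, `record_batch`, and the `f64` sample are hypothetical stand-ins, not the PR's actual `SelectivityStats`, and it uses `std::sync` locks rather than parking_lot; only the Welford update itself is the standard algorithm.

```rust
use std::sync::{Mutex, RwLock};

/// Hypothetical running statistics for one filter (not the PR's real type).
#[derive(Debug, Default)]
pub struct FilterStats {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl FilterStats {
    /// Welford's online update: one pass, O(1) memory, numerically stable.
    pub fn update(&mut self, sample: f64) {
        self.count += 1;
        let delta = sample - self.mean;
        self.mean += delta / self.count as f64;
        let delta2 = sample - self.mean;
        self.m2 += delta * delta2;
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample variance; `None` until at least two samples have been seen.
    pub fn variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }
}

/// The per-batch path: one read lock on the stats table plus one per-filter
/// mutex for every batch. This is the "lock pair" counted in the estimate above.
pub fn record_batch(stats: &RwLock<Vec<Mutex<FilterStats>>>, filter_id: usize, sample: f64) {
    let table = stats.read().unwrap();
    table[filter_id].lock().unwrap().update(sample);
}
```

The arithmetic per call is tiny; whatever cost there is comes from taking both locks on every batch from every partition.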

What does this tell us

Adaptive currently beats pure static in the one spot it was designed for — hash-join dynamic filters like Q23 (3.3 s → 298 ms, still 11× faster than main+off). But on ClickBench user-written filters, static PostScan is a very hard baseline to beat, and our per-morsel/per-batch bookkeeping consistently loses to it.

Possible fixes, smallest → largest

  1. Reduce partition_filters frequency. Cache the placement per-file in LazyMorselShared after the first morsel's decision; re-query the tracker only every N morsels or when stats move by Δ. Keeps the "feedback within a file" property for long files but cuts 80% of lock acquisitions.
  2. Batch SelectivityStats::update. Accumulate per-batch counters on the morsel's stack, flush to the shared stats once per morsel (or on Drop). Removes per-batch locking entirely.
  3. Skip the update path for finalized filters. If min_bytes_per_sec is infinity (default off) OR the filter has been in the same state for N consecutive flushes with stable CI, stop updating it — it's not going to move.
  4. Skip the tracker when the query isn't going to adapt. If we detect at build_stream time that there are no dynamic/optional filters and the user's pushdown mode is "on", we could short-circuit to plain row-filter for small-byte-ratio filters (like the old reorder_filters code), skipping the tracker entirely.
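
Fix (1) could look roughly like the sketch below. `Placement`, `PlacementCache`, and the `query_tracker` closure are hypothetical names for illustration; a real version would live behind `LazyMorselShared` and need synchronisation, which is left out here. The "every N morsels" policy is the only part taken from the list above.

```rust
/// Hypothetical per-filter placement decision.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Placement {
    RowFilter,
    PostScan,
}

/// Caches the tracker's answer and re-queries it only every `refresh_every` morsels.
pub struct PlacementCache {
    refresh_every: u64,
    morsels_seen: u64,
    cached: Option<Vec<Placement>>,
}

impl PlacementCache {
    pub fn new(refresh_every: u64) -> Self {
        assert!(refresh_every > 0, "refresh interval must be non-zero");
        Self { refresh_every, morsels_seen: 0, cached: None }
    }

    /// Returns the placement for the next morsel. `query_tracker` stands in for
    /// the expensive, lock-taking call and runs only on morsels 0, N, 2N, ...
    pub fn get(&mut self, query_tracker: impl FnOnce() -> Vec<Placement>) -> &[Placement] {
        let due = self.morsels_seen % self.refresh_every == 0;
        self.morsels_seen += 1;
        if due || self.cached.is_none() {
            self.cached = Some(query_tracker());
        }
        self.cached.as_deref().unwrap()
    }
}
```

With an illustrative `refresh_every = 5`, the tracker is consulted on one morsel in five, which is where an "80% fewer lock acquisitions" figure would come from; the right N (or a stats-delta trigger) is an open choice.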

Happy to prototype one of these — (2) looks cheapest and most impactful since it removes the per-batch locks without changing any placement behavior. (4) would structurally recover the old code path for the common case, at the cost of "no adaptation unless we know adaptation is needed".
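
A minimal sketch of what (2) might look like, assuming plain counters are enough: `Counters`, `LocalStats`, and `run_morsel` are hypothetical names, not types from this branch. Each morsel accumulates on its own stack and touches the shared mutex once, in `Drop`.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical shared per-filter counters.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Counters {
    pub rows_in: u64,
    pub rows_out: u64,
    pub batches: u64,
}

/// Morsel-local accumulator: `record` never locks; the flush happens on drop.
pub struct LocalStats {
    shared: Arc<Mutex<Counters>>,
    pending: Counters,
}

impl LocalStats {
    pub fn new(shared: Arc<Mutex<Counters>>) -> Self {
        Self { shared, pending: Counters::default() }
    }

    /// Called once per batch; pure arithmetic on morsel-local state.
    pub fn record(&mut self, rows_in: u64, rows_out: u64) {
        self.pending.rows_in += rows_in;
        self.pending.rows_out += rows_out;
        self.pending.batches += 1;
    }
}

impl Drop for LocalStats {
    fn drop(&mut self) {
        if self.pending.batches == 0 {
            return; // nothing recorded, skip the lock entirely
        }
        // Recover the guard even if another thread panicked while holding it.
        let mut total = self.shared.lock().unwrap_or_else(|e| e.into_inner());
        total.rows_in += self.pending.rows_in;
        total.rows_out += self.pending.rows_out;
        total.batches += self.pending.batches;
    }
}

/// Processes one morsel's batches, given as (rows_in, rows_out) pairs.
pub fn run_morsel(shared: &Arc<Mutex<Counters>>, batches: &[(u64, u64)]) {
    let mut local = LocalStats::new(Arc::clone(shared));
    for &(rows_in, rows_out) in batches {
        local.record(rows_in, rows_out);
    }
    // `local` is dropped here: one lock acquisition per morsel.
}
```

Sums merge trivially; carrying the Welford mean and variance through the same flush would need a combine step for partial aggregates, so this only covers the counter part of the stats.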

On the morsel split itself

No-filter queries (Q15/Q16/Q17) also regress ~5–10% at pushdown=off, which is independent of everything above — it's the morsel-split fan-out cost (multiple decoders + readers + projectors per file). Setting morsel_max_rows=u64::MAX (1 chunk/file) doesn't recover it fully either, so some overhead is in the lazy-morsel code path itself, not the fan-out. Smaller effect than the tracker overhead, but worth a separate pass.

🤖 Generated with Claude Code

@adriangbot

Benchmark for this request hit the 7200s job deadline before finishing.

Benchmarks requested: tpch

Kubernetes message
Job was active longer than specified deadline

@adriangb
Owner Author

Isolation experiment: where does the overhead actually live?

Built the three strata separately, then 10-iter 4-way on ClickBench-partitioned (local M-series):

| Binary | Content |
|--------|---------|
| main | apache/main @ 9a1ed57859 |
| lazy-split | PR #10's morsel commit + ParquetLazyMorsel refactor without any adaptive-filter code (new branch morsel-split-lazy) |
| full | this PR branch before today's fast-path commit |
| full-opt | this PR branch + today's "skip adaptive bookkeeping when post-scan is empty" commit |

pushdown=off

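| Query | main (ms) | lazy-split (ms) | full (ms) | full-opt (ms) |
|-------|----------:|----------------:|----------:|--------------:|
| Q14   |       325 |             313 |       311 |           304 |
| Q15   |       287 |             276 |       310 |           320 |
| Q16   |       757 |             743 |       831 |           815 |
| Q24   |        43 |              46 |        47 |            51 |
| Q26   |        41 |              49 |        44 |            45 |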

pushdown=on

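| Query | main (ms) | lazy-split (ms) | full (ms) | full-opt (ms) |
|-------|----------:|----------------:|----------:|--------------:|
| Q10   |        91 |              96 |        87 |            87 |
| Q11   |        99 |              98 |        96 |            96 |
| Q14   |       354 |             348 |       343 |           348 |
| Q24   |        41 |              40 |        31 |            35 |
| Q26   |        56 |              60 |        57 |            54 |
| Q40   |        14 |              13 |        13 |            13 |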

Findings

  1. Eager morsel split (PR #10, "feat: split Parquet files into row-group-sized morsels", as-is) is measurably slower than main on some queries (+15% on Q15/Q16 at pushdown=off). Root cause: packing every chunk's decoder setup (build_row_filter + ParquetPushDecoderBuilder::build + create_reader) into one burst inside build_stream, all on the scheduler thread before any morsel starts executing.
  2. Lazy morsels alone recover it: lazy-split beats main on the heavier queries (Q14, Q15, Q16 at pushdown=off) and is within 8 ms of it on every other query above, with no adaptive machinery in sight. The win comes from moving that same per-chunk setup into each morsel's into_stream, so the CPU cost is distributed across the worker pool instead of serialised on the planner.
  3. Adaptive, at pushdown=on, is at parity with or beats main on every tested query. The "20–50 ms regressions" we saw a few runs ago look like they were largely variance + the older eager-morsel code. The earlier all-postscan experiment still stands as evidence that placement decisions matter, but the tracker data structures themselves are not the cost source.
  4. Microbench separately confirms (3): SelectivityTracker::update is 9.9 ns/call, partition_filters is 40 ns warm, and a full-query simulation (100 files × 3 morsels × 60 batches × 3 filters) is 521 µs total. The tracker itself is an order of magnitude below the noise floor of the full query.
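
As a sanity check on finding 4, the per-call numbers can be multiplied out. This is a back-of-envelope sketch, assuming one `update` per filter per batch and one `partition_filters` call per morsel; the function and constant names are made up for illustration.

```rust
/// Per-call costs quoted in the microbench above, in nanoseconds.
pub const UPDATE_NS: f64 = 9.9;
pub const PARTITION_FILTERS_NS: f64 = 40.0;

/// Estimated total tracker cost for one query, in microseconds.
/// Assumes one `update` per (batch, filter) and one `partition_filters` per morsel.
pub fn estimated_tracker_cost_us(
    files: u64,
    morsels_per_file: u64,
    batches_per_morsel: u64,
    filters: u64,
) -> f64 {
    let morsels = files * morsels_per_file;
    let updates = morsels * batches_per_morsel * filters;
    let total_ns = updates as f64 * UPDATE_NS + morsels as f64 * PARTITION_FILTERS_NS;
    total_ns / 1_000.0
}
```

For 100 files × 3 morsels × 60 batches × 3 filters this gives 54,000 updates and 300 placement calls, about 547 µs, which is in the same range as the 521 µs measured by the simulation.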

What changed in full-opt

ParquetLazyMorsel::build_stream_now used to unconditionally allocate Vec<Arc<dyn PhysicalExpr>> for the read-plan expression set and iterate the (empty) post-scan list to precompute per-filter byte rates. Both paths now take a fast branch when post_scan_conjuncts.is_empty(), which covers pushdown=off and any query without filters: the projection's expr_iter() is passed directly to build_projection_read_plan and the two empty loops are skipped. Same behaviour, tighter per-morsel path. Pushed as af2a26ff15.

Next

Given that lazy-split is a clean win over PR #10 and adaptive is at parity with main at pushdown=on, I'd suggest:

🤖 Generated with Claude Code

@adriangb
Owner Author

run benchmark clickbench_partitioned

baseline:
    ref: main
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true
       DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: true
changed:
    ref: HEAD
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true

@adriangb
Owner Author

run benchmark clickbench_partitioned

baseline:
    ref: main
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
       DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: true
changed:
    ref: HEAD
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: true

The previous `build_stream` built every morsel's `RowFilter`,
`ParquetPushDecoder`, `AsyncFileReader`, and `Projector` eagerly in a
single loop inside the file planner — before any morsel was scheduled.
That loop ran on the scheduler thread and was visible as a 10–15%
regression vs. main on ClickBench-partitioned queries that have many
row-group morsels per file (e.g. Q15, Q16 at pushdown=off).

Replace `ParquetStreamMorsel` (which held a pre-built `BoxStream`) with
`ParquetLazyMorsel`, which holds only the per-chunk `ParquetAccessPlan`
plus an `Arc<LazyMorselShared>` of the file-level state. The decoder
and reader are constructed inside `Morsel::into_stream`, so each
morsel pays its setup cost only when the scheduler actually picks it
up, and the work is distributed across worker threads instead of
serialised on the planner.
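
The shape of that change, as a toy sketch: `Shared`, `LazyMorsel`, and `plan_file` are simplified hypothetical stand-ins (row-group indices instead of a `ParquetAccessPlan`, an iterator instead of a `BoxStream`), and the counter exists only to make the deferred setup observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical file-level state shared by every morsel of one file.
pub struct Shared {
    pub path: String,
    /// Counts how many times the (simulated) decoder setup has run.
    pub decoders_built: AtomicUsize,
}

/// Holds only the cheap per-chunk plan plus a handle to the shared state.
pub struct LazyMorsel {
    row_groups: Vec<usize>,
    shared: Arc<Shared>,
}

impl LazyMorsel {
    /// The expensive setup happens here, on whichever worker picks the morsel up.
    pub fn into_stream(self) -> impl Iterator<Item = String> {
        let LazyMorsel { row_groups, shared } = self;
        shared.decoders_built.fetch_add(1, Ordering::Relaxed); // stand-in for decoder + reader setup
        row_groups
            .into_iter()
            .map(move |rg| format!("{}#rg{}", shared.path, rg))
    }
}

/// Planning is cheap: one small struct per chunk, no decoder or reader built yet.
pub fn plan_file(shared: Arc<Shared>, chunks: Vec<Vec<usize>>) -> Vec<LazyMorsel> {
    chunks
        .into_iter()
        .map(|row_groups| LazyMorsel { row_groups, shared: Arc::clone(&shared) })
        .collect()
}
```

After `plan_file` returns, `decoders_built` is still zero; it only increments as individual morsels are turned into streams.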

`FilePruner` is `!Clone` and drives whole-file early-stop via
`EarlyStoppingStream`, so it still lives on chunk 0's morsel only.
The warm `async_file_reader` from metadata / page-index / bloom-filter
load is dropped at the end of `build_stream` — every morsel mints a
fresh reader via the factory at `into_stream` time. For both built-in
factories (`DefaultParquetFileReaderFactory`,
`CachedParquetFileReaderFactory`) the "warm cache" benefit of reusing
a reader is negligible because the underlying `Arc<dyn ObjectStore>` /
`Arc<dyn FileMetadataCache>` is already shared across readers, so the
simplification is free.

Local ClickBench-partitioned, 10 iterations, pushdown=off (M-series):

| Query | main (ms) | eager, before (ms) | lazy, this commit (ms) |
|-------|----------:|-------------------:|-----------------------:|
| Q14   |       325 |                335 |                    313 |
| Q15   |       309 |                358 |                    302 |
| Q16   |       911 |               1049 |                    786 |
| Q24   |        48 |                 55 |                     56 |
| Q26   |        41 |                 45 |                     45 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4290624868-1690-457rs 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing HEAD (af2a26f) to main diff using: clickbench_partitioned
Results will be posted here when complete


@adriangbot

🤖 Benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4290625502-1691-pxdwg 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux

Comparing HEAD (af2a26f) to main diff using: clickbench_partitioned
Results will be posted here when complete



@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃ filter-pushdown-with-row-group-morsels ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.21 / 4.75 ±6.93 / 18.61 ms │           1.22 / 4.55 ±6.50 / 17.54 ms │     no change │
│ QQuery 1  │        12.99 / 13.35 ±0.22 / 13.66 ms │         14.91 / 15.47 ±0.35 / 15.92 ms │  1.16x slower │
│ QQuery 2  │        38.52 / 39.10 ±0.63 / 40.31 ms │         38.82 / 39.40 ±0.42 / 39.90 ms │     no change │
│ QQuery 3  │        32.19 / 33.15 ±0.94 / 34.93 ms │         34.27 / 34.82 ±0.42 / 35.46 ms │  1.05x slower │
│ QQuery 4  │     249.87 / 256.60 ±4.70 / 262.25 ms │      274.80 / 281.43 ±6.46 / 291.98 ms │  1.10x slower │
│ QQuery 5  │     294.50 / 296.77 ±1.79 / 299.89 ms │      311.86 / 321.94 ±7.94 / 331.95 ms │  1.08x slower │
│ QQuery 6  │           5.92 / 6.37 ±0.27 / 6.75 ms │            5.52 / 6.21 ±0.45 / 6.86 ms │     no change │
│ QQuery 7  │        16.81 / 16.98 ±0.12 / 17.10 ms │         16.97 / 17.26 ±0.21 / 17.59 ms │     no change │
│ QQuery 8  │     340.08 / 344.42 ±2.27 / 346.49 ms │      390.62 / 400.34 ±8.92 / 413.61 ms │  1.16x slower │
│ QQuery 9  │     523.08 / 536.11 ±6.83 / 542.70 ms │      550.68 / 559.28 ±8.42 / 573.66 ms │     no change │
│ QQuery 10 │     100.64 / 101.55 ±0.72 / 102.76 ms │      103.66 / 106.26 ±4.12 / 114.38 ms │     no change │
│ QQuery 11 │     111.11 / 114.12 ±2.59 / 118.81 ms │      114.94 / 116.71 ±1.29 / 118.22 ms │     no change │
│ QQuery 12 │     320.23 / 326.90 ±6.20 / 338.19 ms │      302.34 / 314.17 ±6.67 / 320.93 ms │     no change │
│ QQuery 13 │     445.11 / 454.00 ±6.66 / 463.24 ms │     484.52 / 505.03 ±21.82 / 546.40 ms │  1.11x slower │
│ QQuery 14 │     332.84 / 336.63 ±5.81 / 348.20 ms │      356.39 / 362.17 ±3.10 / 365.32 ms │  1.08x slower │
│ QQuery 15 │     298.53 / 307.33 ±5.99 / 316.04 ms │     351.05 / 369.63 ±24.07 / 416.28 ms │  1.20x slower │
│ QQuery 16 │     646.27 / 657.15 ±6.69 / 667.35 ms │     712.39 / 730.43 ±13.58 / 749.00 ms │  1.11x slower │
│ QQuery 17 │     649.83 / 654.65 ±4.88 / 662.50 ms │      722.65 / 729.05 ±5.15 / 734.53 ms │  1.11x slower │
│ QQuery 18 │ 1309.69 / 1333.06 ±22.98 / 1365.67 ms │  1413.09 / 1518.09 ±62.82 / 1604.56 ms │  1.14x slower │
│ QQuery 19 │        31.55 / 32.17 ±0.53 / 32.95 ms │         32.76 / 36.85 ±6.78 / 50.26 ms │  1.15x slower │
│ QQuery 20 │     523.06 / 534.32 ±8.23 / 547.72 ms │      530.53 / 534.93 ±4.92 / 542.21 ms │     no change │
│ QQuery 21 │    577.24 / 590.01 ±10.65 / 604.43 ms │      604.80 / 614.38 ±8.17 / 629.11 ms │     no change │
│ QQuery 22 │     945.92 / 957.49 ±7.37 / 966.35 ms │      822.05 / 836.06 ±8.54 / 847.07 ms │ +1.15x faster │
│ QQuery 23 │     115.26 / 121.56 ±6.29 / 133.39 ms │     302.06 / 332.95 ±22.15 / 357.86 ms │  2.74x slower │
│ QQuery 24 │        42.13 / 48.92 ±9.08 / 66.75 ms │         36.46 / 38.81 ±1.42 / 40.36 ms │ +1.26x faster │
│ QQuery 25 │     150.11 / 153.02 ±2.47 / 157.52 ms │      123.52 / 125.40 ±1.76 / 127.64 ms │ +1.22x faster │
│ QQuery 26 │        62.89 / 68.03 ±4.70 / 73.71 ms │         61.88 / 64.96 ±1.86 / 67.00 ms │     no change │
│ QQuery 27 │     731.61 / 736.25 ±5.21 / 744.91 ms │      654.35 / 657.61 ±4.27 / 665.88 ms │ +1.12x faster │
│ QQuery 28 │ 3097.90 / 3136.43 ±21.35 / 3158.37 ms │  3065.09 / 3119.23 ±34.64 / 3157.66 ms │     no change │
│ QQuery 29 │        44.16 / 44.62 ±0.47 / 45.37 ms │         47.29 / 52.38 ±4.19 / 58.37 ms │  1.17x slower │
│ QQuery 30 │     343.29 / 351.25 ±8.98 / 367.69 ms │      361.55 / 367.97 ±5.57 / 378.28 ms │     no change │
│ QQuery 31 │     335.17 / 339.10 ±3.77 / 346.15 ms │      358.70 / 365.53 ±4.23 / 370.54 ms │  1.08x slower │
│ QQuery 32 │ 1111.43 / 1132.05 ±11.54 / 1145.20 ms │  1116.99 / 1131.70 ±18.58 / 1168.24 ms │     no change │
│ QQuery 33 │ 1561.02 / 1595.05 ±23.12 / 1629.97 ms │  1579.94 / 1597.85 ±12.55 / 1613.49 ms │     no change │
│ QQuery 34 │ 1581.01 / 1597.14 ±16.20 / 1622.41 ms │  1613.04 / 1659.22 ±32.74 / 1695.30 ms │     no change │
│ QQuery 35 │    326.30 / 347.99 ±28.28 / 403.65 ms │      363.61 / 375.91 ±7.40 / 383.64 ms │  1.08x slower │
│ QQuery 36 │        66.81 / 71.03 ±3.08 / 74.24 ms │         61.93 / 69.56 ±3.97 / 72.88 ms │     no change │
│ QQuery 37 │        38.23 / 47.59 ±7.84 / 56.00 ms │         36.01 / 37.93 ±1.13 / 39.19 ms │ +1.25x faster │
│ QQuery 38 │        36.98 / 42.40 ±5.32 / 50.98 ms │         36.82 / 40.95 ±2.84 / 44.34 ms │     no change │
│ QQuery 39 │     124.67 / 131.61 ±3.65 / 134.98 ms │      132.75 / 138.62 ±5.70 / 149.19 ms │  1.05x slower │
│ QQuery 40 │        19.22 / 22.60 ±3.08 / 26.38 ms │         20.20 / 22.33 ±1.78 / 25.10 ms │     no change │
│ QQuery 41 │        18.06 / 19.63 ±1.84 / 23.23 ms │         18.50 / 19.53 ±1.62 / 22.74 ms │     no change │
│ QQuery 42 │        15.37 / 19.99 ±4.83 / 26.81 ms │         16.45 / 16.83 ±0.24 / 17.10 ms │ +1.19x faster │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 17973.23ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 18689.74ms │
│ Average Time (HEAD)                                   │   417.98ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   434.65ms │
│ Queries Faster                                        │          6 │
│ Queries Slower                                        │         17 │
│ Queries with No Change                                │         20 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘

Resource Usage

clickbench_partitioned — base (merge-base)

Metric Value
Wall time 95.0s
Peak memory 28.5 GiB
Avg memory 23.1 GiB
CPU user 955.8s
CPU sys 57.7s
Peak spill 0 B

clickbench_partitioned — branch

Metric Value
Wall time 95.0s
Peak memory 38.8 GiB
Avg memory 30.7 GiB
CPU user 963.5s
CPU sys 90.5s
Peak spill 0 B


@adriangbot

🤖 Benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

Comparing HEAD and filter-pushdown-with-row-group-morsels
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query     ┃                                  HEAD ┃ filter-pushdown-with-row-group-morsels ┃         Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 0  │          1.20 / 4.53 ±6.64 / 17.81 ms │           1.19 / 4.50 ±6.48 / 17.46 ms │      no change │
│ QQuery 1  │        12.65 / 13.07 ±0.22 / 13.29 ms │         14.39 / 15.23 ±0.47 / 15.78 ms │   1.17x slower │
│ QQuery 2  │        37.27 / 37.65 ±0.32 / 38.09 ms │         39.05 / 39.32 ±0.17 / 39.53 ms │      no change │
│ QQuery 3  │        32.72 / 33.33 ±0.55 / 34.26 ms │         34.52 / 34.93 ±0.25 / 35.22 ms │      no change │
│ QQuery 4  │     245.24 / 257.80 ±9.85 / 269.13 ms │      274.06 / 280.27 ±3.85 / 284.25 ms │   1.09x slower │
│ QQuery 5  │     280.55 / 290.18 ±8.07 / 301.95 ms │      311.31 / 320.03 ±5.23 / 327.08 ms │   1.10x slower │
│ QQuery 6  │           7.13 / 7.39 ±0.27 / 7.85 ms │            5.67 / 6.32 ±0.66 / 7.35 ms │  +1.17x faster │
│ QQuery 7  │        14.96 / 15.27 ±0.26 / 15.63 ms │         17.01 / 17.38 ±0.30 / 17.84 ms │   1.14x slower │
│ QQuery 8  │     338.96 / 351.23 ±7.00 / 358.67 ms │      404.00 / 409.89 ±9.03 / 427.71 ms │   1.17x slower │
│ QQuery 9  │    533.54 / 547.06 ±11.11 / 565.97 ms │     514.44 / 523.18 ±12.48 / 547.68 ms │      no change │
│ QQuery 10 │        75.26 / 76.44 ±1.33 / 78.94 ms │      103.42 / 107.95 ±2.40 / 109.84 ms │   1.41x slower │
│ QQuery 11 │        86.98 / 90.85 ±1.97 / 92.30 ms │      118.85 / 119.59 ±0.63 / 120.49 ms │   1.32x slower │
│ QQuery 12 │     276.80 / 286.15 ±9.95 / 299.28 ms │     305.75 / 328.33 ±14.74 / 351.87 ms │   1.15x slower │
│ QQuery 13 │     415.54 / 423.98 ±9.74 / 436.19 ms │     477.76 / 507.66 ±21.40 / 534.11 ms │   1.20x slower │
│ QQuery 14 │    291.35 / 308.91 ±10.04 / 321.77 ms │      346.27 / 356.43 ±9.18 / 369.14 ms │   1.15x slower │
│ QQuery 15 │    290.96 / 300.54 ±10.71 / 321.36 ms │     329.89 / 357.60 ±22.37 / 394.49 ms │   1.19x slower │
│ QQuery 16 │     649.42 / 653.43 ±4.08 / 659.70 ms │     684.40 / 705.74 ±16.13 / 733.45 ms │   1.08x slower │
│ QQuery 17 │    646.73 / 664.33 ±11.90 / 679.53 ms │     687.57 / 718.21 ±18.49 / 737.26 ms │   1.08x slower │
│ QQuery 18 │ 1289.41 / 1327.20 ±20.67 / 1348.71 ms │  1385.45 / 1495.64 ±58.98 / 1556.47 ms │   1.13x slower │
│ QQuery 19 │        29.00 / 30.86 ±2.77 / 36.31 ms │        36.91 / 47.92 ±14.27 / 75.74 ms │   1.55x slower │
│ QQuery 20 │    524.63 / 534.00 ±11.88 / 557.11 ms │     524.36 / 538.84 ±14.72 / 565.62 ms │      no change │
│ QQuery 21 │    599.18 / 613.02 ±11.91 / 633.48 ms │     592.71 / 605.61 ±10.20 / 623.79 ms │      no change │
│ QQuery 22 │ 1063.03 / 1081.45 ±10.20 / 1093.53 ms │     793.15 / 806.57 ±10.56 / 818.45 ms │  +1.34x faster │
│ QQuery 23 │ 3426.11 / 3480.51 ±41.17 / 3520.68 ms │     293.65 / 316.50 ±23.49 / 348.66 ms │ +11.00x faster │
│ QQuery 24 │        42.05 / 49.43 ±8.74 / 65.26 ms │         36.91 / 38.69 ±1.26 / 40.48 ms │  +1.28x faster │
│ QQuery 25 │     113.65 / 115.94 ±1.80 / 118.42 ms │      120.53 / 122.31 ±0.93 / 123.08 ms │   1.06x slower │
│ QQuery 26 │        42.24 / 44.87 ±4.17 / 53.16 ms │         65.20 / 65.84 ±0.54 / 66.78 ms │   1.47x slower │
│ QQuery 27 │     677.37 / 682.86 ±3.59 / 688.23 ms │      645.06 / 652.12 ±5.01 / 660.03 ms │      no change │
│ QQuery 28 │ 3061.37 / 3073.57 ±11.61 / 3088.45 ms │  3046.21 / 3085.02 ±32.45 / 3128.21 ms │      no change │
│ QQuery 29 │        42.34 / 45.93 ±4.71 / 54.46 ms │         47.89 / 51.28 ±4.42 / 59.63 ms │   1.12x slower │
│ QQuery 30 │     315.92 / 322.78 ±9.31 / 341.17 ms │     336.47 / 356.61 ±12.98 / 373.97 ms │   1.10x slower │
│ QQuery 31 │    305.30 / 319.38 ±10.58 / 332.77 ms │      332.38 / 342.30 ±5.21 / 346.86 ms │   1.07x slower │
│ QQuery 32 │ 1052.67 / 1075.35 ±26.91 / 1127.00 ms │  1219.08 / 1284.34 ±70.03 / 1417.93 ms │   1.19x slower │
│ QQuery 33 │ 1502.74 / 1549.27 ±53.72 / 1648.73 ms │  1567.91 / 1650.79 ±64.43 / 1754.33 ms │   1.07x slower │
│ QQuery 34 │ 1468.32 / 1527.95 ±36.24 / 1563.54 ms │  1564.53 / 1627.81 ±58.42 / 1717.89 ms │   1.07x slower │
│ QQuery 35 │    295.18 / 332.21 ±28.32 / 380.28 ms │     310.44 / 331.59 ±13.11 / 349.88 ms │      no change │
│ QQuery 36 │        65.25 / 72.16 ±6.63 / 83.29 ms │         61.41 / 65.46 ±5.36 / 75.68 ms │  +1.10x faster │
│ QQuery 37 │        37.55 / 41.91 ±5.00 / 49.03 ms │         35.11 / 36.40 ±0.80 / 37.29 ms │  +1.15x faster │
│ QQuery 38 │        41.30 / 43.58 ±1.57 / 45.27 ms │         36.83 / 39.39 ±2.40 / 43.18 ms │  +1.11x faster │
│ QQuery 39 │     135.02 / 145.34 ±5.43 / 150.57 ms │      115.27 / 124.28 ±7.86 / 137.16 ms │  +1.17x faster │
│ QQuery 40 │        15.64 / 18.50 ±4.21 / 26.82 ms │         19.56 / 21.38 ±1.49 / 23.82 ms │   1.16x slower │
│ QQuery 41 │        14.87 / 14.95 ±0.07 / 15.07 ms │         16.53 / 18.45 ±1.70 / 20.69 ms │   1.23x slower │
│ QQuery 42 │        14.56 / 14.76 ±0.19 / 15.00 ms │         14.37 / 15.00 ±0.42 / 15.49 ms │      no change │
└───────────┴───────────────────────────────────────┴────────────────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                     │ 20919.95ms │
│ Total Time (filter-pushdown-with-row-group-morsels)   │ 18592.68ms │
│ Average Time (HEAD)                                   │   486.51ms │
│ Average Time (filter-pushdown-with-row-group-morsels) │   432.39ms │
│ Queries Faster                                        │          8 │
│ Queries Slower                                        │         25 │
│ Queries with No Change                                │         10 │
│ Queries with Failure                                  │          0 │
└───────────────────────────────────────────────────────┴────────────┘

Resource Usage

clickbench_partitioned — base (merge-base)

Metric Value
Wall time 105.0s
Peak memory 30.6 GiB
Avg memory 23.1 GiB
CPU user 1106.9s
CPU sys 69.0s
Peak spill 0 B

clickbench_partitioned — branch

Metric Value
Wall time 95.0s
Peak memory 34.6 GiB
Avg memory 27.0 GiB
CPU user 939.3s
CPU sys 100.5s
Peak spill 0 B


@adriangb
Owner Author

Local vs GKE are disagreeing — asking for an isolation run to nail it down

The headline from this run:

  • Q23: 3480 → 316 ms (11× faster) — the hash-join dynamic filter win; total time across the 43 queries drops 11 % (20.9 s → 18.6 s) on the strength of it.
  • 25 queries slower, 8 faster — so although the aggregate improves, per-query the regressions are real.

My 10-iter local run on M-series showed the lazy-morsel branch (without any adaptive code) matching or beating apache/main on every query I tested, including Q15/Q16/Q17/Q24 at pushdown=off. GKE's 16-core Neoverse-V2 is disagreeing with that.

Looking at which queries regress here:

| Category | Example | GKE delta |
|---|---|---|
| No filter (morsel split alone should matter) | Q15, Q16, Q17, Q32, Q33, Q34 | 7–19 % slower |
| Filter ⊆ projection | Q10, Q11, Q13, Q14, Q26 | 15–47 % slower |
| Highly selective filter | Q40 | 16 % slower |
| Hash-join dynamic filter | Q23 | 91 % faster |

The no-filter regressions implicate either the morsel-split fan-out or the lazy-morsel wrapper itself, not the adaptive tracker: there's no predicate to adapt on those. That contradicts my local result, which means either (a) GKE's cache/NUMA/allocator behaviour exposes an overhead my M-series hides, or (b) the local run was noisier than I gave it credit for and GKE is closer to the truth.

To isolate, could you kick off a clickbench run on PR #10 with the same config? PR #10 is now exactly main + row-group-morsel-split + ParquetLazyMorsel — no adaptive code at all. If PR #10 on GKE also regresses on the no-filter queries, the fix needs to go in the morsel/lazy layer. If PR #10 is clean on GKE, the regression is in the adaptive layer and I can dig there.

Trigger:

run benchmark clickbench_partitioned
baseline:
    ref: main
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
       DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false
changed:
    ref: HEAD
    env:
       DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS: false
       DATAFUSION_EXECUTION_PARQUET_REORDER_FILTERS: false

(posted on PR #10 directly, no adaptive env to set).

Once I know which layer owns the no-filter regression on GKE, the follow-up is clear:

  • If lazy-morsel layer — the Arc<LazyMorselShared> path may be adding allocator pressure that M-series's unified L3 hides; I can profile and see if e.g. ProjectionExprs::clone + try_map_exprs per-morsel is measurable on Neoverse.
  • If adaptive layer — the existing fast-path I just added (skip bookkeeping when post_scan_conjuncts is empty) isn't enough and we need to also skip shared.projection.clone() / try_map_exprs when the stream schema matches the projection schema.

🤖 Generated with Claude Code

@adriangb
Owner Author

Local full 43-query ClickBench, 5 iterations, 4 configs (M-series, default morsel budgets)

             main+off     main+on     br+off     br+on    br/m-off    br/m-on
Total        26375 ms    20430 ms   29531 ms  23739 ms     1.12x      1.16x

Aggregate: branch is 12 % slower (pushdown=off) / 16 % slower (pushdown=on) than apache/main.

Surprise: Q23 is not the win the GKE report suggested

The recent GKE "11× faster on Q23" was comparing baseline=main+off (3480 ms) to branch+on (316 ms). Apples-to-apples (br+on vs main+on), Q23 is 2.28× slower than static pushdown on my laptop:

| Query | main+off | main+on | br+off | br+on |
|---|---|---|---|---|
| Q23 | 3919.6 ms | 119.3 ms | 4259.3 ms | 272.1 ms |

Static RowFilter does 119 ms. Our adaptive does 272 ms. So yes, the adaptive system gives most of the arrow-rs speedup — but leaves 150 ms on the table that static pushdown doesn't. The GKE baseline config (pushdown=off) was the wrong comparison target for the "headline win" framing.

Biggest regressions at pushdown=on (br+on / main+on)

| Q | ratio | main+on (ms) | br+on (ms) | shape |
|---|---|---|---|---|
| Q23 | 2.28× | 119 | 272 | SELECT * WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10 (selective filter, wide projection) |
| Q34 | 2.12× | 1488 | 3150 | GROUP BY 1, "URL" (no filter) |
| Q32 | 1.58× | 2182 | 3453 | GROUP BY "WatchID", "ClientIP" (no filter) |
| Q3 | 1.36× | 35 | 48 | SELECT AVG("UserID") (no filter) |
| Q4 | 1.17× | 303 | 354 | SELECT COUNT(DISTINCT UserID) (no filter) |
| Q26 | 1.15× | 57 | 65 | WHERE SearchPhrase <> '' ORDER BY EventTime LIMIT 10 |
| Q21, Q9, Q20, Q1, Q8 | 1.13-1.15× | | | mostly no-filter or full-row-group aggregations |

Biggest wins at pushdown=on

| Q | ratio | main+on (ms) | br+on (ms) | shape |
|---|---|---|---|---|
| Q25 | 0.80× | 146 | 117 | WHERE MinuteOfDay = 43852 |
| Q24 | 0.82× | 39 | 32 | WHERE SearchPhrase = '...' specific string match |
| Q37 | 0.84× | 30 | 25 | multi-predicate + GROUP BY + HAVING |
| Q22 | 0.89× | 988 | 875 | complex WHERE ... LIKE |

What this tells us

  • The no-filter regressions (Q3, Q32, Q33, Q34) are bigger than the filter ones. Q34 at 2× and Q32 at 1.58× aren't about filter adaptation — there's no predicate to adapt on. They're morsel-split + lazy-morsel overhead on wide-row-group full scans, visible on both pushdown=off and pushdown=on.
  • Q23 underperforming static pushdown by 2.28× says our adaptive path isn't quite getting out of the way when the static choice is unambiguously correct. Likely the per-morsel partition_filters + build_row_filter work on a selective LIKE filter over a wide projection is material.
  • The wins are real but smaller — Q24/Q25/Q37/Q22 gain 10-20% from adaptive placement picking a better point than static.

Next steps

I think we should separate the two concerns:

  1. Morsel-split overhead on no-filter GROUP BYs (Q3, Q32, Q33, Q34). Two things I want to try: (a) profile ParquetLazyMorsel::build_stream_now on Q33 with samply to see where the 1.5× goes — suspect build_projection_read_plan walking a wide schema twice; (b) test whether increasing morsel_max_rows to the typical row-group size (~500k) eliminates the penalty by matching main's "1 decoder per file" behaviour.
  2. Adaptive overhead on Q23-shape queries. Want to profile apply_post_scan_filters_with_stats vs. arrow-rs RowFilter::evaluate on the same LIKE predicate + selective selection mask to see whether the delta is placement decision or per-batch path cost.

Want me to proceed with (1) first? It's the larger absolute regression and its cause is narrower (no adaptive involved).

🤖 Generated with Claude Code

@adriangb
Owner Author

PR #10 isolation — my morsel/lazy attribution was wrong

Ran full ClickBench-partitioned + a 10-iter drill-down comparing main, PR #10 (morsel-split + lazy, no adaptive), and this PR (adaptive). Headline:

Aggregate (full 43 queries, 5 iter)

| | main+off | main+on | pr10+off | pr10+on | p/m+off | p/m+on |
|---|---|---|---|---|---|---|
| Total ms | 25293 | 19994 | 24888 | 20515 | 0.98× | 1.03× |

PR #10 is essentially at parity with main — 2 % faster at pushdown=off, 3 % slower at pushdown=on. So the lazy-morsel refactor is clean on this hardware.

10-iter drill-down on the queries I flagged yesterday

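| Q | main+on (ms) | pr10+on (ms) | branch+on (ms) | branch/main | pr10/main |
|---|---|---|---|---|---|
| Q23 | 117 | 112 | 239 | 2.05× | 0.96× |
| Q33 | 1724 | 1743 | 2297 | 1.33× | 1.01× |
| Q34 | 1488 | 1507 | 1634 | 1.10× | 1.01× |
| Q3 | 29 | 31 | 32 | 1.10× | 1.05× |
| Q26 | 54 | 54 | 54 | 1.00× | 1.00× |
| Q32 | 2383 | 3371 | 2551 | 1.07× | 1.42× |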

Three things this tells us:

  1. Yesterday's attribution was wrong. I said Q3/Q32/Q33/Q34's regressions on this PR came from morsel-split + lazy. They don't: PR #10 has the same morsel-split + lazy stack without the adaptive code, and Q33/Q34/Q3/Q26 all match main to within 1–5 %. Q23 on PR #10 is also clean (112 ms vs main's 117 ms). The regressions on this PR are coming from the adaptive layer itself.

  2. Adaptive is hurting even when no filter is present. Q33 and Q34 have no WHERE clause at all, but this PR regresses them 33 % and 10 %. My fast-path commit (af2a26ff15) was supposed to take the no-filter, empty-post-scan path. Either it isn't firing or it isn't tight enough: the difference between PR #10 and this PR on a query with pushdown=off and no predicate shouldn't exist, but it does.

  3. Q32 is an outlier in the other direction. On Q32 at pushdown=on, PR #10 is 1.42× slower than main (3371 vs 2383) while this PR is only 1.07× slower (2551). Small sample, noisy, but it's the only query where adaptive beats no-adaptive at the same morsel infrastructure, so it's worth keeping an eye on.

Where to look next

Given that PR #10 is at parity with main, the clean recommendation is: merge PR #10 as-is, and pursue the adaptive layer on this PR as a separate effort that has to prove itself against main+on, not against main+off.

On the adaptive layer, the specific investigation is: on a query with pushdown_filters=false (so partition_filters isn't called and post_scan_conjuncts is empty), what is ParquetLazyMorsel::build_stream_now doing differently from PR #10's version that costs 33 % on Q33? Candidates:

  • LazyMorselShared carries extra fields (predicate_conjuncts, selectivity_tracker, projection_compressed_bytes, total_rows). Arc'd so in-theory free, but struct size grew and maybe we're seeing allocation effects.
  • total_rows and projection_compressed_bytes are always computed in build_stream even when the adaptive path won't use them. Moving these inside the if has_adaptive_work branch would skip them at pushdown=off.
  • The per-morsel path still computes post_scan_other_bytes_per_row and read_plan_exprs via conditional branches — want to confirm the no-conjuncts branch really is taking the shortest possible path and not doing any work pr10's version doesn't.

Happy to profile build_stream_now with samply on Q33 next to pin this down, if you want to keep chasing adaptive. Otherwise the lazy-morsel PR #10 is a clean standalone improvement ready to land.

🤖 Generated with Claude Code

@adriangb
Owner Author

Investigation summary + a rebase question

What the profile showed

Instrumented build_stream_now and transition::Data arm on both PR #10 and this branch, ran Q33 (no filter, URL GROUP BY) at 20 iterations:

| Region | PR #10 | PR #9 | Δ |
|---|---|---|---|
| build_stream_now total | 66 ms | 216 ms | +150 ms (3.3×) |
| - create_reader | 39 ms | 154 ms | +115 ms (3.9×) |
| - decoder_builder.build() | 2.1 ms | 22 ms | +20 ms (10×) |
| transition io (all threads) | 16.8 s CPU | 24.0 s CPU | +7.2 s (+43%) |
| transition decode (all threads) | 93.7 s CPU | 100.3 s CPU | +6.6 s (+7%) |

The multi-thread numbers were the big hint: at target_partitions=1, PR #9 was 8594 ms on Q33 vs PR #10's 8660 ms — not slower, slightly faster. So the regression is contention, not per-call cost.

Where the contention is coming from

Not conclusively verified, but the strong suspicion is this combination:

  1. LazyMorselShared carries 4 extra fields that PR #10 doesn't (predicate_conjuncts, selectivity_tracker, projection_compressed_bytes, total_rows), making its Arc'd allocation ~48 bytes bigger.
  2. build_stream unconditionally walks every projection column × row group to compute projection_compressed_bytes and sums total_rows for every file, even when there's no predicate to adapt.
  3. At high thread counts, the extra allocations + precomputation work interact with the global allocator and parquet-rs's internal state in a way that slows create_reader and decoder_builder.build(): operations that look identical to PR #10 from the outside but take 4-10× longer under parallel load.
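
The size claim in item 1 can be checked in isolation with `std::mem::size_of`. A minimal sketch, assuming hypothetical stand-in types: `Tracker`, `AdaptiveState`, `SharedInline` and `SharedSplit` are illustrative names, and the field types are guesses rather than the real `LazyMorselShared` definition. It only shows how an inline layout compares to one that keeps the adaptive fields behind an optional pointer; it says nothing about whether the size difference is the actual cause.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical stand-in for the selectivity tracker (not the real type).
#[allow(dead_code)]
pub struct Tracker {
    samples: u64,
}

/// Hypothetical adaptive-only state, named after the four extra fields.
#[allow(dead_code)]
pub struct AdaptiveState {
    predicate_conjuncts: Vec<String>,
    selectivity_tracker: Arc<Tracker>,
    projection_compressed_bytes: usize,
    total_rows: usize,
}

/// Adaptive fields stored inline: every shared allocation carries them.
#[allow(dead_code)]
pub struct SharedInline {
    projection: Vec<usize>,
    batch_size: usize,
    adaptive: AdaptiveState,
}

/// Adaptive fields behind an optional pointer: queries without adaptive
/// work only pay for one pointer-sized slot.
#[allow(dead_code)]
pub struct SharedSplit {
    projection: Vec<usize>,
    batch_size: usize,
    adaptive: Option<Box<AdaptiveState>>,
}

pub fn inline_size() -> usize {
    size_of::<SharedInline>()
}

pub fn split_size() -> usize {
    size_of::<SharedSplit>()
}

/// Size of the optional slot itself; `Option<Box<T>>` is pointer-sized.
pub fn adaptive_slot_size() -> usize {
    size_of::<Option<Box<AdaptiveState>>>()
}
```

Printing `inline_size()` and `split_size()` for the real structs on both branches would confirm or refute the ~48 byte estimate before spending time on allocator profiling.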

Ruled out:

  • The Option<Box<AdaptiveStreamState>> field on PushDecoderStreamState itself. Removed it entirely in an ablation build; gap unchanged.
  • Per-batch transition overhead. Per-call cost is ~470 ns vs PR #10's ~410 ns, so the per-batch overhead is tiny.
  • The tracker data structures. Microbench already showed those are ns-scale.

What helped so far

Commit 5856cca99b gates the file-level projection_compressed_bytes / total_rows precomputation on has_adaptive_work (pushdown on + non-empty predicate_conjuncts). Local 20-iter:

| Query | main (ms) | before gate (ms) | after gate (ms) |
|---|---|---|---|
| Q3 | 32 | 48 | 29.5 |
| Q26 | 56 | 65 | 53.5 |
| Q33 | 1487 | 2297 | 1818 |
| Q34 | 1520 | 1634 | 1799 |
| Q23 | 111 | 272 | 225 |

Q3/Q26 closed fully. Q33 improved from ~33 % slower to ~22 % slower, but Q34 moved the other way, from ~10 % to ~18 % slower. Q23 is still ~2× slower than main, so filter queries remain open.
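
For reference, the shape of the gate is roughly the following. This is a minimal sketch, not the code from commit 5856cca99b: `RowGroupMeta`, `AdaptiveFileStats` and `file_stats` are hypothetical names, and the real implementation reads these values from the Parquet footer metadata. The only parts taken from the description above are the condition (pushdown on + non-empty predicate_conjuncts) and the two values being skipped.

```rust
/// Hypothetical per-row-group metadata (illustrative, not the parquet-rs type).
pub struct RowGroupMeta {
    pub num_rows: u64,
    /// Compressed size in bytes of each column chunk, indexed by column.
    pub column_compressed_bytes: Vec<u64>,
}

/// File-level statistics that only the adaptive path consumes.
#[derive(Debug, PartialEq)]
pub struct AdaptiveFileStats {
    pub projection_compressed_bytes: u64,
    pub total_rows: u64,
}

/// The gate: pushdown enabled and at least one predicate conjunct.
pub fn has_adaptive_work(pushdown_filters: bool, predicate_conjuncts: &[String]) -> bool {
    pushdown_filters && !predicate_conjuncts.is_empty()
}

/// Returns `None` without touching the row groups when there is nothing to
/// adapt, so no-filter queries skip the projection x row-group walk.
pub fn file_stats(
    pushdown_filters: bool,
    predicate_conjuncts: &[String],
    projection: &[usize],
    row_groups: &[RowGroupMeta],
) -> Option<AdaptiveFileStats> {
    if !has_adaptive_work(pushdown_filters, predicate_conjuncts) {
        return None;
    }
    // Sum the compressed bytes of every projected column in every row group.
    let projection_compressed_bytes: u64 = row_groups
        .iter()
        .map(|rg| {
            projection
                .iter()
                .map(|&c| rg.column_compressed_bytes.get(c).copied().unwrap_or(0))
                .sum::<u64>()
        })
        .sum();
    let total_rows: u64 = row_groups.iter().map(|rg| rg.num_rows).sum();
    Some(AdaptiveFileStats {
        projection_compressed_bytes,
        total_rows,
    })
}
```

Returning an `Option` rather than zeroed values makes it harder for a later code path to read the stats by accident when the gate was not taken.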

Full 43-query aggregate (5 iter, pushdown=on)

| | main | pr10 | pr9_gate |
|---|---|---|---|
| Total | 17159 ms | 17279 ms | 19144 ms |
| ratio/main | | 1.01× | 1.12× |

So this branch is now at 1.12× main aggregate (was 1.16× before). PR #10 is at parity.

What would help next

Want to keep going but the next step is structural and probably needs its own PR:

On the rebase

I attempted to rebase this branch onto PR #10 cleanly, and the diff between PR #10 (8ef53b7ff0) and this branch's HEAD is 48 files / 3893+/1121- lines in datafusion/ alone — a lot of which is drift from apache/main at different points (workflow yml, .slt files, proto regen). A "real" rebase would mostly amount to picking the 3-4 real adaptive-filter commits from this branch on top of PR #10's tip, which is doable but meaningful work (several merge conflicts in opener.rs). Want me to just squash-cherrypick the adaptive delta as one commit on top of morsel-split-lazy, or would you prefer a proper commit-by-commit rebase? The squash is ~10 min of work, the proper rebase is closer to 2 hr.

🤖 Generated with Claude Code

Adds a runtime filter-placement layer on top of the row-group-morsel
split introduced by the parent commit. Each Parquet predicate is
assigned a `FilterId` and flows through a state machine
(`SelectivityTracker`) that moves it between three placements:

- `RowFilter` — evaluated inside the arrow-rs decoder as an
  `ArrowPredicate`, enabling late-materialization savings when the
  filter columns are a small fraction of the projection.
- `PostScan` — evaluated against the decoded wide batch before the
  projector strips it; used when late materialization has little to
  save or when the filter is CPU-heavy.
- `Dropped` — optional filters (hash-join dynamic filters wrapped in
  `OptionalFilterPhysicalExpr`) are skipped mid-stream when their CI
  upper bound on bytes-saved-per-second falls below a minimum.

Initial placement uses a cheap byte-ratio heuristic
(`filter_compressed_bytes / projection_compressed_bytes`); subsequent
placements refine using Welford online stats reported from both the
row-filter path (`DatafusionArrowPredicate::evaluate`) and the
post-scan path (`apply_post_scan_filters_with_stats`). Placement is
re-evaluated per morsel, so stats from the prior morsel's scan feed
into the next morsel's decision.

Config knobs on `TableParquetOptions.execution.parquet`:
- `filter_pushdown_min_bytes_per_sec` (default 100 MB/s)
- `filter_collecting_byte_ratio_threshold` (default 0.20)
- `filter_confidence_z` (default 2.0 ≈ 97.7% one-sided CI)

The `reorder_filters` option is removed; the adaptive tracker
subsumes its role.

Notable trade-offs documented in PR discussion:
- The adaptive layer adds ~10 % aggregate ClickBench overhead vs the
  pure morsel-split base (PR #10). Most of it lives in
  `ParquetLazyMorsel::build_stream_now` under parallel load; single-
  thread shows no regression. Candidate fix is splitting adaptive
  state out of `LazyMorselShared` so non-adaptive queries get the
  same `Arc` allocation shape as PR #10.
- The `OptionalFilterPhysicalExpr` wrapper changes plan display
  output (`DynamicFilter [...]` → `Optional(DynamicFilter [...])`);
  several sqllogictest expected outputs and snapshot tests were
  updated accordingly.
- A selectivity-tracker microbench was added under
  `benches/selectivity_tracker.rs` so future iterations on the
  tracker can be measured independently of full ClickBench.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
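
The placement logic described in the commit message above can be sketched as follows. This is a simplified illustration under stated assumptions, not the `SelectivityTracker` implementation: `ThroughputStats`, `initial_placement`, and `next_placement` are hypothetical names, the confidence interval is taken as mean ± z × standard error, and demoting a mandatory filter to `PostScan` when the upper bound is low is a guess at the behaviour. The three parameters correspond to `filter_collecting_byte_ratio_threshold`, `filter_confidence_z`, and `filter_pushdown_min_bytes_per_sec`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Placement {
    RowFilter,
    PostScan,
    Dropped,
}

/// Welford's online mean/variance over bytes-saved-per-second samples.
#[derive(Default)]
struct ThroughputStats {
    n: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl ThroughputStats {
    fn update(&mut self, sample: f64) {
        self.n += 1;
        let delta = sample - self.mean;
        self.mean += delta / self.n as f64;
        self.m2 += delta * (sample - self.mean);
    }

    /// (lower, upper) bounds on the mean; `None` until two samples exist.
    fn ci(&self, z: f64) -> Option<(f64, f64)> {
        if self.n < 2 {
            return None;
        }
        let variance = self.m2 / (self.n - 1) as f64;
        let stderr = (variance / self.n as f64).sqrt();
        Some((self.mean - z * stderr, self.mean + z * stderr))
    }
}

/// Cheap first guess: a filter that reads a small share of the projected
/// bytes starts as a row filter, anything else starts post-scan.
fn initial_placement(filter_bytes: u64, projection_bytes: u64, ratio_threshold: f64) -> Placement {
    let ratio = filter_bytes as f64 / projection_bytes.max(1) as f64;
    if ratio <= ratio_threshold {
        Placement::RowFilter
    } else {
        Placement::PostScan
    }
}

/// Re-evaluated per morsel using the stats gathered so far.
fn next_placement(
    current: Placement,
    optional: bool,
    stats: &ThroughputStats,
    z: f64,
    min_bytes_per_sec: f64,
) -> Placement {
    if current == Placement::Dropped {
        return current; // assumed: a dropped filter stays dropped
    }
    match stats.ci(z) {
        // Confidently below the minimum: drop if optional, else demote.
        Some((_, upper)) if upper < min_bytes_per_sec => {
            if optional {
                Placement::Dropped
            } else {
                Placement::PostScan
            }
        }
        // Confidently above the minimum: promote to the decoder.
        Some((lower, _)) if lower > min_bytes_per_sec => Placement::RowFilter,
        // Not enough evidence either way: keep the current placement.
        _ => current,
    }
}
```

Using the lower bound for promotion and the upper bound for dropping means a placement only changes once the interval has cleared the threshold, so noisy early samples leave the current placement in place.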
@adriangb
Owner Author

Rebase done — and it uncovered the root cause

Squashed the adaptive delta as a single commit on top of PR #10's morsel-split-lazy:

  • filter-pushdown-rebased = ff805cf67e (morsel) → 8ef53b7ff0 (lazy) → dbcf5ac1e7 (adaptive)
  • 3 commits, 43 files changed in the adaptive commit, all tests pass, clippy clean.

But the interesting thing is what rebasing did to the numbers

Full 43-query ClickBench, pushdown=on, 5 iter:

|          | main  | pr10  | rebased |
|----------|-------|-------|---------|
| Sum (ms) | 16659 | 16418 | 16046   |
| vs main  |       | 0.99× | 0.96×   |

The rebased branch is 4 % faster than apache/main in aggregate, and faster than PR #10 too. Contrast that with the un-rebased branch, which was at 1.12× main in the last run.

Q33 verification across 3 trials (the query that was the stickiest regression on the un-rebased PR):

|            | trial 1 | trial 2 | trial 3 |
|------------|---------|---------|---------|
| un-rebased | 1810 ms | 1989 ms | 1929 ms |
| rebased    | 1459 ms | 1482 ms | 1462 ms |

Rebased Q33 is at parity with PR #10 (1443 ms), ~24 % faster than the un-rebased branch.

opener.rs is byte-for-byte identical between the two branches. The entire difference is what changed in apache/main between the un-rebased base and the current base. The most likely cause is 526f0cb10e perf: Reduce Box and Arc allocation churn during tree rewriting (#21749), which is exactly the kind of thing my earlier investigation pointed at (Q33/Q34 contention on parallel allocator under the larger Arc<LazyMorselShared> allocation). The upstream fix reduces allocation churn in a path that this PR exercises heavily; that's what was amplifying the 48-byte larger LazyMorselShared into a 25 % wall-clock regression.
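
To make the allocation-shape point concrete, here is a small sketch using `std::mem::size_of`. The field layout is invented for illustration (the real `LazyMorselShared` is not shown in this thread); the only detail taken from the discussion is that the adaptive fields add 48 bytes. It compares storing those fields inline against moving them behind an optional pointer, which is one possible reading of the "split adaptive state out of `LazyMorselShared`" candidate fix.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical adaptive-only fields: six 8-byte values, 48 bytes in total.
struct AdaptiveState {
    projection_compressed_bytes: u64,
    total_rows: u64,
    counters: [u64; 4],
}

/// Stand-in for the fields every query needs.
struct Base {
    file_id: u64,
    row_group: u64,
}

/// Adaptive fields stored inline: every `Arc` allocation grows by 48 bytes,
/// whether or not the query has any adaptive work.
struct SharedInline {
    base: Base,
    adaptive: AdaptiveState,
}

/// Adaptive fields behind an optional pointer: non-adaptive queries carry one
/// pointer-sized `None` and never allocate the adaptive block.
struct SharedSplit {
    base: Base,
    adaptive: Option<Box<AdaptiveState>>,
}

/// Build the shared state, allocating adaptive storage only when needed.
fn shared_for(has_adaptive_work: bool) -> Arc<SharedSplit> {
    let adaptive = has_adaptive_work.then(|| {
        Box::new(AdaptiveState {
            projection_compressed_bytes: 0,
            total_rows: 0,
            counters: [0; 4],
        })
    });
    Arc::new(SharedSplit {
        base: Base { file_id: 0, row_group: 0 },
        adaptive,
    })
}

/// Sizes in bytes of the two layouts, for comparison.
fn layout_sizes() -> (usize, usize) {
    (size_of::<SharedInline>(), size_of::<SharedSplit>())
}
```

Note that the split layout still adds one pointer-sized slot, so it narrows the difference from the non-adaptive shape rather than removing it.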

Per-query vs main on the rebased branch (top 10 each, pushdown=on)

Wins:

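| Q   | ratio | main (ms) | rebased (ms) |
|-----|-------|-----------|--------------|
| Q22 | 0.81× | 969       | 781          |
| Q32 | 0.81× | 1425      | 1158         |
| Q18 | 0.86× | 1565      | 1349         |
| Q25 | 0.87× | 135       | 117          |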

Regressions:

| Q        | ratio | main (ms) | rebased (ms) |
|----------|-------|-----------|--------------|
| Q23      | 1.78× | 119       | 212          |
| Q2       | 1.07× | 35        | 37           |
| Q16, Q21 | 1.03× |           |              |

Q23 still regresses ~1.8×. This is the "filter ⊆ projection with a selective LIKE" query I diagnosed earlier. Adaptive's byte-ratio heuristic starts it as PostScan (byte_ratio ≈ 1), and with other_bytes_per_row ≈ 0 the CI lower bound on bytes-saved-per-second never crosses min_bytes_per_sec, so promotion to RowFilter never fires. Static placement on main with pushdown=on routes it to RowFilter directly and wins by roughly 90 ms. This is a real adaptive-heuristic issue, separable from the contention problem.
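
A rough sketch of why Q23 gets stuck, and of what a selectivity-rate promotion path might look like. The benefit formula here is an assumption (rows pruned × bytes of non-filter columns per row, divided by filter evaluation time), it uses point estimates instead of CI bounds for brevity, and `rows_pruned_per_sec` together with its threshold is a hypothetical design, not something that exists on the branch.

```rust
/// Assumed benefit model: bytes of non-filter columns that late
/// materialization avoids decoding, per second of filter evaluation.
fn bytes_saved_per_sec(rows_pruned: u64, other_bytes_per_row: f64, eval_secs: f64) -> f64 {
    rows_pruned as f64 * other_bytes_per_row / eval_secs
}

/// Hypothetical second signal: how quickly the filter discards rows,
/// independent of how wide the remaining projection is.
fn rows_pruned_per_sec(rows_pruned: u64, eval_secs: f64) -> f64 {
    rows_pruned as f64 / eval_secs
}

/// Promote to RowFilter if either signal clears its threshold.
/// Passing `None` for the rate threshold models the current behaviour.
fn should_promote(
    rows_pruned: u64,
    other_bytes_per_row: f64,
    eval_secs: f64,
    min_bytes_per_sec: f64,
    min_rows_pruned_per_sec: Option<f64>,
) -> bool {
    if bytes_saved_per_sec(rows_pruned, other_bytes_per_row, eval_secs) > min_bytes_per_sec {
        return true;
    }
    match min_rows_pruned_per_sec {
        Some(min_rate) => rows_pruned_per_sec(rows_pruned, eval_secs) > min_rate,
        None => false,
    }
}
```

When the filter columns are the whole projection, `other_bytes_per_row` is 0 and the bytes-based estimate is 0 regardless of selectivity, so only a second signal of this kind could trigger promotion.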

Where this leaves us

  • The contention problem is solved by the rebase alone (upstream caught up).
  • Q23 remains the one clear adaptive-specific regression, with a clear conceptual fix (add a selectivity-rate promotion path alongside the bytes-per-sec one). Separate follow-up.
  • PR #9 can now retarget to filter-pushdown-rebased if you want a clean 3-commit PR on top of morsel-split-lazy, or we can keep the existing 15-commit history if you prefer preserving the evolution.

🤖 Generated with Claude Code

@adriangb adriangb force-pushed the filter-pushdown-with-row-group-morsels branch from 5856cca to dbcf5ac Compare April 22, 2026 12:12