Cloudberry Database version
No response
What happened
TPCH SQL16 plan shows degradation of performance . One reason: the not in part isn't parallel.
select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#13'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (40, 23, 12, 20, 48, 24, 21, 29)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size;
optimizer = off
enable_parallel = on
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather Motion 6:1 (slice1; segments: 6) (cost=60277.13..60277.37 rows=20 width=44)
Merge Key: (count(DISTINCT partsupp.ps_suppkey)), part.p_brand, part.p_type, part.p_size
-> Sort (cost=60277.13..60277.13 rows=3 width=44)
Sort Key: (count(DISTINCT partsupp.ps_suppkey)) DESC, part.p_brand, part.p_type, part.p_size
-> GroupAggregate (cost=60277.02..60277.10 rows=3 width=44)
Group Key: part.p_brand, part.p_type, part.p_size
-> Sort (cost=60277.02..60277.03 rows=3 width=40)
Sort Key: part.p_brand, part.p_type, part.p_size
-> Redistribute Motion 6:6 (slice2; segments: 6) (cost=48369.17..60276.99 rows=3 width=40)
Hash Key: part.p_brand, part.p_type, part.p_size
Hash Module: 3
-> Hash Join (cost=48369.17..60276.92 rows=3 width=40)
Hash Cond: (part.p_partkey = partsupp.ps_partkey)
-> Parallel Seq Scan on part (cost=0.00..11718.33 rows=50493 width=40)
Filter: ((p_brand <> 'Brand#13'::bpchar) AND ((p_type)::text !~~ 'MEDIUM POLISHED%'::text) AND (p_size = ANY ('{40,23,12,20,48,24,21,29}'::integer[])))
-> Hash (cost=48368.89..48368.89 rows=23 width=8)
-> Broadcast Motion 3:6 (slice3; segments: 3) (cost=600.93..48368.89 rows=23 width=8)
-> Hash Left Anti Semi (Not-In) Join (cost=600.93..48368.28 rows=8 width=8)
Hash Cond: (partsupp.ps_suppkey = supplier.s_suppkey)
-> Seq Scan on partsupp (cost=0.00..41100.55 rows=2666455 width=8)
-> Hash (cost=600.80..600.80 rows=10 width=4)
-> Broadcast Motion 3:3 (slice4; segments: 3) (cost=0.00..600.80 rows=10 width=4)
-> Seq Scan on supplier (cost=0.00..600.67 rows=3 width=4)
Filter: ((s_comment)::text ~~ '%Customer%Complaints%'::text)
What you think should happen instead
For parallel-aware NOT IN HashJoin, it's hard to tell null values in which batches.
But the parallel-oblivious NOT IN HashJoin is possible as we will build all inner tables first.
Parallel-oblivious Hash Join should be enabled.
How to reproduce
Run TPCH16 SQL or just a NOT IN sql.
Operating System
ubuntu
Anything else
No response
Are you willing to submit PR?
Code of Conduct
Cloudberry Database version
No response
What happened
TPCH SQL16 plan shows degradation of performance . One reason: the not in part isn't parallel.
What you think should happen instead
For parallel-aware NOT IN HashJoin, it's hard to tell null values in which batches.
But the parallel-oblivious NOT IN HashJoin is possible as we will build all inner tables first.
Parallel-oblivious Hash Join should be enabled.
How to reproduce
Run TPCH16 SQL or just a NOT IN sql.
Operating System
ubuntu
Anything else
No response
Are you willing to submit PR?
Code of Conduct