[EPIC] Improve performance of TPC-H queries

### What is the problem the feature request solves?

This epic is for tracking progress on improving performance of Comet with our benchmarks derived from TPC-H.

### Current status (September 2024)

- Comet is 1.6x faster than Spark
- Comet is not as fast as other DataFusion subprojects yet
- All of these DataFusion subprojects are performing similar native execution, which indicates that there is room to improve on Comet's current performance

<img width="643" alt="Screenshot 2024-09-20 at 10 08 15 AM" src="https://github.com/user-attachments/assets/dbffcbce-a4f0-45cb-bbe0-d258e2f19305">

### Features needed to support all queries natively

We do not run all queries fully natively yet due to these missing features:

- [ ] #398 (q17, q19, q20, q21)
- [ ] #457 (q16)
- [ ] #458 (many queries)
- [x] #846 (q5, q7, q8, q20, q21)
- [x] https://github.com/apache/datafusion-comet/issues/1208

### Planned features that could help in general

- [x] https://github.com/apache/datafusion-comet/issues/1123
- [x] https://github.com/apache/datafusion-comet/issues/1006
- [x] https://github.com/apache/datafusion-comet/issues/708
- [ ] Implement native RowToColumnarExec
- [ ] #951
- [ ] #942
- [ ] #963
- [ ] #938

### Issues that affect multiple queries

- [ ] Scans are sometimes slower due to dictionary encoding or decoding, and it may be better if we can defer this until later in the query, but this is not really possible at the moment because DataFusion requires that all batches with a stream have the same physical type, so we cannot match Utf and Dictionary<Utf8> for example
- [ ] CometExchange is sometimes slower than Spark's exchange even though it reads and writes less data.

### Per-Query Tracking

Most of these queries are already faster with Comet enabled. Here are notes on areas where performance could potentially be improved.

- q1
  - https://github.com/apache/datafusion-comet/issues/942
- q2
  - Some scans are slower, partly due to dictionary unpacking cost
  - Some exchanges are slower (could this also be due to premature unpacking of dictionaries?)
- q3
- q4
- q5
  - #846
- q6
- q7
  - #846
- q8
  - #846
- q9
  - 82% of the time is in SortExec + SortMergeJoinExec. Ballista uses a HashJoinExec.
- q10
- q11
- q12
- q13
- q14
  - `lineitem` scans take 2x longer in Comet, but this is offset by avoiding an expensive C2R. The time for native decoding in Comet is longer than the entire scan in Spark.
  - #942
- q15
- q16
  - #457
- q17
  - #398
- q18
- q19
  - #398
- q20
  - #846 
  - #398
- q21
  - #846 
  - #398
- q22




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EPIC] Improve performance of TPC-H queries #391

What is the problem the feature request solves?

Current status (September 2024)

Features needed to support all queries natively

Planned features that could help in general

Issues that affect multiple queries

Per-Query Tracking

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[EPIC] Improve performance of TPC-H queries #391

Description

What is the problem the feature request solves?

Current status (September 2024)

Features needed to support all queries natively

Planned features that could help in general

Issues that affect multiple queries

Per-Query Tracking

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions