[EPIC] Improve performance of TPC-H queries #391

@andygrove

What is the problem the feature request solves?

This epic is for tracking progress on improving performance of Comet with our benchmarks derived from TPC-H.

Current status (September 2024)

  • Comet is 1.6x faster than Spark
  • Comet is not as fast as other DataFusion subprojects yet
  • All of these DataFusion subprojects perform similar native execution, so the gap suggests there is room to improve Comet's current performance

[Screenshot, 2024-09-20]

Features needed to support all queries natively

We do not run all queries fully natively yet due to these missing features:

Planned features that could help in general

Issues that affect multiple queries

  • Scans are sometimes slower due to dictionary encoding or decoding. It may be better to defer this until later in the query, but that is not currently possible because DataFusion requires all batches within a stream to have the same physical type, so we cannot mix Utf8 and Dictionary batches, for example.
  • CometExchange is sometimes slower than Spark's exchange even though it reads and writes less data.
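
To illustrate the dictionary-encoding point above, here is a minimal pure-Python sketch of why mixed encodings cannot share a stream. It does not use the Arrow or DataFusion APIs: `Batch`, `dictionary_encode`, `dictionary_decode`, and `check_stream` are hypothetical names invented for this example, and the type tags only loosely mirror Arrow's Utf8 and Dictionary types.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical type tags, loosely named after Arrow's Utf8 and Dictionary types.
UTF8 = "Utf8"
DICTIONARY = "Dictionary(Int32, Utf8)"


@dataclass
class Batch:
    """Toy stand-in for a single-column record batch (not an Arrow type)."""
    physical_type: str
    data: Any  # list of str for UTF8, (keys, values) tuple for DICTIONARY


def dictionary_encode(strings: List[str]) -> Batch:
    """Replace each string with an integer key into a list of distinct values."""
    values: List[str] = []
    index: Dict[str, int] = {}
    keys: List[int] = []
    for s in strings:
        if s not in index:
            index[s] = len(values)
            values.append(s)
        keys.append(index[s])
    return Batch(DICTIONARY, (keys, values))


def dictionary_decode(batch: Batch) -> Batch:
    """Expand a dictionary batch back into plain strings; pass UTF8 through."""
    if batch.physical_type == UTF8:
        return batch
    keys, values = batch.data
    return Batch(UTF8, [values[k] for k in keys])


def check_stream(batches: List[Batch], declared_type: str) -> None:
    """Reject a stream in which any batch differs from the declared type."""
    for i, batch in enumerate(batches):
        if batch.physical_type != declared_type:
            raise TypeError(
                f"batch {i} has type {batch.physical_type}, "
                f"but the stream declares {declared_type}"
            )


if __name__ == "__main__":
    mixed = [Batch(UTF8, ["a", "b"]), dictionary_encode(["x", "x", "y"])]
    try:
        check_stream(mixed, UTF8)
    except TypeError as err:
        print("rejected:", err)
    # Decoding every batch up front satisfies the check, at the cost of
    # doing the decode work at scan time rather than later in the query.
    check_stream([dictionary_decode(b) for b in mixed], UTF8)
    print("uniform stream accepted")
```

In this toy model, the only way to pass the uniform-type check is to decode (or encode) every batch before it enters the stream, which is the eager work that the first bullet suggests deferring.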

Per-Query Tracking

Most of these queries are already faster with Comet enabled. Here are notes on areas where performance could potentially be improved.
