Skip to content

Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered #10316

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

When merging a large number of pre-sorted streams (e.g. in our case, a large number of pre-sorted parquet files) the actual work in SortPreservingMerge to keep them sorted is often substantial (as the sort key of each row in each stream must be compared the other potential candidates)

Here is the sort preserving merge

/// Sort preserving merge execution plan
///
/// This takes an input execution plan and a list of sort expressions, and
/// provided each partition of the input plan is sorted with respect to
/// these sort expressions, this operator will yield a single partition
/// that is also sorted with respect to them
///
/// ```text
/// ┌─────────────────────────┐
/// │ ┌───┬───┬───┬───┐ │
/// │ │ A │ B │ C │ D │ ... │──┐
/// │ └───┴───┴───┴───┘ │ │
/// └─────────────────────────┘ │ ┌───────────────────┐ ┌───────────────────────────────┐
/// Stream 1 │ │ │ │ ┌───┬───╦═══╦───┬───╦═══╗ │
/// ├─▶│SortPreservingMerge│───▶│ │ A │ B ║ B ║ C │ D ║ E ║ ... │
/// │ │ │ │ └───┴─▲─╩═══╩───┴───╩═══╝ │
/// ┌─────────────────────────┐ │ └───────────────────┘ └─┬─────┴───────────────────────┘
/// │ ╔═══╦═══╗ │ │
/// │ ║ B ║ E ║ ... │──┘ │
/// │ ╚═══╩═══╝ │ Note Stable Sort: the merged stream
/// └─────────────────────────┘ places equal rows from stream 1
/// Stream 2
///
///
/// Input Streams Output stream
/// (sorted) (sorted)
/// ```
#[derive(Debug)]
pub struct SortPreservingMergeExec {

However, in some cases (such as @suremarc has identified in #6672) we can use information about how the values of the sort key columns are distributed to avoid needing a sort

For example, if we have three files that are each sorted by time and have the following ranges

  • file1.paquet min(time) = 2024-01-01 and max(time) = 2024-01-31
  • file2.paquet min(time) = 2024-02-01 and max(time) = 2024-02-28
  • file3.paquet min(time) = 2024-03-01 and max(time) = 2024-03-31

We can produce the output sorted stream by first reading file1.parquet entirely then file2.parquet, then file3.parquet

Not only will this be faster than using SortPreservingMerge it will require less intermediate memory as we don't need to read a batch from each input stream to begin producing output. For cases where there may be 100s of files, this can minimize the amount of concurrently outstanding requests substantially

Also, for a query that will not read the entire dataset (e.g. only wants the most recent values) it can be especially beneficial:

SELECT * FROM data ORDER BY time limit 10

In this case our example above would only ever read file1.parquet (wouldn't even open the others) if it had more than 10 rows

Describe the solution you'd like

I would like an operator that does not actually merge if not needed

Describe alternatives you've considered

@NGA-TRAN implemented the following operator in InfluxDB IOx ProgressiveEval which we have found works pretty well and has offered to contribute it back upstream

We wrote about using this operator here: https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/

/// ProgressiveEval return a stream of record batches in the order of its inputs.
/// It will stop when the number of output rows reach the given limit.
///
/// This takes an input execution plan and a number n, and provided each partition of
/// the input plan is in an expected order, this operator will return top record batches that covers the top n rows
/// in the order of the input plan.
///
/// ```text
/// ┌─────────────────────────┐
/// │ ┌───┬───┬───┬───┐       │
/// │ │ A │ B │ C │ D │       │──┐
/// │ └───┴───┴───┴───┘       │  │
/// └─────────────────────────┘  │  ┌───────────────────┐    ┌───────────────────────────────┐
///   Stream 1                   │  │                   │    │ ┌───┬───╦═══╦───┬───╦═══╗     │
///                              ├─▶│  ProgressiveEval  │───▶│ │ A │ B ║ C ║ D │ M ║ N ║ ... │
///                              │  │                   │    │ └───┴─▲─╩═══╩───┴───╩═══╝     │
/// ┌─────────────────────────┐  │  └───────────────────┘    └─┬─────┴───────────────────────┘
/// │ ╔═══╦═══╗               │  │
/// │ ║ M ║ N ║               │──┘                             │
/// │ ╚═══╩═══╝               │                Output only include top record batches that cover top N rows
/// └─────────────────────────┘                
///   Stream 2
///
///
///  Input Streams                                             Output stream
///  (in some order)                                           (in same order)
/// ```

Additional context

The original inspiration for this operator came from @pauldix (who I think mentioned it was inspired by ElasticSearch)

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions