Skip to content

Push limit to sort #3528

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We can use the availability of a Limit to speed up sorts in DataFusion:

Describe the solution you'd like
We can start utilizing the limit in different places:

  • lexsort_to_indices supports a limit argument - this argument can be filled (if limit is smaller than batch size)
  • a LocalLimit can be added before SortPreservingMergeExec (after Execute sort in parallel when a limit is used after sort #3527), limiting the input for this
  • Probably SortPreservingMergeStream can start utilizing optional limits too - only merging the first n sorted rows for each batch.

This issue is about adding the limit to lexsort_to_indices. This will accomplish the following:

  • using the faster partial_sort
  • reducing the take input and thus size of output batches
  • reducing the memory needed for SortPreservingMergeExec (and spill to disk when running OOM / shuffle data in Ballista)

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestperformanceMake DataFusion faster

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions