Implement fetch limit for MemoryExec #14337

@alamb

Is your feature request related to a problem or challenge?

We rely on MemoryExec for many of our tests, but MemoryExec doesn't support pushing limits down (aka "fetch"). This means that some bugs, such as the following, are harder to observe because in-memory tables have no limit pushdown test coverage:

Describe the solution you'd like

I would like MemoryExec to support "fetch" pushdown too, so that it mirrors the other sources and gives us additional test coverage.

Describe alternatives you've considered

One way would be:

  1. Add a MemoryExec::fetch field similar to
    /// Maximum number of rows to return
    fetch: usize,
  2. Implement the limit logic like this:
    if batch.num_rows() <= self.skip {
        self.skip -= batch.num_rows();
        RecordBatch::new_empty(input.schema())
    } else {
        let new_batch = batch.slice(self.skip, batch.num_rows() - self.skip);
        self.skip = 0;
        new_batch
    }
  3. Add the limit to the explain plan display:
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.debug_struct("MemoryExec")
            .field("partitions", &"[...]")
            .field("schema", &self.schema)
            .field("projection", &self.projection)
            .field("sort_information", &self.sort_information)
            .finish()
    }
  4. Add tests here
    use datafusion_common::stats::{ColumnStatistics, Precision};
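
As a rough sketch of how the fetch logic in step 2 could look, here is a dependency-free version. `Batch` and `LimitedBatches` are hypothetical stand-ins (not DataFusion or Arrow types); `Batch::slice` only mimics the `(offset, length)` shape of `RecordBatch::slice`, and modeling the limit as `Option<usize>` is an assumption:

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: just a vector of row values.
#[derive(Debug, Clone, PartialEq)]
struct Batch(Vec<i32>);

impl Batch {
    fn num_rows(&self) -> usize {
        self.0.len()
    }

    /// Mimics the `(offset, length)` shape of `RecordBatch::slice`.
    fn slice(&self, offset: usize, length: usize) -> Batch {
        Batch(self.0[offset..offset + length].to_vec())
    }
}

/// Hypothetical iterator that stops yielding rows once `fetch` rows were produced.
struct LimitedBatches {
    batches: std::vec::IntoIter<Batch>,
    /// Rows that may still be returned; `None` means no limit (an assumption).
    remaining: Option<usize>,
}

impl LimitedBatches {
    fn new(batches: Vec<Batch>, fetch: Option<usize>) -> Self {
        Self { batches: batches.into_iter(), remaining: fetch }
    }
}

impl Iterator for LimitedBatches {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        // Stop early once the limit is used up, without touching further input.
        if self.remaining == Some(0) {
            return None;
        }
        let batch = self.batches.next()?;
        match self.remaining.as_mut() {
            // No limit configured: pass the batch through unchanged.
            None => Some(batch),
            Some(remaining) => {
                if batch.num_rows() <= *remaining {
                    // The whole batch fits under the limit.
                    *remaining -= batch.num_rows();
                    Some(batch)
                } else {
                    // Only a prefix of the batch fits; truncate it.
                    let truncated = batch.slice(0, *remaining);
                    *remaining = 0;
                    Some(truncated)
                }
            }
        }
    }
}
```

With a fetch of 4 over batches of 3, 3, and 1 rows, this yields the first batch whole, a one-row prefix of the second, and then stops without reading the third.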

Additional context

I think this is a relatively self-contained project for a newcomer who has done Rust before but wants to get more experience with DataFusion and the engine part.

The only potential challenge is that this may expose more bugs such as #14335, and it will likely require updating a bunch of tests.

It might be good to do this in a few PRs:

  1. Add the limit to the MemoryExec and display in one PR (but don't actually limit the output)
  2. Implement the actual output limits
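
For the first PR, storing the limit and showing it in the plan display might look roughly like the sketch below. `MemorySource`, `with_fetch`, and the exact `fetch=N` output format are all hypothetical, chosen for illustration rather than taken from the DataFusion code:

```rust
use std::fmt;

/// Hypothetical, trimmed-down stand-in for `MemoryExec` with only the fields needed here.
struct MemorySource {
    partition_count: usize,
    /// Maximum number of rows to return; `None` means no limit (an assumption).
    fetch: Option<usize>,
}

impl MemorySource {
    fn new(partition_count: usize) -> Self {
        Self { partition_count, fetch: None }
    }

    /// Builder-style setter so an optimizer rule could push a limit into the source.
    fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.fetch = fetch;
        self
    }
}

impl fmt::Display for MemorySource {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "MemoryExec: partitions={}", self.partition_count)?;
        // Only print the limit when one is set, so plans without a limit stay unchanged.
        if let Some(fetch) = self.fetch {
            write!(f, ", fetch={fetch}")?;
        }
        Ok(())
    }
}
```

Printing the limit only when it is set would keep existing expected plans in tests unchanged until a limit is actually pushed down.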
