Allow batch size to be specified in bytes in file reader for 2.0 #4369

@westonpace

Description

Currently the file reader requires a batch size to be specified in rows. This is awkward because the ideal batch size is usually expressed in bytes (e.g. "fits in the CPU cache" or "less than 20 MB"), so users have to calibrate the row count to the shape of the data they are loading (and they never do this).

It would be much better to allow a target batch size to be expressed in bytes. Note: this won't be all that trivial. We will need to enhance the structural decoders to support this operation. It isn't too hard to solve for one column, but when reading multiple columns it can be difficult to know exactly where to break the batch.

I think a simple guess-and-check algorithm can probably strike a good balance (and ensure we always have a power-of-2 batch size, which has its own advantages):

  • Is 8Ki rows close? (nothing magical about 8Ki, but it should be a power of 2 and could depend on the target byte size)
  • If not, double or halve as appropriate.
  • Repeat until we get close (or hit min_batch_size or max_batch_size rows).
  • Remember the value used and start from there on the next call.
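The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not Lance code: `bytes_per_row` stands in for whatever per-row size estimate the structural decoders could provide, and all names, defaults, and the tolerance threshold are assumptions.

```python
def pick_batch_rows(target_bytes, bytes_per_row, start_rows=8192,
                    min_rows=64, max_rows=1 << 20, tolerance=0.25):
    """Return a power-of-2 row count whose byte size is 'close' to target_bytes.

    Hypothetical sketch of the guess-and-check loop described above:
    start at a power-of-2 guess, double or halve until the estimated
    batch size lands within `tolerance` of the target, and clamp to
    [min_rows, max_rows].
    """
    rows = start_rows
    prev_direction = 0
    while min_rows < rows < max_rows:
        size = rows * bytes_per_row
        if abs(size - target_bytes) <= tolerance * target_bytes:
            break  # close enough to the target byte size
        direction = 1 if size < target_bytes else -1
        if direction == -prev_direction:
            break  # oscillating around the target; settle on the current guess
        rows = rows * 2 if direction > 0 else rows // 2
        prev_direction = direction
    return max(min_rows, min(rows, max_rows))
```

The "remember the value" step would just mean the caller feeds the previous result back in as `start_rows` on the next call, so steady-state reads converge in one iteration.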
