Allow batch size to be specified in bytes in file reader for 2.0 #4369

@westonpace

Description

Currently the file reader requires a batch size to be specified in rows. This is awkward because the ideal batch size is usually expressed in bytes (e.g. "fits in the CPU cache" or "less than 20 MB"), so users have to calibrate the row count to the shape of the data they are loading (and they never do this).

It would be much better to allow a target batch size to be expressed in bytes. Note: this won't be all that trivial. We will need to enhance the structural decoders to support this operation. It isn't too hard to solve for one column, but when reading multiple columns it can be difficult to know exactly where to break the batch.

I think a simple guess-and-check algorithm can probably strike a good balance (and ensure we always have a power-of-2 batch size, which has its own advantages):

  • Is 8Ki rows close? (nothing magical about 8Ki, but it should be a power of 2 and could depend on the target byte size)
  • If not, double or halve as appropriate.
  • Repeat until we get close (or hit min_batch_size or max_batch_size rows).
  • Remember the value used and start from there on the next call.
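The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not Lance code: `bytes_per_row` stands in for whatever per-row size estimate the structural decoders could provide, and all names, defaults, and the tolerance threshold are assumptions.

```python
def pick_batch_rows(target_bytes, bytes_per_row, start_rows=8192,
                    min_rows=64, max_rows=1 << 20, tolerance=0.25):
    """Return a power-of-2 row count whose byte size is 'close' to target_bytes.

    Hypothetical sketch of the guess-and-check loop described above:
    start at a power-of-2 guess, double or halve until the estimated
    batch size lands within `tolerance` of the target, and clamp to
    [min_rows, max_rows].
    """
    rows = start_rows
    prev_direction = 0
    while min_rows < rows < max_rows:
        size = rows * bytes_per_row
        if abs(size - target_bytes) <= tolerance * target_bytes:
            break  # close enough to the target byte size
        direction = 1 if size < target_bytes else -1
        if direction == -prev_direction:
            break  # oscillating around the target; settle on the current guess
        rows = rows * 2 if direction > 0 else rows // 2
        prev_direction = direction
    return max(min_rows, min(rows, max_rows))
```

The "remember the value" step would just mean the caller feeds the previous result back in as `start_rows` on the next call, so steady-state reads converge in one iteration.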
