Most of our APIs assume the input is a stream. This is nice because it supports larger-than-memory writes. However, if the data is already fully materialized, we can often do things more optimally. A few examples:
- If we have 2 million rows in memory that we want to insert, we can write the two data files in parallel. Currently we write them sequentially.
- To support retries for write operations, we buffer data on disk. This could be bypassed if the data is already in memory.
- For `merge_insert`, we can compute basic statistics like `num_rows` and `num_bytes`, which DataFusion can use to optimize the join order (as sketched after this list). Currently we always use the table id column as the build side, but for large tables that is suboptimal.
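For illustration, here is a minimal sketch (not the actual implementation; `materialized_stats` is a hypothetical helper) of how such statistics could be derived once the batches are in memory:

```rust
use arrow_array::RecordBatch;

/// Hypothetical helper: basic statistics for a set of in-memory batches.
/// Row count and in-memory byte size are cheap to compute for materialized
/// data, but unknowable up front for a stream.
fn materialized_stats(batches: &[RecordBatch]) -> (usize, usize) {
    let num_rows = batches.iter().map(|b| b.num_rows()).sum();
    let num_bytes = batches.iter().map(|b| b.get_array_memory_size()).sum();
    (num_rows, num_bytes)
}
```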
Having such an API would also support other downstream use cases: lancedb/lancedb#2602
## API
In Rust, define an enum and conversion traits so that generic APIs can accept common input types:
```rust
pub enum InputData {
    Stream(SendableRecordBatchStream),
    Materialized {
        batches: Vec<RecordBatch>,
        schema: SchemaRef,
    },
}

pub fn insert(data: impl Into<InputData>) { ... }

impl From<RecordBatch> for InputData { ... }
impl From<Vec<RecordBatch>> for InputData { ... }
impl From<Box<dyn RecordBatchReader>> for InputData { ... }
```
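As a concrete illustration (a minimal sketch, not the final implementation; it assumes the `InputData` enum above, and an empty `Vec` would need a schema supplied some other way), the `Vec<RecordBatch>` conversion could look like:

```rust
use arrow_array::RecordBatch;

impl From<Vec<RecordBatch>> for InputData {
    fn from(batches: Vec<RecordBatch>) -> Self {
        // Sketch assumption: at least one batch is present, so the schema
        // can be taken from the first batch.
        let schema = batches.first().expect("at least one batch").schema();
        InputData::Materialized { batches, schema }
    }
}
```

Callers can then pass a single `RecordBatch`, a `Vec<RecordBatch>`, or a boxed `RecordBatchReader` to the same `insert(data)` call without any explicit wrapping.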
In Python, we want to make sure the various supported inputs get converted to the correct variant (see the sketch after the lists below):
Materialized:
- `pa.Table`
- `pd.DataFrame`
- `pa.RecordBatch`

Stream:
- `pa.RecordBatchReader`
- `pa.Dataset`
- `pa.Scanner`
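A rough sketch of that dispatch on the Python side; the helper name `_coerce_input` and the returned tuples are illustrative, not the actual implementation:

```python
import pyarrow as pa
import pyarrow.dataset as ds


def _coerce_input(data):
    """Hypothetical helper: map supported inputs to either a list of
    record batches (materialized) or a RecordBatchReader (stream)."""
    # Materialized inputs
    if isinstance(data, pa.Table):
        return ("materialized", data.to_batches(), data.schema)
    if isinstance(data, pa.RecordBatch):
        return ("materialized", [data], data.schema)
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            table = pa.Table.from_pandas(data)
            return ("materialized", table.to_batches(), table.schema)
    except ImportError:
        pass  # pandas is optional
    # Streaming inputs
    if isinstance(data, pa.RecordBatchReader):
        return ("stream", data, data.schema)
    if isinstance(data, ds.Dataset):
        reader = data.scanner().to_reader()
        return ("stream", reader, reader.schema)
    if isinstance(data, ds.Scanner):
        reader = data.to_reader()
        return ("stream", reader, reader.schema)
    raise TypeError(f"Unsupported input type: {type(data)}")
```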
## TODO
- `InputData` and conversion traits (`impl Into<InputData>`)
- `merge_insert` converts `InputData::Materialized` into a MemTable instead of `OneShotPartitionStream` (sketched below)
- `new_source_iter` to not spill when using `InputData::Materialized`

(Note: we'll leave use case 1 for a follow-up.)
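For the `merge_insert` item above, a minimal sketch of the idea, assuming the `InputData` enum from the API section; the function name `register_source` and the table name `"source"` are illustrative only:

```rust
use std::sync::Arc;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// When the input is already materialized, expose it to DataFusion as a
// MemTable so the planner can see its row count and size, instead of
// wrapping it in a one-shot stream.
fn register_source(ctx: &SessionContext, data: InputData) -> Result<()> {
    match data {
        InputData::Materialized { batches, schema } => {
            let table = MemTable::try_new(schema, vec![batches])?;
            let _ = ctx.register_table("source", Arc::new(table))?;
        }
        InputData::Stream(_stream) => {
            // Streams keep the existing one-shot execution path.
            todo!()
        }
    }
    Ok(())
}
```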