Description
Describe the enhancement requested
Previously, an issue (#35825) showed that directly reading large binary data through the dictionary path is not supported.
When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any single binary value is smaller than 2GB.
The Parquet binary reader is split into two styles of API, shown below:
```cpp
class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};
```

Neither of these APIs supports reading "LargeBinary". However, the first API is able to split the data into multiple separate chunks: when a BinaryBuilder reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting data to segments of large binary:
```cpp
Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out);
```

For the dictionary reader, although the API returns a std::shared_ptr<::arrow::ChunkedArray>, only one dictionary builder is actually used. I think we can apply the same approach to it.
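For reference, here is a minimal sketch (not from the original issue) of how the non-dictionary path can already be consumed as LargeBinary today: read the BYTE_ARRAY column through the regular reader, then cast the chunked binary result to large_binary. The function name ReadAsLargeBinary and the simplified error handling are illustrative only; it assumes the public parquet::arrow::FileReader API.

```cpp
// Sketch only: read a binary column via the regular (non-dictionary) path and
// cast the chunked result to LargeBinary.
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> ReadAsLargeBinary(
    const std::string& path, int column_index) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // The binary record reader rotates builders near 2GB, so the column comes
  // back as a ChunkedArray whose individual chunks each stay under the limit.
  std::shared_ptr<arrow::ChunkedArray> chunked;
  ARROW_RETURN_NOT_OK(reader->ReadColumn(column_index, &chunked));

  // The cast is applied chunk-wise, so the total logical size may exceed 2GB
  // even though each input chunk is below it.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(chunked, arrow::large_binary()));
  return casted.chunked_array();
}
```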
Pros: we can support reading more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated among the different dictionary chunks; maybe the user should call "Concat" on the result, as sketched below.
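As a rough illustration of that "Concat" step, the caller could concatenate the dictionary chunks returned by GetResult() into a single array. This is only a sketch: it assumes arrow::Concatenate can handle dictionary chunks whose per-chunk dictionaries differ (unifying them), which should be verified against the Arrow version in use.

```cpp
// Hypothetical usage of the "Concat" step mentioned in the cons above.
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

arrow::Result<std::shared_ptr<arrow::Array>> ConcatDictionaryChunks(
    const std::shared_ptr<arrow::ChunkedArray>& dict_chunks) {
  // Each chunk is dictionary-encoded; repeated values may appear in several
  // per-chunk dictionaries, which is the redundancy noted above. Concatenating
  // (assuming dictionary unification is supported) yields a single dictionary
  // array with int32 indices over one combined dictionary.
  return arrow::Concatenate(dict_chunks->chunks(), arrow::default_memory_pool());
}
```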
Component(s)
C++, Parquet