[C++][Parquet] ByteArray Reader: Extend current DictReader to support building a LargeBinary #41104

@mapleFU

Describe the enhancement requested

Previously, issue #35825 showed that reading large binary directly through the dictionary reader is not supported.

When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any single binary value is smaller than 2GB.

The Parquet binary reader is split into two styles of API, shown below:

class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};

Neither of the APIs above supports reading "LargeBinary" directly. However, the first API is able to split the strings across multiple separate chunks: when a BinaryBuilder reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting data to chunks of large binary:

Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out);
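
To make the rotation idea concrete, here is a minimal standalone sketch. It does not use the Arrow or Parquet classes: `ChunkedBinaryAccumulator` and its members are hypothetical names, and a small configurable byte limit stands in for the 2GB limit of a single binary chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunked binary accumulator; not the Arrow API.
class ChunkedBinaryAccumulator {
 public:
  explicit ChunkedBinaryAccumulator(std::size_t chunk_limit_bytes)
      : chunk_limit_bytes_(chunk_limit_bytes) {
    assert(chunk_limit_bytes_ > 0);
  }

  void Append(const std::string& value) {
    // Rotate before the current chunk would exceed the limit.
    if (!current_.empty() && current_bytes_ + value.size() > chunk_limit_bytes_) {
      Rotate();
    }
    current_bytes_ += value.size();
    current_.push_back(value);
  }

  // Flush the last partial chunk and hand back all chunks.
  std::vector<std::vector<std::string>> Finish() {
    if (!current_.empty()) Rotate();
    return std::move(chunks_);
  }

 private:
  void Rotate() {
    chunks_.push_back(std::move(current_));
    current_.clear();  // reset the moved-from vector to a known empty state
    current_bytes_ = 0;
  }

  std::size_t chunk_limit_bytes_;
  std::size_t current_bytes_ = 0;
  std::vector<std::string> current_;
  std::vector<std::vector<std::string>> chunks_;
};
```

Each finished chunk stays under the limit, so it could later be cast to a large binary chunk on its own, which is the part the cast in TransferColumnData would play in the real reader.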

For the dictionary reader, although the API returns a std::shared_ptr<::arrow::ChunkedArray>, only one dictionary builder is used. I think we can apply the same approach to it.

Pros: we can support reading more than 2GB of data into a dictionary column
Cons: data might be repeated among the dictionaries of different chunks. Maybe the user should call "Concat" on the result
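
The trade-off can be seen in a small standalone sketch. This is not the Parquet reader code: `DictChunk` and `ChunkedDictAccumulator` are hypothetical names, and a small byte limit on the dictionary values stands in for the 2GB limit. When the dictionary of the current chunk would outgrow the limit, the accumulator rotates to a fresh chunk with an empty dictionary, so a value seen earlier is stored again.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one dictionary-encoded chunk; not an Arrow type.
struct DictChunk {
  std::vector<std::string> dictionary;  // distinct values seen in this chunk
  std::vector<std::int32_t> indices;    // one index into `dictionary` per row
};

class ChunkedDictAccumulator {
 public:
  explicit ChunkedDictAccumulator(std::size_t dict_limit_bytes)
      : limit_(dict_limit_bytes) {
    assert(limit_ > 0);
  }

  void Append(const std::string& value) {
    auto it = memo_.find(value);
    if (it == memo_.end()) {
      // New value: rotate first if the dictionary would outgrow the limit.
      if (!current_.dictionary.empty() && dict_bytes_ + value.size() > limit_) {
        Rotate();
      }
      auto index = static_cast<std::int32_t>(current_.dictionary.size());
      it = memo_.emplace(value, index).first;
      current_.dictionary.push_back(value);
      dict_bytes_ += value.size();
    }
    current_.indices.push_back(it->second);
  }

  std::vector<DictChunk> Finish() {
    if (!current_.indices.empty()) Rotate();
    return std::move(chunks_);
  }

 private:
  void Rotate() {
    chunks_.push_back(std::move(current_));
    current_ = DictChunk{};
    memo_.clear();  // the next chunk starts with an empty dictionary
    dict_bytes_ = 0;
  }

  std::size_t limit_;
  std::size_t dict_bytes_ = 0;
  DictChunk current_;
  std::unordered_map<std::string, std::int32_t> memo_;
  std::vector<DictChunk> chunks_;
};
```

With a limit of 6 bytes and the input "foo", "bar", "foo", "baz", "foo", the first chunk's dictionary holds "foo" and "bar" and the second holds "baz" and "foo", so "foo" is stored twice. A user who needs a single dictionary would have to merge the chunk dictionaries afterwards.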

Component(s)

C++, Parquet
