Description
Describe the enhancement requested
Previously, an issue (#35825) showed that directly reading large binary data through the dictionary path is not supported.
When writing to Parquet, we don't allow a single ByteArray to exceed 2GB, so any single binary value is smaller than 2GB.
The Parquet binary reader is split into two styles of API, shown below:
```cpp
class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};
```

Neither of these APIs supports reading "LargeBinary". However, the first API is able to split the data into multiple separate chunks: when a BinaryBuilder reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting data to segments of large binary:
```cpp
Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out);
```

For the dictionary reader, although the API returns a std::shared_ptr<::arrow::ChunkedArray>, only one dictionary builder is actually used. I think we can apply the same approach to it.
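For reference, here is a minimal sketch (not from the original issue) of how the non-dictionary path can already be consumed as LargeBinary today: read the BYTE_ARRAY column through the regular reader, then cast the chunked binary result to large_binary. The function name ReadAsLargeBinary and the simplified error handling are illustrative only; it assumes the public parquet::arrow::FileReader API.

```cpp
// Sketch only: read a binary column via the regular (non-dictionary) path and
// cast the chunked result to LargeBinary.
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/io/file.h>
#include <parquet/arrow/reader.h>

arrow::Result<std::shared_ptr<arrow::ChunkedArray>> ReadAsLargeBinary(
    const std::string& path, int column_index) {
  ARROW_ASSIGN_OR_RAISE(auto infile, arrow::io::ReadableFile::Open(path));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  // The binary record reader rotates builders near 2GB, so the column comes
  // back as a ChunkedArray whose individual chunks each stay under the limit.
  std::shared_ptr<arrow::ChunkedArray> chunked;
  ARROW_RETURN_NOT_OK(reader->ReadColumn(column_index, &chunked));

  // The cast is applied chunk-wise, so the total logical size may exceed 2GB
  // even though each input chunk is below it.
  ARROW_ASSIGN_OR_RAISE(arrow::Datum casted,
                        arrow::compute::Cast(chunked, arrow::large_binary()));
  return casted.chunked_array();
}
```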
Pros: we can support reading more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated among the different dictionary chunks; maybe the user should call "Concat" on the result, as sketched below.
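As a rough illustration of that "Concat" step, the caller could concatenate the dictionary chunks returned by GetResult() into a single array. This is only a sketch: it assumes arrow::Concatenate can handle dictionary chunks whose per-chunk dictionaries differ (unifying them), which should be verified against the Arrow version in use.

```cpp
// Hypothetical usage of the "Concat" step mentioned in the cons above.
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

arrow::Result<std::shared_ptr<arrow::Array>> ConcatDictionaryChunks(
    const std::shared_ptr<arrow::ChunkedArray>& dict_chunks) {
  // Each chunk is dictionary-encoded; repeated values may appear in several
  // per-chunk dictionaries, which is the redundancy noted above. Concatenating
  // (assuming dictionary unification is supported) yields a single dictionary
  // array with int32 indices over one combined dictionary.
  return arrow::Concatenate(dict_chunks->chunks(), arrow::default_memory_pool());
}
```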
Component(s)
C++, Parquet