[C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size

We working on parquet files that involve nested lists. At most they are multi-dimensional lists of simple types (never structs), but i understand, for Parquet, they're still nested columns and involve repetition levels. 

Some of these columns hold lists of rather large byte arrays (that dominate the overall size of the row). When we bump the `row_group_size` to above 16MB we see: 

 
```java

File "pyarrow/_parquet.pyx", line 700, in pyarrow._parquet.ParquetReader.read_row_group
 File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
```
 

I conclude it's [this](https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L848) bit complaining:

 
```java

template <typename ParquetType>
	Status PrimitiveImpl::WrapIntoListArray(Datum* inout_array) {
	if (descr_->max_repetition_level() == 0) {
	  // Flat, no action
	  return Status::OK();
	}
	
	std::shared_ptr<Array> flat_array;
	
	// ARROW-3762(wesm): If inout_array is a chunked array, we reject as this is
	// not yet implemented
	if (inout_array->kind() == Datum::CHUNKED_ARRAY) {
	  if (inout_array->chunked_array()->num_chunks() > 1) {
	    return Status::NotImplemented(
	      "Nested data conversions not implemented for "
	      "chunked array outputs");
```
 

This appears to happen in the callstack of ColumnReader::ColumnReaderImpl::NextBatch 
and it appears to be provoked by [this](https://github.com/apache/arrow/blob/de84293d9c93fe721cd127f1a27acc94fe290f3f/cpp/src/parquet/arrow/record_reader.cc#L604) constant:
```java

template <>     
void TypedRecordReader<ByteArrayType>::InitializeBuilder() {     
  // Maximum of 16MB chunks     
  constexpr int32_t kBinaryChunksize = 1 << 24;     
  DCHECK_EQ(descr_->physical_type(), Type::BYTE_ARRAY);       
  builder_.reset(
    new::arrow::internal::ChunkedBinaryBuilder(kBinaryChunksize, pool_));  }   
```
Which appears to imply that the column chunk data, if larger than kBinaryChunksize (hardcoded to 16MB), is returned as a Datum::CHUNKED_ARRAY of more than one (16MB) chunks. Which ultimatelly leads to the Status::NotImplemented error.

We have no influence over what data we ingest, we have some influence in how we flatten it and we need to tune our row_group_size to something sensibly larger than 16MB. 

We have see no obvious workaround for this and so we need to ask (1) if the above diagnosis appears to correct (2) do people see any sensible workarounds (3) is there an imminent intention to fix this in the Arrow community and if not, how difficult would it be to fix this (in case we can afford helping)

**Reporter**: [Remek Zajac](https://issues.apache.org/jira/browse/ARROW-4688)
**Assignee**: [Wes McKinney](https://issues.apache.org/jira/browse/ARROW-4688) / @wesm
#### Related issues:
- [[Python] read_row_group fails with Nested data conversions not implemented for chunked array outputs](https://github.com/apache/arrow/issues/21526) (is related to)
#### PRs and other links:
- [GitHub Pull Request #4023](https://github.com/apache/arrow/pull/4023)

<sub>**Note**: *This issue was originally created as [ARROW-4688](https://issues.apache.org/jira/browse/ARROW-4688). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size #21216

Related issues:

PRs and other links:

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[C++][Parquet] 16MB limit on (nested) column chunk prevents tuning row_group_size #21216

Description

Related issues:

PRs and other links:

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions