Conversation

wesm commented Dec 13, 2018

This patch ended up being a bit more of a bloodbath than I planned: please accept my apologies.

Associated changes in this patch:

  • Split up builder.h/builder.cc into a new arrow/array directory. The public arrow/builder.h API is preserved. I expect this code to keep growing more specialized components, so we should get out ahead of it by having a subdirectory to contain the implementation-detail files.
  • Implement ChunkedBinaryBuilder, ChunkedStringBuilder classes, add tests and benchmarks
  • Deprecate parquet::arrow methods returning Array
  • Allow implicit construction of Datum from its variant types (makes for a lot nicer syntax; see the sketch below this list)
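
A minimal sketch of what the implicit Datum construction enables. This is illustrative only: the header path, the Consume function, and the Example wrapper are assumptions for the sketch, not code from this patch.

#include <memory>

#include "arrow/array.h"
#include "arrow/compute/kernel.h"  // assumed location of arrow::compute::Datum

namespace cp = arrow::compute;

// Hypothetical function taking a Datum argument.
void Consume(const cp::Datum& datum) {}

void Example(const std::shared_ptr<arrow::Array>& arr,
             const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  // Previously, callers had to wrap values in a Datum explicitly:
  Consume(cp::Datum(arr));

  // With implicit construction from the variant types, the wrapping disappears:
  Consume(arr);
  Consume(chunked);
}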

As far as what code to review, focus efforts on

  • src/parquet/arrow
  • src/arrow/array/builder_binary.h/cc, array-binary-test.cc, builder-benchmark
  • src/arrow/compute changes
  • Python changes

I'm going to tackle ARROW-2970, which should not be complicated after this patch; I will submit that as a PR after this one is reviewed and merged.

wesm commented Dec 13, 2018

cc @xhochy @pitrou

wesm commented Dec 13, 2018

@kszucs we aren't running the "large_memory" unit tests in Travis CI. What do you think about having a Docker target where we can run these so they can at least be spot-checked periodically?

pitrou left a comment

Partial review.

wesm commented Dec 14, 2018

I'm all done here; I'll just make sure the build is passing.

wesm commented Dec 14, 2018

@xhochy @pitrou would you like to leave any more comments? The Python build timed out but otherwise this is ready to merge

@@ -79,23 +80,6 @@ Status BinaryBuilder::AppendNextOffset() {
return offsets_builder_.Append(static_cast<int32_t>(num_bytes));
}

Status BinaryBuilder::Append(const uint8_t* value, int32_t length) {

Member:

I don't think it's useful to inline those if AppendNextOffset is not inlined.


wesm (Member Author):

Good point. I'll inline AppendNextOffset then
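
For context, a rough sketch of what the inlining might look like: a fragment of the builder header with both methods defined in-class so calls can be inlined. The member types, the ARROW_RETURN_NOT_OK usage, and the omission of the int32 offset-overflow check are assumptions for illustration, not the exact patch.

// Fragment of a hypothetical builder_binary.h; surrounding declarations omitted.
class BinaryBuilder : public ArrayBuilder {
 public:
  // Defined in the header so the compiler can inline the hot append path.
  Status Append(const uint8_t* value, int32_t length) {
    ARROW_RETURN_NOT_OK(AppendNextOffset());
    ARROW_RETURN_NOT_OK(value_data_builder_.Append(value, length));
    UnsafeAppendToBitmap(true);
    return Status::OK();
  }

 protected:
  // Also defined inline, per the exchange above; the real version additionally
  // guards against exceeding the 2^31 - 1 byte offset limit.
  Status AppendNextOffset() {
    const int64_t num_bytes = value_data_builder_.length();
    return offsets_builder_.Append(static_cast<int32_t>(num_bytes));
  }

  TypedBufferBuilder<int32_t> offsets_builder_;
  TypedBufferBuilder<uint8_t> value_data_builder_;
};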

pitrou commented Dec 14, 2018

This PR seems basically fine to me. I posted a few minor comments.

wesm added 11 commits December 14, 2018 13:01
…s. add

failing test case for ARROW-3762. Add ChunkedBinaryBuilder, make BinaryBuilder
Append methods inline
Change-Id: I0eced60a1f8e16096a1b441b622ba750d1d59ca6
…ction of arrow::compute::Datum

Change-Id: I483059a545c69a9b25d543faad641785da6bea29
…row test suite passing

Change-Id: Icb260f6ffc4f41ee7519653bf8d3f48c2da30091
Change-Id: I35ab3ace0e4ca7a80fc7d85e55ac55ea222b15dc
Change-Id: I8f0a35ae4e8581790f7731ee2ed023a54caf0f31
Change-Id: I7fac456a34aa81683fa7315ae1b287be7f0d16e0
Change-Id: I47f93c7d8561b83414ab34f709fec66a6eb462d2
Change-Id: I8266354f04c8e14819fe4c72d28474e09843c13c
…tOffset

Change-Id: Ibfc09617b365c937e7af6a4943c274843f6e7a33
wesm commented Dec 14, 2018

The inlining of BinaryBuilder methods produces a meaningful benchmark improvement

before

$ ./release/arrow-builder-benchmark --benchmark_filter=BinaryArray
2018-12-14 13:09:07
Running ./release/arrow-builder-benchmark
Run on (8 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
-------------------------------------------------------------------------------------
Benchmark                                              Time           CPU Iterations
-------------------------------------------------------------------------------------
BM_BuildBinaryArray/min_time:1.000                423367 us     423376 us          3   377.915MB/s

after

$ ./release/arrow-builder-benchmark --benchmark_filter=BinaryArray
2018-12-14 13:10:27
Running ./release/arrow-builder-benchmark
Run on (8 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
----------------------------------------------------------------------------------------
Benchmark                                                 Time           CPU Iterations
----------------------------------------------------------------------------------------
BM_BuildBinaryArray/min_time:1.000                   387697 us     387698 us          3   412.692MB/s
BM_BuildChunkedBinaryArray/min_time:1.000            370996 us     371003 us          4   431.263MB/s

pitrou commented Dec 14, 2018

Nice :-)

Change-Id: I48147645784402e7cf004a82151d66f337d1664e
wesm commented Dec 14, 2018

+1

wesm closed this in 73f94c9 Dec 14, 2018
wesm deleted the ARROW-3762 branch December 14, 2018 21:17
kszucs commented Dec 15, 2018

@wesm I created an issue for running the large memory tests: ARROW-4046

wesm commented Dec 15, 2018

thanks =)

zygm0nt commented Jun 12, 2019

I still see this error when using 0.13.0; I also tested with 0.12.0.

The code I've tested this with is the exact same code as in ARROW-4046:

import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()

wesm commented Jun 12, 2019

You mean a different JIRA than https://issues.apache.org/jira/browse/ARROW-4046, right? Can you post this on the appropriate JIRA issue or create a new one so we can track this? Thanks

zygm0nt commented Jun 12, 2019

Right, my mistake. I meant this one: https://issues.apache.org/jira/browse/ARROW-3762

yogeshg commented Jun 15, 2019

Hi @wesm, big fan of your work!
Is there a corresponding issue in Java? I could not find one. I am getting this error in Spark for the same dataset that throws the above error in Python.

SparkException: Job aborted due to stage failure: Task 9 in stage 5.0 failed 4 times, most recent failure: Lost task 9.3 in stage 5.0 (TID 255, 10.30.253.49, executor 4): org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details: 
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:259)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:389)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:377)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable$$anonfun$run$1.apply$mcV$sp(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable$$anonfun$run$1.apply(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable$$anonfun$run$1.apply(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:63)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:90)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -34
    at java.util.ArrayList.<init>(ArrayList.java:157)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:811)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 18 more

Driver stacktrace:

wesm commented Jun 15, 2019

@yogeshg can you open a JIRA issue in either Arrow or Spark? I think this is something that will have to be handled on the Spark side. cc @BryanCutler

yogeshg commented Jun 15, 2019 via email

BryanCutler commented

@yogeshg, this might be the related issue from the Java MR Parquet reader that Spark uses: https://issues.apache.org/jira/browse/PARQUET-980. Please open another JIRA if it is not.
