Conversation

wesm commented Dec 13, 2018

This patch ended up being a bit more of a bloodbath than I planned: please accept my apologies.

Associated changes in this patch:

  • Split up builder.h/builder.cc into a new arrow/array directory. The public arrow/builder.h API is preserved. I expect this code to keep growing more specialized components, so we should get out ahead of it by having a subdirectory to contain the implementation-detail files.
  • Implement ChunkedBinaryBuilder, ChunkedStringBuilder classes, add tests and benchmarks
  • Deprecate parquet::arrow methods returning Array
  • Allow implicit construction of Datum from its variant types (makes for a lot nicer syntax; see the sketch below this list)
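
A minimal sketch of what the implicit Datum construction enables. This is illustrative only: the header path, the Consume function, and the Example wrapper are assumptions for the sketch, not code from this patch.

#include <memory>

#include "arrow/array.h"
#include "arrow/compute/kernel.h"  // assumed location of arrow::compute::Datum

namespace cp = arrow::compute;

// Hypothetical function taking a Datum argument.
void Consume(const cp::Datum& datum) {}

void Example(const std::shared_ptr<arrow::Array>& arr,
             const std::shared_ptr<arrow::ChunkedArray>& chunked) {
  // Previously, callers had to wrap values in a Datum explicitly:
  Consume(cp::Datum(arr));

  // With implicit construction from the variant types, the wrapping disappears:
  Consume(arr);
  Consume(chunked);
}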

As far as what code to review, focus efforts on

  • src/parquet/arrow
  • src/arrow/array/builder_binary.h/cc, array-binary-test.cc, builder-benchmark
  • src/arrow/compute changes
  • Python changes

I'm going to tackle ARROW-2970, which should not be complicated after this patch; I will submit that as a PR after this one is reviewed and merged.

wesm commented Dec 13, 2018

cc @xhochy @pitrou

wesm commented Dec 13, 2018

@kszucs we aren't running the "large_memory" unit tests in Travis CI. What do you think about having a Docker target where we can run these so they can at least be spot-checked periodically?

pitrou left a comment

Partial review.

wesm commented Dec 14, 2018

I'm all done here; I'll just make sure the build is passing.

wesm commented Dec 14, 2018

@xhochy @pitrou would you like to leave any more comments? The Python build timed out but otherwise this is ready to merge

@@ -79,23 +80,6 @@ Status BinaryBuilder::AppendNextOffset() {
return offsets_builder_.Append(static_cast<int32_t>(num_bytes));
}

Status BinaryBuilder::Append(const uint8_t* value, int32_t length) {

Member:

I don't think it's useful to inline those if AppendNextOffset is not inlined.


wesm (Member Author):

Good point. I'll inline AppendNextOffset then
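
For context, a rough sketch of what the inlining might look like: a fragment of the builder header with both methods defined in-class so calls can be inlined. The member types, the ARROW_RETURN_NOT_OK usage, and the omission of the int32 offset-overflow check are assumptions for illustration, not the exact patch.

// Fragment of a hypothetical builder_binary.h; surrounding declarations omitted.
class BinaryBuilder : public ArrayBuilder {
 public:
  // Defined in the header so the compiler can inline the hot append path.
  Status Append(const uint8_t* value, int32_t length) {
    ARROW_RETURN_NOT_OK(AppendNextOffset());
    ARROW_RETURN_NOT_OK(value_data_builder_.Append(value, length));
    UnsafeAppendToBitmap(true);
    return Status::OK();
  }

 protected:
  // Also defined inline, per the exchange above; the real version additionally
  // guards against exceeding the 2^31 - 1 byte offset limit.
  Status AppendNextOffset() {
    const int64_t num_bytes = value_data_builder_.length();
    return offsets_builder_.Append(static_cast<int32_t>(num_bytes));
  }

  TypedBufferBuilder<int32_t> offsets_builder_;
  TypedBufferBuilder<uint8_t> value_data_builder_;
};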

pitrou commented Dec 14, 2018

This PR seems basically fine to me. I posted a few minor comments.

wesm added 11 commits December 14, 2018 13:01
…s. add

failing test case for ARROW-3762. Add ChunkedBinaryBuilder, make BinaryBuilder
Append methods inline
Change-Id: I0eced60a1f8e16096a1b441b622ba750d1d59ca6
…ction of arrow::compute::Datum

Change-Id: I483059a545c69a9b25d543faad641785da6bea29
…row test suite passing

Change-Id: Icb260f6ffc4f41ee7519653bf8d3f48c2da30091
Change-Id: I35ab3ace0e4ca7a80fc7d85e55ac55ea222b15dc
Change-Id: I8f0a35ae4e8581790f7731ee2ed023a54caf0f31
Change-Id: I7fac456a34aa81683fa7315ae1b287be7f0d16e0
Change-Id: I47f93c7d8561b83414ab34f709fec66a6eb462d2
Change-Id: I8266354f04c8e14819fe4c72d28474e09843c13c
…tOffset

Change-Id: Ibfc09617b365c937e7af6a4943c274843f6e7a33
wesm commented Dec 14, 2018

The inlining of BinaryBuilder methods produces a meaningful benchmark improvement

before

$ ./release/arrow-builder-benchmark --benchmark_filter=BinaryArray
2018-12-14 13:09:07
Running ./release/arrow-builder-benchmark
Run on (8 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
-------------------------------------------------------------------------------------
Benchmark                                              Time           CPU Iterations
-------------------------------------------------------------------------------------
BM_BuildBinaryArray/min_time:1.000                423367 us     423376 us          3   377.915MB/s

after

$ ./release/arrow-builder-benchmark --benchmark_filter=BinaryArray
2018-12-14 13:10:27
Running ./release/arrow-builder-benchmark
Run on (8 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
----------------------------------------------------------------------------------------
Benchmark                                                 Time           CPU Iterations
----------------------------------------------------------------------------------------
BM_BuildBinaryArray/min_time:1.000                   387697 us     387698 us          3   412.692MB/s
BM_BuildChunkedBinaryArray/min_time:1.000            370996 us     371003 us          4   431.263MB/s

pitrou commented Dec 14, 2018

Nice :-)

Change-Id: I48147645784402e7cf004a82151d66f337d1664e
wesm commented Dec 14, 2018

+1

wesm closed this in 73f94c9 Dec 14, 2018
wesm deleted the ARROW-3762 branch December 14, 2018 21:17
kszucs commented Dec 15, 2018

@wesm I created an issue for running the large memory tests: ARROW-4046

wesm commented Dec 15, 2018

thanks =)

zygm0nt commented Jun 12, 2019

I still see this error when using 0.13.0; I also tested with 0.12.0.

The code I've tested this with is the exact same code as in ARROW-4046:

import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
    tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    t3 = pa.concat_tables(tables)

scenario()

wesm commented Jun 12, 2019

You mean a different JIRA than https://issues.apache.org/jira/browse/ARROW-4046, right? Can you post this on the appropriate JIRA issue or create a new one so we can track this? Thanks

zygm0nt commented Jun 12, 2019

Right, my mistake. I meant this one: https://issues.apache.org/jira/browse/ARROW-3762

yogeshg commented Jun 15, 2019

Hi @wesm, big fan of your work!
Is there a corresponding issue in Java? I could not find one. I am getting this error in Spark for the same dataset that throws the above error in Python.

SparkException: Job aborted due to stage failure: Task 9 in stage 5.0 failed 4 times, most recent failure: Lost task 9.3 in stage 5.0 (TID 255, 10.30.253.49, executor 4): org.apache.spark.sql.execution.QueryExecutionException: Encounter error while reading parquet files. One possible cause: Parquet column cannot be converted in the corresponding files. Details: 
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:259)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:389)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anonfun$prepareNextFile$1.apply(FileScanRDD.scala:377)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable$$anonfun$run$1.apply$mcV$sp(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable$$anonfun$run$1.apply(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable$$anonfun$run$1.apply(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:63)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:90)
    at org.apache.spark.util.threads.SparkThreadLocalCapturingRunnable.run(SparkThreadLocalForwardingThreadPoolExecutor.scala:93)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
    ... 14 more
Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -34
    at java.util.ArrayList.<init>(ArrayList.java:157)
    at org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:811)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 18 more

Driver stacktrace:

wesm commented Jun 15, 2019

@yogeshg can you open a JIRA issue in either Arrow or Spark? I think this is something that will have to be handled on the Spark side. cc @BryanCutler

yogeshg commented Jun 15, 2019 via email

BryanCutler commented

@yogeshg, this might be the related issue from the Java MR Parquet reader that Spark uses: https://issues.apache.org/jira/browse/PARQUET-980. Please open another JIRA if it is not.
