ARROW-17599: [C++] Change the way how arrow reads parquet buffered files #14226
Conversation
westonpace left a comment:
I think I see what you are after. I'm not entirely certain if it works or not. I had kind of thought the extra calls to PreBuffer would happen inside the generator.
cpp/src/parquet/arrow/reader.cc (Outdated)
Here we are calling PreBuffer multiple times. Each time we call it we haven't yet finished reading from the time before. Will this work?
You are right, Weston; it doesn't work.
It seems we need to wait for the prebuffering process to finish (for each parquet row group) before we read each row group. I sent new changes; the relevant part is in RowGroupGenerator::FetchNext (there is a Future::Wait there).
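For context, a minimal sketch of that idea (member names such as `reader_`, `column_indices_`, `io_context_` and `cache_options_` are assumptions, not the literal patch; the actual change lives in RowGroupGenerator::FetchNext in reader.cc):

```cpp
// Sketch only: pre-buffer a single row group and block until its byte
// ranges are cached before kicking off the read of that row group.
::arrow::Future<RecordBatchGenerator> RowGroupGenerator::FetchNext() {
  if (static_cast<size_t>(index_) >= row_groups_.size()) {
    return ::arrow::AsyncGeneratorEnd<RecordBatchGenerator>();
  }
  const int row_group = row_groups_[index_++];
  // PreBuffer returns void; errors surface later when cached ranges are read.
  reader_->parquet_reader()->PreBuffer({row_group}, column_indices_,
                                       io_context_, cache_options_);
  // The Future::Wait mentioned above: don't start reading until the
  // prebuffering for this row group has finished.
  reader_->parquet_reader()->WhenBuffered({row_group}, column_indices_).Wait();
  return ReadOneRowGroup(cpu_executor_, reader_, row_group, column_indices_);
}
```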
cpp/src/parquet/arrow/reader.cc (Outdated)
This seems a bit over-complicated. Can we push the extra calls to PreBuffer down a layer into the RowGroupGenerator itself?
done!
Do you have a test case where we verify that we don't actually retain the RAM when reading the entire file?
Thanks Weston. Do you have an idea about how to write such a test? I think we will still need to keep data in RAM, right? Given that ReadRangeCache::read retains the contents of the buffered data until the reader expires.
Do you think it is a good idea to use TestArrowReadWrite.GetRecordBatchGenerator as a base for writing the new test you are requesting?
Disregard this, please; I'll try to use the memory pool to get the memory stats.
I added a new test called TestArrowReadWrite.ReadShouldNotRetainRam
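The gist of such a check, as a hedged sketch (the real test body may differ; the helper function, its parameter, and the use of ReadTable here are assumptions):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

// Hedged sketch of a memory-pool based "no retained RAM" check. `file` is
// assumed to point at a multi-row-group parquet file.
arrow::Status CheckReadDoesNotRetainRam(
    std::shared_ptr<arrow::io::RandomAccessFile> file) {
  arrow::MemoryPool* pool = arrow::default_memory_pool();

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(std::move(file)));
  parquet::arrow::ArrowReaderProperties props;
  props.set_pre_buffer(true);  // exercise the code path this PR changes
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(
      builder.memory_pool(pool)->properties(props)->Build(&reader));

  const int64_t before = pool->bytes_allocated();
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  table.reset();  // drop our reference to the decoded data
  // If the reader's cache no longer pins every row group, allocation should
  // fall back toward `before` even while `reader` is still alive.
  const int64_t retained = pool->bytes_allocated() - before;
  return retained == 0
             ? arrow::Status::OK()
             : arrow::Status::Invalid("reader retained ", retained, " bytes");
}
```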
Force-pushed from f74bf85 to f7a07ba.
I also experimented with the python script provided in this related/duplicated Jira ticket: https://issues.apache.org/jira/browse/ARROW-17590. Output of the six runs (the dataframe dump is identical across runs and is shown only for the first one):

```
================================
0 rss: 88.390625 MB
1 rss: 1374.640625 MB
pa.total_allocated_bytes 43.61480712890625 MB dt.nbytes 0.0014410018920898438 MB
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 c21 c22 c23 c24 c25 c26 c27 ... c572 c573 c574 c575 c576 c577 c578 c579 c580 c581 c582 c583 c584 c585 c586 c587 c588 c589 c590 c591 c592 c593 c594 c595 c596 c597 c598 c599
0 125000 ... None None None None None None None None None None None None None None None None None None None None None None None None None None None None
[1 rows x 600 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Columns: 600 entries, c0 to c599
dtypes: object(600)
memory usage: 23.9 KB
2 rss: 1294.765625 MB
3 rss: 1294.765625 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
================================
0 rss: 87.5 MB
1 rss: 728.921875 MB
pa.total_allocated_bytes 9.8636474609375 MB dt.nbytes 0.0014410018920898438 MB
[dataframe dump omitted -- identical to the first run]
2 rss: 731.375 MB
3 rss: 731.375 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
================================
0 rss: 87.703125 MB
1 rss: 729.5 MB
pa.total_allocated_bytes 9.8636474609375 MB dt.nbytes 0.0014410018920898438 MB
[dataframe dump omitted -- identical to the first run]
2 rss: 610.328125 MB
3 rss: 610.328125 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
================================
0 rss: 87.484375 MB
1 rss: 729.859375 MB
pa.total_allocated_bytes 9.8636474609375 MB dt.nbytes 0.0014410018920898438 MB
[dataframe dump omitted -- identical to the first run]
2 rss: 732.34375 MB
3 rss: 732.34375 MB
pyarrow 10.0.0.dev4072+gc32f988f5.d20221014 pandas 1.5.0 numpy 1.23.3
================================
0 rss: 87.828125 MB
1 rss: 1385.734375 MB
pa.total_allocated_bytes 9.7957763671875 MB dt.nbytes 0.0014410018920898438 MB
[dataframe dump omitted -- identical to the first run]
2 rss: 1538.9375 MB
3 rss: 1546.4375 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
================================
0 rss: 87.8125 MB
1 rss: 1431.546875 MB
pa.total_allocated_bytes 9.7957763671875 MB dt.nbytes 0.0014410018920898438 MB
[dataframe dump omitted -- identical to the first run]
2 rss: 1570.390625 MB
3 rss: 1573.8125 MB
pyarrow 10.0.0.dev4070+gfb087669a.d20221013 pandas 1.5.0 numpy 1.23.3
```
Force-pushed from 4a9780d to 25b12ce.
Follow-up ticket for the IPC file reader:
cpp/src/parquet/arrow/reader.cc (Outdated)
Hmm, why are we synchronously blocking, then attaching a callback to the future in the next step? Something seems off here
Thanks David. I got this error when I don't use wait:
ReadRangeCache did not find matching cache entry
It's as if the parquet_reader()->WhenBuffered call is not really waiting for the buffer to be ready before the row group is read.
I'll investigate more what the issue could be here.
It seems that, somehow, when we trigger ReadOneRowGroup for the same row group and it starts reading, the cache entries are empty (even though I could confirm that PreBuffer was called before).
I made a simple change: when we don't transfer the future using
if (cpu_executor_) ready = cpu_executor_->TransferAlways(ready);
it works without wait.
I think that when we transfer the future, the PreBuffer that is running now (in a different thread) cannot finish before the future triggers ReadOneRowGroup (because the future was already transferred to a different thread).
So perhaps we just need a way to make sure PreBuffer is called beforehand, in the same thread as the future.
I sent new changes to avoid the use of Future::Wait (in the main thread) and to run the prebuffering and read operations together in the same thread context (one after the other).
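One plausible shape of that change, as a hedged sketch (not the literal diff; names follow the earlier FetchNext sketch and are assumptions):

```cpp
// Chain the read as a continuation of the buffering future instead of
// blocking a thread with Wait(): the lambda only runs once the row group's
// byte ranges are in the cache, in the same execution context.
reader_->parquet_reader()->PreBuffer({row_group}, column_indices_,
                                     io_context_, cache_options_);
return reader_->parquet_reader()
    ->WhenBuffered({row_group}, column_indices_)
    .Then([=]() {
      return ReadOneRowGroup(cpu_executor_, reader_, row_group,
                             column_indices_);
    });
```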
Force-pushed from 25b12ce to e8186e9.
Force-push commit messages:
- move PreBuffer down into the RowGroupGenerator and draft the new test
- Don't concatenate RecordBatchGenerator and add more ideas for the test
- Avoid keeping the RAM when using the RecordBatchGenerator: Use MakeMappedGenerator instead of MakeConcatenatedGenerator to create the async generator.
- cleaning ...
- format
- concatenate the async generator when we are using prebuffering
- format (I forgot to update archery linters to clang-tools 14)
- prefer concatenation instead to not break parquet pytest
- fix some issues using arrow::Future
- force wait the buffering finish before read (all of this is async)
Force-pushed from e8186e9 to 2548cab.
There are some CI errors that seem unrelated, and the CI job for macOS 11 was canceled (not sure why).

```
debug/arrow-dataset-file-parquet-test.exe
[----------] Global test environment tear-down
[==========] 44 tests from 3 test suites ran. (3093 ms total)
[  PASSED  ] 44 tests.

debug/parquet-reader-test.exe
[----------] Global test environment tear-down
[==========] 76 tests from 26 test suites ran. (518 ms total)
[  PASSED  ] 71 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] TestDumpWithLocalFile.DumpOutput
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] TestBooleanRLE.TestBooleanScanner
[  FAILED  ] TestBooleanRLE.TestBatchRead
[  FAILED  ] TestTextDeltaLengthByteArray.TestTextScanner
[  FAILED  ] TestTextDeltaLengthByteArray.TestBatchRead
 4 FAILED TESTS

debug/parquet-arrow-internals-test.exe
[----------] Global test environment tear-down
[==========] 78 tests from 2 test suites ran. (76 ms total)
[  PASSED  ] 78 tests.

debug/arrow-io-buffered-test.exe
[----------] Global test environment tear-down
[==========] 22 tests from 3 test suites ran. (137 ms total)
[  PASSED  ] 22 tests.
```
I sent a minor change, and I noticed again that the CI job for Windows msys2-mingw64 does not seem stable enough (it could not even start building).
I built on Windows again, this time using the msys2-mingw32 toolchain, and I was able to run the unit tests without issues.
The diff hunk under discussion:

```cpp
END_PARQUET_CATCH_EXCEPTIONS
auto wait_buffer =
    reader->parquet_reader()->WhenBuffered({row_group}, column_indices);
wait_buffer.Wait();
```
Hmm…I think anything calling Wait() in a callback/async context is not going to be right.
I think the issue is that the pre-buffer code doesn't handle concurrent use. The Wait() is effectively just working around that by blocking the thread so that there's no sharing. However, if you attach a reentrant readahead generator to it, I'd guess it'd still fail. So I think either the internals should be refactored so that they handle concurrent use, or we should just create a separate ReadRangeCache per row group. (The advantage of the latter is that you'd have a harder bound on memory usage.)
However, either way this loses 'nice' properties of the original, buffer-entire-file approach (e.g. small row groups can get combined together for I/O). IMO, the longer-term solution would be to disentangle the 'cache' and 'coalesce' behaviors (and possibly even remove the 'cache' behavior, which may make more sense as a wrapper over RandomAccessFile?) and try the approach proposed in the original JIRA, which would be to coalesce ranges, then track when ranges are actually read and remove the buffer from the coalescer once all ranges mapping to a given buffer are read. (The buffer may be kept alive downstream due to shared usage, though.) Or maybe that's still overly fancy.
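For concreteness, the per-row-group cache alternative could look roughly like this, reusing the existing arrow::io::internal::ReadRangeCache (a hedged sketch; the function and its inputs are assumptions, not code from the PR):

```cpp
#include <memory>
#include <vector>

#include <arrow/io/caching.h>
#include <arrow/io/interfaces.h>

// Give each row group its own ReadRangeCache; dropping the cache after the
// row group is read releases its buffers, bounding memory to roughly one
// row group at a time.
arrow::Status ReadRowGroupWithOwnCache(
    std::shared_ptr<arrow::io::RandomAccessFile> file,
    const arrow::io::IOContext& io_context,
    std::vector<arrow::io::ReadRange> ranges_for_row_group) {
  auto cache = std::make_unique<arrow::io::internal::ReadRangeCache>(
      file, io_context, arrow::io::CacheOptions::Defaults());
  ARROW_RETURN_NOT_OK(cache->Cache(std::move(ranges_for_row_group)));
  // ... decode this row group via cache->Read(range) calls ...
  cache.reset();  // this row group's cached buffers can now be freed
  return arrow::Status::OK();
}
```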
Thanks David, I'll close this PR in favor of https://issues.apache.org/jira/browse/ARROW-18113
Closing this PR in favor of https://issues.apache.org/jira/browse/ARROW-18113
Jira ticket: https://issues.apache.org/jira/browse/ARROW-17599
Given that the API of ReadRangeCache::read retains the buffer handles until the end of the file reader's life, we need to change the way the parquet reader reads buffered data; this is a potential solution to avoid loading all the row groups in memory.

There are historical reasons for the current design of ReadRangeCache::read, so this PR will not change that API. Instead, it changes the way we use the pre-buffering process for reading parquet files (there will be a similar PR later to change the behavior of the IPC reader as well).

Additionally, this PR will add:

- A test over ReadRangeCache::read to make sure ReadRangeCache is retaining the memory.
- An update to the API doc for ReadRangeCache::read to indicate that the buffered data outlives reads until the end of the file reader's scope/life.
- Changes to FileReaderImpl::GetRecordBatchGenerator (a short sketch of the retention behavior this works around follows below).
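As an illustration of the retention behavior described above, here is a hedged, self-contained demo against arrow::io::internal::ReadRangeCache (the function name and the sizes are made up for the example):

```cpp
#include <string>

#include <arrow/api.h>
#include <arrow/io/caching.h>
#include <arrow/io/memory.h>

// Once a range is cached, ReadRangeCache hands out buffers backed by its
// cache, and those cache entries stay allocated for as long as the cache
// (here: as long as the file reader that owns it) is alive.
arrow::Status RetentionDemo() {
  auto file = std::make_shared<arrow::io::BufferReader>(
      arrow::Buffer::FromString(std::string(1024, 'x')));
  arrow::io::internal::ReadRangeCache cache(
      file, arrow::io::IOContext(), arrow::io::CacheOptions::Defaults());
  ARROW_RETURN_NOT_OK(cache.Cache({{0, 512}}));
  ARROW_ASSIGN_OR_RAISE(auto buf, cache.Read({0, 256}));
  // `buf` is a slice of the 512-byte cache entry; that entry is not freed
  // after this Read -- it lives until `cache` is destroyed. This is the
  // behavior the PR works around for parquet row groups.
  return arrow::Status::OK();
}
```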