-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-6077: [C++][Parquet] Build Arrow "schema tree" from Parquet schema to help with nested data implementation #4971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4971 +/- ##
==========================================
+ Coverage 87.55% 89.15% +1.6%
==========================================
Files 1000 723 -277
Lines 142303 101705 -40598
Branches 1418 0 -1418
==========================================
- Hits 124589 90678 -33911
+ Misses 17352 11027 -6325
+ Partials 362 0 -362
Continue to review full report at Codecov.
|
|
Looks like I have to rebase. Any thoughts about this? I need to stack my patches on top |
|
@wesm I have a pending review I started yesterday night, let me complete this morning (lot to review!) |
|
Okay. Quite a bit code has been moved around relatively unmodified so please refrain from commenting on old code. Thanks =) |
|
I'll try to, github review interface does not make this easy to do :) |
|
I had a read through and looked good to me. I think this or your previous refactoring PR might have fixed ARROW-6005. I'll plan to use that issue to add some unittests for that particular workflow. |
| @@ -178,16 +685,13 @@ std::shared_ptr<ChunkedArray> CastChunksTo( | |||
| } | |||
|
|
|||
| Status TransferDictionary(RecordReader* reader, | |||
| const std::shared_ptr<DataType>& logical_value_type, | |||
| std::shared_ptr<DataType> logical_value_type, | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this change from passing by const ref to value threw me a bit. Could warrant a comment? Or does it make sense to keep the signature as accepting a const ref and move this:
::arrow::dictionary(::arrow::int32(), logical_value_type)
into TransferDictionary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reverting this change, it was a leftover from mid-way through the patch
| return Status::OK(); | ||
| } | ||
|
|
||
| inline bool EndswithTuple(const std::string& str) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code was moved from elsewhere but you could replace this with boost::algorithm::ends_with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, doing that then =)
cpp/src/parquet/arrow/reader.cc
Outdated
|
|
||
| // ---------------------------------------------------------------------- | ||
| // FileReaderImpl forward declaration | ||
|
|
||
| std::vector<int> Arange(int length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| std::vector<int> Arange(int length) { | |
| static std::vector<int> Arange(int length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to anonymous namespace
cpp/src/parquet/arrow/reader.cc
Outdated
|
|
||
| class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader { | ||
| public: | ||
| explicit RowGroupRecordBatchReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| explicit RowGroupRecordBatchReader( | |
| RowGroupRecordBatchReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| Status GetColumnField(int column_index, const SchemaField** out) const { | ||
| auto it = column_index_to_field.find(column_index); | ||
| if (it == column_index_to_field.end()) { | ||
| return Status::KeyError("Column index ", column_index, "is not a leaf"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a misleading error message since non-leaf schema fields don't have indices
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tricky thing to have any kind of error here, since users should not see this. I am changing to say that the schema manifest may be malformed
| ::arrow::MemoryPool* pool, std::shared_ptr<Array>* out) { | ||
| if (field->type()->num_children() > 0 && arr->type_id() == ::arrow::Type::DICTIONARY) { | ||
| // XXX(wesm): Handling of nested types and dictionary encoding needs to be | ||
| // significantly refactored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate? Is this in reference to #4956 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem that I'm referencing here is fixed by this patch, I will remove the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened https://issues.apache.org/jira/browse/ARROW-6105 so this gets tested
| RETURN_NOT_OK(valid_bits_builders[j]->Append(false)); | ||
| null_counts[j]++; | ||
| break; | ||
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } else { | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is old code
| static_cast<int32_t>(offset_builders[j + 1]->length()))); | ||
| } | ||
|
|
||
| if (((empty_def_level[j] - 1) == def_levels[i]) && (nullable[j])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if (((empty_def_level[j] - 1) == def_levels[i]) && (nullable[j])) { | |
| if (nullable[j] && empty_def_level[j] - 1 == def_levels[i]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old code
|
|
||
| // Walk downwards to extract nullability | ||
| std::vector<bool> nullable; | ||
| std::vector<std::shared_ptr<::arrow::Int32Builder>> offset_builders; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be easier than offset_builders and valid_bits_builders:
MakeBuilder(pool, list(list(list(/*...*/null()/*...*/))), &root_builder);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old code
Initial ugly refactoring More ugly refactoring. Not quite there yet More non-working refactoring Schema tests pass again Fix node filtering logic C++ unit tests passing again, create dictionary subfields Move symbols to reader.h Remove bindings for removed GetSchema API Remove API Fix R bindings
a5187c2 to
e1f19c0
Compare
|
Thank you for the feedback. I'll merge this once the build passes |
|
Travis CI: https://travis-ci.org/wesm/arrow/builds/566680039 The Appveyor build isn't done yet so I'll wait a little bit to make sure MinGW is ok |
|
One of the Windows builds is hanging, it doesn't appear related to this patch because I found it in a build in master. I opened https://issues.apache.org/jira/browse/ARROW-6108 Merging this in the meantime |
Introduces auxiliary internal
SchemaManifestandSchemaFielddata structures.This also permits dictionary-encoded subfields in a slightly more principled way (the dictionary type creation is resolved one time, so this removes the
FixSchemahacks that were there before). I rewrote the nested schema conversion logic to hopefully be slightly easier to follow though it could still use some work. I added comments within to explain the 3 different styles of list encodingThere are a couple of API changes:
FileReader::GetSchema(indices, &schema)method has been removed. The way that "projected" schemas were being constructed was pretty hacky, and this function is non-essential to the operation of the class. I had to remove bindings in the GLib and R libraries for this function, but as far as I can tell these bindings were non-essential to operation, and were added only because the function was there to wrap.FileWriter::Makefactory method, making constructor privateThis patch was pretty unpleasant to do -- it removes some hacky functions used to create Arrow fields with leaf nodes trimmed. There is little functional change; it is an attempt to bring a cleaner structure for full-fledged nested data reading
I'm going to get on with seeing through user-facing dictionary-encoding functionality in Python