Skip to content

Conversation

@wesm
Copy link
Member

@wesm wesm commented Jul 31, 2019

Introduces auxiliary internal SchemaManifest and SchemaField data structures.

This also permits dictionary-encoded subfields in a slightly more principled way (the dictionary type creation is resolved one time, so this removes the FixSchema hacks that were there before). I rewrote the nested schema conversion logic to hopefully be slightly easier to follow though it could still use some work. I added comments within to explain the 3 different styles of list encoding

There are a couple of API changes:

  • The FileReader::GetSchema(indices, &schema) method has been removed. The way that "projected" schemas were being constructed was pretty hacky, and this function is non-essential to the operation of the class. I had to remove bindings in the GLib and R libraries for this function, but as far as I can tell these bindings were non-essential to operation, and were added only because the function was there to wrap.
  • Added FileWriter::Make factory method, making constructor private

This patch was pretty unpleasant to do -- it removes some hacky functions used to create Arrow fields with leaf nodes trimmed. There is little functional change; it is an attempt to bring a cleaner structure for full-fledged nested data reading

I'm going to get on with seeing through user-facing dictionary-encoding functionality in Python

@codecov-io
Copy link

codecov-io commented Jul 31, 2019

Codecov Report

Merging #4971 into master will increase coverage by 1.6%.
The diff coverage is 91.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4971      +/-   ##
==========================================
+ Coverage   87.55%   89.15%    +1.6%     
==========================================
  Files        1000      723     -277     
  Lines      142303   101705   -40598     
  Branches     1418        0    -1418     
==========================================
- Hits       124589    90678   -33911     
+ Misses      17352    11027    -6325     
+ Partials      362        0     -362
Impacted Files Coverage Δ
cpp/src/parquet/arrow/writer.h 100% <ø> (ø) ⬆️
cpp/src/parquet/arrow/schema.cc 93.21% <ø> (+2.24%) ⬆️
cpp/src/parquet/arrow/reader.h 91.66% <ø> (ø) ⬆️
cpp/src/parquet/schema.h 100% <ø> (ø) ⬆️
cpp/src/parquet/schema-internal.h 100% <ø> (+3.44%) ⬆️
r/R/arrowExports.R 71.12% <100%> (+0.29%) ⬆️
cpp/src/parquet/file_writer.cc 81.21% <100%> (-1.02%) ⬇️
r/src/arrowExports.cpp 70.91% <100%> (+0.33%) ⬆️
r/src/parquet.cpp 70.58% <100%> (+7.43%) ⬆️
r/R/parquet.R 72% <50%> (+5.33%) ⬆️
... and 305 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 803dd89...a5187c2. Read the comment docs.

@wesm
Copy link
Member Author

wesm commented Aug 1, 2019

Looks like I have to rebase. Any thoughts about this? I need to stack my patches on top

@fsaintjacques
Copy link
Contributor

@wesm I have a pending review I started yesterday night, let me complete this morning (lot to review!)

@wesm
Copy link
Member Author

wesm commented Aug 1, 2019

Okay. Quite a bit code has been moved around relatively unmodified so please refrain from commenting on old code. Thanks =)

@fsaintjacques
Copy link
Contributor

I'll try to, github review interface does not make this easy to do :)

@hatemhelal
Copy link
Contributor

I had a read through and looked good to me. I think this or your previous refactoring PR might have fixed ARROW-6005. I'll plan to use that issue to add some unittests for that particular workflow.

@@ -178,16 +685,13 @@ std::shared_ptr<ChunkedArray> CastChunksTo(
}

Status TransferDictionary(RecordReader* reader,
const std::shared_ptr<DataType>& logical_value_type,
std::shared_ptr<DataType> logical_value_type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this change from passing by const ref to value threw me a bit. Could warrant a comment? Or does it make sense to keep the signature as accepting a const ref and move this:

::arrow::dictionary(::arrow::int32(), logical_value_type)

into TransferDictionary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reverting this change, it was a leftover from mid-way through the patch

return Status::OK();
}

inline bool EndswithTuple(const std::string& str) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code was moved from elsewhere but you could replace this with boost::algorithm::ends_with

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, doing that then =)


// ----------------------------------------------------------------------
// FileReaderImpl forward declaration

std::vector<int> Arange(int length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::vector<int> Arange(int length) {
static std::vector<int> Arange(int length) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to anonymous namespace


class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
public:
explicit RowGroupRecordBatchReader(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
explicit RowGroupRecordBatchReader(
RowGroupRecordBatchReader(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Status GetColumnField(int column_index, const SchemaField** out) const {
auto it = column_index_to_field.find(column_index);
if (it == column_index_to_field.end()) {
return Status::KeyError("Column index ", column_index, "is not a leaf");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a misleading error message since non-leaf schema fields don't have indices

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky thing to have any kind of error here, since users should not see this. I am changing to say that the schema manifest may be malformed

::arrow::MemoryPool* pool, std::shared_ptr<Array>* out) {
if (field->type()->num_children() > 0 && arr->type_id() == ::arrow::Type::DICTIONARY) {
// XXX(wesm): Handling of nested types and dictionary encoding needs to be
// significantly refactored
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate? Is this in reference to #4956 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem that I'm referencing here is fixed by this patch, I will remove the comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RETURN_NOT_OK(valid_bits_builders[j]->Append(false));
null_counts[j]++;
break;
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else {
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is old code

static_cast<int32_t>(offset_builders[j + 1]->length())));
}

if (((empty_def_level[j] - 1) == def_levels[i]) && (nullable[j])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (((empty_def_level[j] - 1) == def_levels[i]) && (nullable[j])) {
if (nullable[j] && empty_def_level[j] - 1 == def_levels[i]) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old code


// Walk downwards to extract nullability
std::vector<bool> nullable;
std::vector<std::shared_ptr<::arrow::Int32Builder>> offset_builders;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be easier than offset_builders and valid_bits_builders:

MakeBuilder(pool, list(list(list(/*...*/null()/*...*/))), &root_builder);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old code

wesm added 2 commits August 1, 2019 15:45
Initial ugly refactoring

More ugly refactoring. Not quite there yet

More non-working refactoring

Schema tests pass again

Fix node filtering logic

C++ unit tests passing again, create dictionary subfields

Move symbols to reader.h

Remove bindings for removed GetSchema API

Remove API

Fix R bindings
@wesm wesm force-pushed the parquet-arrow-schema-tree branch from a5187c2 to e1f19c0 Compare August 1, 2019 21:14
@wesm
Copy link
Member Author

wesm commented Aug 1, 2019

Thank you for the feedback. I'll merge this once the build passes

@wesm
Copy link
Member Author

wesm commented Aug 1, 2019

Travis CI: https://travis-ci.org/wesm/arrow/builds/566680039
Appveyor: https://ci.appveyor.com/project/wesm/arrow/builds/26412833

The Appveyor build isn't done yet so I'll wait a little bit to make sure MinGW is ok

@wesm
Copy link
Member Author

wesm commented Aug 2, 2019

One of the Windows builds is hanging, it doesn't appear related to this patch because I found it in a build in master. I opened

https://issues.apache.org/jira/browse/ARROW-6108

Merging this in the meantime

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants