-
Notifications
You must be signed in to change notification settings - Fork 191
PARQUET-687: C++: Switch to PLAIN encoding if dictionary grows too large #157
Conversation
| int64_t metadata_num_values() const { return metadata_accessor_->num_values(); } | ||
| int64_t metadata_num_values() { | ||
| auto metadata_accessor = | ||
| ColumnChunkMetaData::Make(reinterpret_cast<uint8_t*>(&thrift_metadata_)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metadata accessor needs to be built lazily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put this in a comment? (Also, why must it be lazily constructed?)
|
This looks good. I think for even integer / float data with a lot of repeated values that dictionary encoding can have a lot of perf benefits. Will be interesting to do some benchmarking to see how much |
| void TypedColumnWriter<Type>::VerifyDictionaryFallback() { | ||
| auto dict_encoder = static_cast<DictEncoder<Type>*>(current_encoder_.get()); | ||
| if (dict_encoder->dict_encoded_size() >= properties_->dictionary_pagesize()) { | ||
| WriteDictionaryPage(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we extract the following lines into a function FlushBufferedPages ?
| if (this->type_num() != Type::BOOLEAN) { | ||
| // There are 3 encodings (RLE, PLAIN_DICTIONARY, PLAIN) in a fallback case | ||
| ASSERT_EQ(3, this->metadata_num_encodings()); | ||
| ASSERT_EQ(Encoding::PLAIN_DICTIONARY, encodings[1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment to explicitly verify the encodings did uncover a bug in the metadata writer
src/parquet/column/writer.h
Outdated
| virtual std::shared_ptr<Buffer> GetValuesBuffer() = 0; | ||
| /** | ||
| * Serializes Dictionary Page if enabled | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment style is inconsistent, since we have been using // we should probably stick to that for now (https://google.github.io/styleguide/cppguide.html#Comment_Style).
|
+1, thank you! |
| for (int round = 0; round < num_batches; round++) { | ||
| int64_t offset = round * write_batch_size; | ||
| WriteMiniBatch( | ||
| write_batch_size, &def_levels[offset], &rep_levels[offset], &values[offset]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too late but: values don't include nulls, so this chunking can easily lead to a segfault in case nulls are present in the full batch. (Tried to rebase my PR after this one was merged, got failing tests)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, sorry I missed that. WriteMiniBatch needs to return the value offset. @majetideepak can you open a JIRA? Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix this and add a test case immediately. Sorry for missing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JIRA: https://issues.apache.org/jira/browse/PARQUET-719
and
PR: #160
Implemented dictionary fallback encoding
Added tests
Added a fast path to serialize data pages