Skip to content
This repository was archived by the owner on May 10, 2024. It is now read-only.

Conversation

@majetideepak
Copy link

Implemented dictionary fallback encoding
Added tests
Added a fast path to serialize data pages

int64_t metadata_num_values() const { return metadata_accessor_->num_values(); }
int64_t metadata_num_values() {
auto metadata_accessor =
ColumnChunkMetaData::Make(reinterpret_cast<uint8_t*>(&thrift_metadata_));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metadata accessor needs to be built lazily.

Copy link
Member

@wesm wesm Sep 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this in a comment? (Also, why must it be lazily constructed?)

@wesm
Copy link
Member

wesm commented Sep 13, 2016

This looks good. I think for even integer / float data with a lot of repeated values that dictionary encoding can have a lot of perf benefits. Will be interesting to do some benchmarking to see how much

void TypedColumnWriter<Type>::VerifyDictionaryFallback() {
auto dict_encoder = static_cast<DictEncoder<Type>*>(current_encoder_.get());
if (dict_encoder->dict_encoded_size() >= properties_->dictionary_pagesize()) {
WriteDictionaryPage();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we extract the following lines into a function FlushBufferedPages ?

if (this->type_num() != Type::BOOLEAN) {
// There are 3 encodings (RLE, PLAIN_DICTIONARY, PLAIN) in a fallback case
ASSERT_EQ(3, this->metadata_num_encodings());
ASSERT_EQ(Encoding::PLAIN_DICTIONARY, encodings[1]);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment to explicitly verify the encodings did uncover a bug in the metadata writer

virtual std::shared_ptr<Buffer> GetValuesBuffer() = 0;
/**
* Serializes Dictionary Page if enabled
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment style is inconsistent, since we have been using // we should probably stick to that for now (https://google.github.io/styleguide/cppguide.html#Comment_Style).

@wesm
Copy link
Member

wesm commented Sep 15, 2016

+1, thank you!

@asfgit asfgit closed this in c6f5ebe Sep 15, 2016
for (int round = 0; round < num_batches; round++) {
int64_t offset = round * write_batch_size;
WriteMiniBatch(
write_batch_size, &def_levels[offset], &rep_levels[offset], &values[offset]);
Copy link
Contributor

@lomereiter lomereiter Sep 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too late but: values don't include nulls, so this chunking can easily lead to a segfault in case nulls are present in the full batch. (Tried to rebase my PR after this one was merged, got failing tests)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, sorry I missed that. WriteMiniBatch needs to return the value offset. @majetideepak can you open a JIRA? Thanks

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this and add a test case immediately. Sorry for missing this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants