PARQUET-687: C++: Switch to PLAIN encoding if dictionary grows too large #157

majetideepak · 2016-09-12T16:53:25Z

Implemented dictionary fallback encoding
Added tests
Added a fast path to serialize data pages

majetideepak · 2016-09-12T16:54:51Z

src/parquet/column/column-writer-test.cc

-  int64_t metadata_num_values() const { return metadata_accessor_->num_values(); }
+  int64_t metadata_num_values() {
+    auto metadata_accessor =
+        ColumnChunkMetaData::Make(reinterpret_cast<uint8_t*>(&thrift_metadata_));


Metadata accessor needs to be built lazily.

Put this in a comment? (Also, why must it be lazily constructed?)

wesm · 2016-09-13T03:34:14Z

This looks good. I think that even for integer / float data with a lot of repeated values, dictionary encoding can have a lot of perf benefits. It will be interesting to do some benchmarking to see how much.

xhochy · 2016-09-13T11:20:38Z

src/parquet/column/writer.cc

+void TypedColumnWriter<Type>::VerifyDictionaryFallback() {
+  auto dict_encoder = static_cast<DictEncoder<Type>*>(current_encoder_.get());
+  if (dict_encoder->dict_encoded_size() >= properties_->dictionary_pagesize()) {
+    WriteDictionaryPage();


Could we extract the following lines into a function FlushBufferedPages ?
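
For readers following the thread, the fallback idea in this hunk can be sketched in isolation. This is a simplified illustration, not the parquet-cpp implementation: `SketchColumnWriter`, `SketchEncoding`, and the byte limit parameter are hypothetical names. It models only the decision to stop dictionary-encoding once the dictionary reaches a size limit, and omits writing the dictionary page and flushing buffered pages.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical encodings for this sketch (not the parquet-cpp Encoding enum).
enum class SketchEncoding { PLAIN_DICTIONARY, PLAIN };

// Hypothetical, simplified int32 column writer that starts with dictionary
// encoding and switches to plain encoding once the dictionary is too large.
class SketchColumnWriter {
 public:
  explicit SketchColumnWriter(std::size_t dict_size_limit_bytes)
      : dict_size_limit_bytes_(dict_size_limit_bytes) {}

  void Write(int32_t value) {
    if (encoding_ == SketchEncoding::PLAIN) {
      // After fallback, values are stored directly.
      plain_values_.push_back(value);
      return;
    }
    // Dictionary path: look up the value, adding it if it is new.
    auto it = dict_index_.find(value);
    if (it == dict_index_.end()) {
      const auto next_index = static_cast<int32_t>(dict_values_.size());
      it = dict_index_.emplace(value, next_index).first;
      dict_values_.push_back(value);
    }
    indices_.push_back(it->second);
    // Fall back for all later values once the dictionary reaches the limit.
    if (dict_values_.size() * sizeof(int32_t) >= dict_size_limit_bytes_) {
      encoding_ = SketchEncoding::PLAIN;
    }
  }

  SketchEncoding encoding() const { return encoding_; }
  std::size_t num_dictionary_entries() const { return dict_values_.size(); }
  std::size_t num_indices() const { return indices_.size(); }
  std::size_t num_plain_values() const { return plain_values_.size(); }

 private:
  std::size_t dict_size_limit_bytes_;
  SketchEncoding encoding_ = SketchEncoding::PLAIN_DICTIONARY;
  std::unordered_map<int32_t, int32_t> dict_index_;  // value -> dictionary index
  std::vector<int32_t> dict_values_;                 // dictionary contents
  std::vector<int32_t> indices_;                     // dictionary-encoded values
  std::vector<int32_t> plain_values_;                // values written after fallback
};
```

In this sketch, values written before the switch stay dictionary-encoded and later values are plain, which is consistent with the test in this PR expecting more than one encoding in the column chunk metadata.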

majetideepak · 2016-09-13T15:53:46Z

src/parquet/column/column-writer-test.cc

  if (this->type_num() != Type::BOOLEAN) {
-    // There are 3 encodings (RLE, PLAIN_DICTIONARY, PLAIN) in a fallback case
-    ASSERT_EQ(3, this->metadata_num_encodings());
+    ASSERT_EQ(Encoding::PLAIN_DICTIONARY, encodings[1]);


This comment to explicitly verify the encodings did uncover a bug in the metadata writer.

wesm · 2016-09-14T02:53:37Z

src/parquet/column/writer.h

  virtual std::shared_ptr<Buffer> GetValuesBuffer() = 0;
+  /**
+   * Serializes Dictionary Page if enabled
+   */


Comment style is inconsistent, since we have been using // we should probably stick to that for now (https://google.github.io/styleguide/cppguide.html#Comment_Style).

wesm · 2016-09-15T04:04:08Z

+1, thank you!

lomereiter · 2016-09-15T07:36:02Z

src/parquet/column/writer.h

+  for (int round = 0; round < num_batches; round++) {
+    int64_t offset = round * write_batch_size;
+    WriteMiniBatch(
+        write_batch_size, &def_levels[offset], &rep_levels[offset], &values[offset]);


Too late but: values don't include nulls, so this chunking can easily lead to a segfault in case nulls are present in the full batch. (Tried to rebase my PR after this one was merged, got failing tests)

You're right, sorry I missed that. WriteMiniBatch needs to return the value offset. @majetideepak can you open a JIRA? Thanks

I will fix this and add a test case immediately. Sorry for missing this.

JIRA: https://issues.apache.org/jira/browse/PARQUET-719
and
PR: #160
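
The problem described in this thread can be illustrated with a small sketch. It is not the fix that landed in PR #160: `SketchWriteMiniBatch` and `SketchWriteBatch` are hypothetical names, repetition levels are left out for brevity, and values are appended to a plain `std::vector` instead of an encoder. The point it shows is that the level offset and the value offset must advance separately, because a value is present only where the definition level equals the column's maximum definition level.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for WriteMiniBatch. It consumes `num_levels` definition
// levels but only as many values as there are non-null entries, and returns
// the number of values consumed so the caller can advance its value offset.
int64_t SketchWriteMiniBatch(int64_t num_levels, const int16_t* def_levels,
                             int16_t max_def_level, const int32_t* values,
                             std::vector<int32_t>* sink) {
  int64_t values_to_write = 0;
  for (int64_t i = 0; i < num_levels; ++i) {
    if (def_levels[i] == max_def_level) ++values_to_write;
  }
  sink->insert(sink->end(), values, values + values_to_write);
  return values_to_write;
}

// Hypothetical batch driver. Levels advance by the batch size; values advance
// by whatever each mini batch reports, never by `round * write_batch_size`.
int64_t SketchWriteBatch(const std::vector<int16_t>& def_levels,
                         int16_t max_def_level,
                         const std::vector<int32_t>& values,
                         int64_t write_batch_size,
                         std::vector<int32_t>* sink) {
  const int64_t num_levels = static_cast<int64_t>(def_levels.size());
  int64_t value_offset = 0;
  for (int64_t level_offset = 0; level_offset < num_levels;
       level_offset += write_batch_size) {
    const int64_t batch = std::min(write_batch_size, num_levels - level_offset);
    value_offset += SketchWriteMiniBatch(batch, def_levels.data() + level_offset,
                                         max_def_level,
                                         values.data() + value_offset, sink);
  }
  return value_offset;
}
```

With six definition levels `{1, 0, 1, 1, 0, 1}`, four values, and a batch size of 2, indexing `values` by the level offset would read at index 4 in the third batch, past the end of the four-element array. Tracking the returned value count keeps the reads in bounds.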

Deepak Majeti added 4 commits September 12, 2016 11:42

added dictionary fallback support with tests

84f360d

clang format

54af38a

Add all types to the test

dd0cc7e

minor changes

312bad8

majetideepak reviewed Sep 12, 2016

xhochy reviewed Sep 13, 2016

added comments and fixed review suggestions

da46033

majetideepak reviewed Sep 13, 2016

clang format

eac9114

wesm reviewed Sep 14, 2016

Deepak Majeti added 2 commits September 14, 2016 10:15

modify comment style

c498aeb

minor comment fix

6f51df6

asfgit closed this in c6f5ebe Sep 15, 2016

lomereiter reviewed Sep 15, 2016


majetideepak deleted the PARQUET-717 branch September 15, 2016 15:37

asfimport mentioned this pull request Jun 23, 2024

[C++][Parquet] Fix WriterBatch API to handle NULL values apache/arrow#42536

Closed
