ARROW-6042: [C++][Parquet] Add Dictionary32Builder that always returns 32-bit dictionary indices #4956

wesm · 2019-07-27T00:37:15Z

Without this, DictionaryArrays produced by different parts of a Parquet file could have different index types depending on the cardinality of each decoded part.

Refactors DictionaryBuilder (which uses AdaptiveIntBuilder) and Dictionary32Builder (which uses Int32Builder) to have a common base class.
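
To illustrate the mismatch described above, here is a minimal sketch. It is a toy model, not Arrow's code: `AdaptiveIndexBitWidth` and `FixedIndexBitWidth` are hypothetical helpers, and the assumption is that an adaptive builder picks the smallest signed integer width that holds every dictionary index, while the 32-bit builder always uses 32 bits.

```cpp
#include <cstdint>

// Toy model (assumption, not Arrow's implementation): the smallest signed
// integer width, in bits, that can hold every index 0..cardinality-1.
constexpr int AdaptiveIndexBitWidth(std::int64_t cardinality) {
  return cardinality <= (std::int64_t{1} << 7)    ? 8
         : cardinality <= (std::int64_t{1} << 15) ? 16
         : cardinality <= (std::int64_t{1} << 31) ? 32
                                                  : 64;
}

// Fixed-width policy modeled on this PR's description of Dictionary32Builder:
// indices are always 32-bit, whatever the cardinality.
constexpr int FixedIndexBitWidth(std::int64_t /*cardinality*/) { return 32; }

// Two chunks of one column, decoded separately, with 100 and 1000 distinct
// values: the adaptive policy gives them different index widths.
static_assert(AdaptiveIndexBitWidth(100) != AdaptiveIndexBitWidth(1000),
              "adaptive widths differ between the two chunks");

// The fixed policy gives both chunks the same index width.
static_assert(FixedIndexBitWidth(100) == FixedIndexBitWidth(1000),
              "fixed widths agree between the two chunks");
```

Under this model, chunks with 100 and 1000 distinct values would get 8-bit and 16-bit indices respectively with the adaptive policy, so the resulting arrays would not share a type; the fixed policy avoids that at the cost of wider indices for small dictionaries.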

codecov-io · 2019-07-27T03:33:59Z

Codecov Report

Merging #4956 into master will increase coverage by 1.62%.
The diff coverage is 97.4%.

@@            Coverage Diff             @@
##           master    #4956      +/-   ##
==========================================
+ Coverage   87.49%   89.11%   +1.62%     
==========================================
  Files         998      721     -277     
  Lines      141784   101687   -40097     
  Branches     1418        0    -1418     
==========================================
- Hits       124058    90623   -33435     
+ Misses      17364    11064    -6300     
+ Partials      362        0     -362

| Impacted Files | Coverage Δ | |
|---|---|---|
| cpp/src/parquet/encoding-test.cc | `100% <ø> (ø)` | ⬆️ |
| cpp/src/parquet/encoding.h | `97.82% <ø> (ø)` | ⬆️ |
| cpp/src/parquet/column_reader.cc | `88.93% <ø> (ø)` | ⬆️ |
| cpp/src/arrow/array-dict-test.cc | `95.52% <100%> (+0.13%)` | ⬆️ |
| cpp/src/parquet/arrow/arrow-reader-writer-test.cc | `93.71% <100%> (ø)` | ⬆️ |
| cpp/src/parquet/encoding.cc | `93.73% <100%> (ø)` | ⬆️ |
| cpp/src/arrow/array/builder_dict.h | `88.57% <95.12%> (+0.33%)` | ⬆️ |
| cpp/src/arrow/array/builder_adaptive.h | `93.75% <0%> (-3.13%)` | ⬇️ |
| go/arrow/ipc/writer.go | | |
... and 278 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 38b0176...5210b30.

xhochy

+1, LGTM

pitrou

+1. I'm curious, does this make parquet conversion faster?

wesm · 2019-07-29T14:40:52Z

@pitrou I haven't done comprehensive benchmarks yet, but the main benefit is actually memory use (see the example of a user running into runaway memory usage on a highly encoded file at https://issues.apache.org/jira/browse/ARROW-5993). Performance should be better, too. I'll run benchmarks to illustrate.

wesm · 2019-07-29T14:41:24Z

BTW @pitrou, this patch is also necessary if you want to add dictionary encoding to the CSV reader (because each chunk may have a different dictionary cardinality).

wesm added 3 commits July 26, 2019 18:58

Implement 32-bit-only variant of DictionaryBuilder, use common base class

c2e40ed

Change Parquet to use 32-bit dictionary builder

8603ded

Make implementation templates internal

5210b30

xhochy approved these changes Jul 29, 2019

pitrou approved these changes Jul 29, 2019

wesm closed this in 089e3db Jul 29, 2019

bkietz mentioned this pull request Aug 1, 2019

ARROW-6077: [C++][Parquet] Build Arrow "schema tree" from Parquet schema to help with nested data implementation #4971

Closed

asfimport mentioned this pull request Aug 1, 2019

[C++] Implement alternative DictionaryBuilder that always yields int32 indices #22445

Closed
