@wesm
Member

@wesm wesm commented Jul 27, 2019

Without this, DictionaryArrays produced by different parts of a Parquet file could have different index types depending on the cardinality of each decoded part.

Refactors DictionaryBuilder (which uses AdaptiveIntBuilder) and Dictionary32Builder (which uses Int32Builder) to have a common base class.
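
To illustrate the problem described above, here is a minimal, self-contained sketch. It is a toy model and not Arrow's actual builder code: `AdaptiveIndexType`, `FixedIndexType`, and `ChunksAgree` are hypothetical helper names, and the width thresholds assume signed indices chosen as narrowly as possible. It shows how an adaptive index width can differ between chunks of different cardinality, while a fixed 32-bit width cannot.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model, NOT Arrow's implementation: returns the name of the narrowest
// signed integer type able to hold every index in [0, cardinality - 1].
std::string AdaptiveIndexType(std::size_t cardinality) {
  if (cardinality <= 128) return "int8";    // largest index is 127
  if (cardinality <= 32768) return "int16"; // largest index is 32767
  if (cardinality <= 2147483648ULL) return "int32";
  return "int64";
}

// Fixed-width policy: every chunk uses int32 regardless of cardinality.
std::string FixedIndexType(std::size_t /*cardinality*/) { return "int32"; }

// Returns true when every chunk gets the same index type under `policy`.
template <typename Policy>
bool ChunksAgree(const std::vector<std::size_t>& cardinalities, Policy policy) {
  for (std::size_t i = 1; i < cardinalities.size(); ++i) {
    if (policy(cardinalities[i]) != policy(cardinalities[0])) return false;
  }
  return true;
}
```

Under these assumptions, two decoded parts with 100 and 1000 distinct values get `int8` and `int16` indices under the adaptive policy, so `ChunksAgree({100, 1000}, AdaptiveIndexType)` is false, whereas the same call with `FixedIndexType` is true.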

@codecov-io

Codecov Report

Merging #4956 into master will increase coverage by 1.62%.
The diff coverage is 97.4%.

@@            Coverage Diff             @@
##           master    #4956      +/-   ##
==========================================
+ Coverage   87.49%   89.11%   +1.62%     
==========================================
  Files         998      721     -277     
  Lines      141784   101687   -40097     
  Branches     1418        0    -1418     
==========================================
- Hits       124058    90623   -33435     
+ Misses      17364    11064    -6300     
+ Partials      362        0     -362
| Impacted Files | Coverage Δ |
| --- | --- |
| cpp/src/parquet/encoding-test.cc | 100% <ø> (ø) ⬆️ |
| cpp/src/parquet/encoding.h | 97.82% <ø> (ø) ⬆️ |
| cpp/src/parquet/column_reader.cc | 88.93% <ø> (ø) ⬆️ |
| cpp/src/arrow/array-dict-test.cc | 95.52% <100%> (+0.13%) ⬆️ |
| cpp/src/parquet/arrow/arrow-reader-writer-test.cc | 93.71% <100%> (ø) ⬆️ |
| cpp/src/parquet/encoding.cc | 93.73% <100%> (ø) ⬆️ |
| cpp/src/arrow/array/builder_dict.h | 88.57% <95.12%> (+0.33%) ⬆️ |
| cpp/src/arrow/array/builder_adaptive.h | 93.75% <0%> (-3.13%) ⬇️ |
| go/arrow/ipc/writer.go | |
| ... and 278 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 38b0176...5210b30.

Member

@xhochy xhochy left a comment

+1, LGTM

Member

@pitrou pitrou left a comment

+1. I'm curious, does this make parquet conversion faster?

@wesm
Member Author

wesm commented Jul 29, 2019

@pitrou I haven't run comprehensive benchmarks yet, but the main benefit is actually memory use (see the example of a user running into runaway memory usage on a highly encoded file at https://issues.apache.org/jira/browse/ARROW-5993). Performance should be better, too. I'll run benchmarks to illustrate.

@wesm
Member Author

wesm commented Jul 29, 2019

BTW @pitrou, this patch is also necessary if you want to add dictionary encoding to the CSV reader (because each chunk may have a different dictionary cardinality).

4 participants