[EXP] ARROW-5052: [C++] Add IncompleteDictionaryType #4067
Not entirely sure yet this will be useful, so perhaps we shouldn't merge for now.
cpp/src/arrow/type.h
Outdated
See https://issues.apache.org/jira/browse/ARROW-3144.
Before doing more detailed review, having a new type enum here is one option (which introduces some development complexity). Another option is to have IncompleteDictionary be a subclass of DictionaryType but otherwise identify itself as DICTIONARY.
I had originally thought to have a mutable dictionary type with atomic mutations to permit delta dictionaries to update "connected" arrays. What do you think about that?
So DictionaryType would gain a virtual method to indicate whether or not the dictionary is known yet, but also allowing for the dictionary to grow through deltas
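As a rough illustration of the subclass option described above, the sketch below uses simplified stand-in classes. The names `DictType`, `IncompleteDictType`, `dictionary_known()`, and the `TypeId` enum are hypothetical and are not Arrow's actual API. It only shows how an incomplete dictionary type could still report itself as DICTIONARY while a virtual method tells callers whether the values are available.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for illustration; not Arrow's classes.
enum class TypeId { INT32, STRING, DICTIONARY };

using Values = std::vector<std::string>;
using ValuesPtr = std::shared_ptr<const Values>;

class DictType {
 public:
  DictType(TypeId index_type, TypeId value_type, ValuesPtr values)
      : index_type_(index_type),
        value_type_(value_type),
        values_(std::move(values)) {}
  virtual ~DictType() = default;

  // Both the complete and the incomplete variant identify as DICTIONARY,
  // so no new enum value is needed in this option.
  TypeId id() const { return TypeId::DICTIONARY; }
  TypeId index_type() const { return index_type_; }
  TypeId value_type() const { return value_type_; }

  // Callers must check this before touching values().
  virtual bool dictionary_known() const { return values_ != nullptr; }
  const ValuesPtr& values() const { return values_; }

 private:
  TypeId index_type_;
  TypeId value_type_;
  ValuesPtr values_;
};

// Index type and value type are known; the dictionary values are not.
class IncompleteDictType : public DictType {
 public:
  IncompleteDictType(TypeId index_type, TypeId value_type)
      : DictType(index_type, value_type, nullptr) {}
  bool dictionary_known() const override { return false; }
};
```

The trade-off raised in the thread still applies: with a shared type id, nothing forces existing code paths to consider the "values unknown" case, whereas a dedicated enum value would.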
Hmm... I'm uncomfortable with the idea of a mutable data type, especially if it can change in another thread... The benefit of a dedicated enum value is that it forces us to think about both situations (a dictionary with known values, a dictionary with unknown values).
As for delta dictionaries, I don't know. Is the "delta" something that needs to be represented at the data type level, or is it a mere array? If the latter, we simply need to expose an operation to build a new DictionaryType with concatenated values, AFAICT.
(right now there's support for delta dictionaries in DictionaryBuilder, though it's a bit ad hoc)
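To make the immutable alternative concrete, here is a minimal sketch of an operation that builds a new dictionary type from an existing one plus a delta, leaving the original untouched. `DictType` and `ConcatenateValues` are hypothetical names chosen for this example and do not correspond to existing Arrow functions; string values are assumed for simplicity.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Values = std::vector<std::string>;

// Hypothetical immutable dictionary type: the values are never modified
// after construction, so sharing across threads needs no synchronization.
struct DictType {
  std::shared_ptr<const Values> values;
};

// Returns a new type whose values are base's values followed by delta.
// Existing indices stay valid because the old values keep their positions.
// Assumes base.values is non-null.
DictType ConcatenateValues(const DictType& base, const Values& delta) {
  auto merged = std::make_shared<Values>(*base.values);
  merged->insert(merged->end(), delta.begin(), delta.end());
  return DictType{std::move(merged)};
}
```

Arrays built against the old type keep pointing at the old values, so nothing changes underneath a reader in another thread; the cost is a copy of the values on each delta.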
@xhochy @emkornfield Any opinions?
I'm trying to understand how the IPC layer works wrt/ dictionaries. If we add an optional integer id to
Edit: changed IPC code to do that (untested)
Force-pushed: c8deedb to 6b72575
Codecov Report
@@ Coverage Diff @@
## master #4067 +/- ##
==========================================
+ Coverage 87.82% 88.65% +0.82%
==========================================
Files 739 604 -135
Lines 90879 81451 -9428
Branches 1252 0 -1252
==========================================
- Hits 79817 72209 -7608
+ Misses 10941 9242 -1699
+ Partials 121 0 -121
Force-pushed: 5e33de1 to ce05e46
Force-pushed: 674e098 to 87ab076
Some of this has been split and refactored into #4113, so this PR will have to be updated.
Force-pushed: 87ab076 to 0616661
Force-pushed: 0616661 to 36499df
This allows passing information about a dictionary type with known index type and value type, but unknown dictionary values.
Force-pushed: 36499df to fc70f93
I have an alternative proposal for handling schemas before the dictionary is known, which also addresses the problem of changing dictionaries. I'll write an e-mail to the mailing list tomorrow (April 22) to discuss it.
Closing in favor of the ARROW-3144 solution.