Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Mar 28, 2019

This allows passing information about a dictionary type with known index type and value type, but unknown dictionary values.

@pitrou
Copy link
Member Author

pitrou commented Mar 28, 2019

Not entirely sure yet this will be useful, so perhaps we shouldn't merge for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://issues.apache.org/jira/browse/ARROW-3144.

Before doing more detailed review, having a new type enum here is one option (which introduces some development complexity). Another option is to have IncompleteDictionary be a subclass of DictionaryType but otherwise identify itself as DICTIONARY.

I had originally thought to have a mutable dictionary type with atomic mutations to permit delta dictionaries to update "connected" arrays. What do you think about that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So DictionaryType would gain a virtual method to indicate whether or not the dictionary is known yet, but also allowing for the dictionary to grow through deltas

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I'm uncomfortable with the idea of a mutable data type, especially if it can change in another thread... The benefit of a dedicated enum value is that it forces us to think about both situations (a dictionary with known values, a dictionary with unknown values).

As for delta dictionaries, I don't know. Is the "delta" something that needs to be represented at the data type level, or is it a mere array? If the latter, we simply need to expose an operation to build a new DictionaryType with concatenated values, AFAICT.

(right now there's support for delta dictionaries in DictionaryBuilder, though it's a bit ad hoc)

@xhochy @emkornfield Any opinions?

@pitrou
Copy link
Member Author

pitrou commented Mar 28, 2019

I'm trying to understand how the IPC layer works wrt/ dictionaries. If we add an optional integer id to IncompleteDictionaryType to refer to the id of the forthcoming dictionary array, perhaps it can allow ipc::ReadSchema(const Message& message, ...) to parse dictionary types and return them as incomplete? (right now, it would probably error out as it uses an empty dictionary memo)

Edit: changed IPC code to do that (untested)

@pitrou pitrou changed the title ARROW-5052: [C++] Add IncompleteDictionaryType [EXP] ARROW-5052: [C++] Add IncompleteDictionaryType Mar 28, 2019
@pitrou pitrou force-pushed the ARROW-5052-incomplete-dict-type branch 4 times, most recently from c8deedb to 6b72575 Compare April 2, 2019 11:48
@codecov-io
Copy link

Codecov Report

Merging #4067 into master will increase coverage by 0.82%.
The diff coverage is 90.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4067      +/-   ##
==========================================
+ Coverage   87.82%   88.65%   +0.82%     
==========================================
  Files         739      604     -135     
  Lines       90879    81451    -9428     
  Branches     1252        0    -1252     
==========================================
- Hits        79817    72209    -7608     
+ Misses      10941     9242    -1699     
+ Partials      121        0     -121
Impacted Files Coverage Δ
cpp/src/arrow/type_traits.h 92.95% <ø> (ø) ⬆️
cpp/src/arrow/visitor.h 50% <ø> (ø) ⬆️
cpp/src/arrow/ipc/json-internal.cc 93.28% <0%> (+0.21%) ⬆️
cpp/src/arrow/python/numpy_to_arrow.cc 95.41% <0%> (+0.81%) ⬆️
cpp/src/arrow/python/arrow_to_pandas.cc 91.37% <0%> (+0.18%) ⬆️
cpp/src/gandiva/expression_registry.cc 81.9% <0%> (ø) ⬆️
cpp/src/arrow/visitor.cc 0% <0%> (ø) ⬆️
cpp/src/arrow/type-test.cc 99.26% <100%> (+0.14%) ⬆️
cpp/src/arrow/ipc/read-write-test.cc 99.82% <100%> (ø) ⬆️
cpp/src/arrow/array-dict-test.cc 100% <100%> (ø) ⬆️
... and 144 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f9e21ae...5e33de1. Read the comment docs.

@pitrou pitrou force-pushed the ARROW-5052-incomplete-dict-type branch from 5e33de1 to ce05e46 Compare April 3, 2019 12:39
@pitrou pitrou changed the title [EXP] ARROW-5052: [C++] Add IncompleteDictionaryType [EXP] [WIP] ARROW-5052: [C++] Add IncompleteDictionaryType Apr 3, 2019
@pitrou pitrou force-pushed the ARROW-5052-incomplete-dict-type branch from 674e098 to 87ab076 Compare April 4, 2019 10:02
@pitrou
Copy link
Member Author

pitrou commented Apr 8, 2019

Some of this has been split and refactored into #4113, so this PR will have to be updated.

@pitrou pitrou force-pushed the ARROW-5052-incomplete-dict-type branch from 87ab076 to 0616661 Compare April 9, 2019 11:20
@pitrou pitrou changed the title [EXP] [WIP] ARROW-5052: [C++] Add IncompleteDictionaryType [EXP] ARROW-5052: [C++] Add IncompleteDictionaryType Apr 9, 2019
@pitrou pitrou force-pushed the ARROW-5052-incomplete-dict-type branch from 0616661 to 36499df Compare April 9, 2019 21:05
This allows passing information about a dictionary type with
known index type and value type, but unknown dictionary values.
@pitrou pitrou force-pushed the ARROW-5052-incomplete-dict-type branch from 36499df to fc70f93 Compare April 9, 2019 21:06
@wesm
Copy link
Member

wesm commented Apr 22, 2019

I have an alternative proposal about handling schemas before the dictionary is known, that also addresses the problem of changing dictionaries. I'll write an e-mail to the mailing list to discuss tomorrow (April 22)

@wesm
Copy link
Member

wesm commented May 22, 2019

Closing in favor of ARROW-3144 solution

@wesm wesm closed this May 22, 2019
@pitrou pitrou deleted the ARROW-5052-incomplete-dict-type branch May 22, 2019 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants