-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-8088: [C++][Dataset] Support dictionary partition columns #6641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@bkietz I added a python test case as well |
|
Thanks @jorisvandenbossche |
fsaintjacques
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, except for the Json representation, I think we should shoot for the most obvious which in this case is parsing primitive values with dict encoding.
cpp/src/arrow/ipc/json_simple.cc
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd expect the ArrayFromJson to just works with primitive values, e.g. ArrayFromJson(dict(uf8(), int8()), R"["horse", "cat", null]"); should just work. We don't need to encode the physical representation, unless I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to encode the indices as well to support comparison with sliced dictionary arrays. For example:
// indices ordered as if assembled with DictionaryBuilder
// from ["whiskey", "tango", "foxtrot"]
auto array_wtf = ArrayFromJSON(dictionary(int8(), utf8()), R"([
[0, ["whiskey", "tango", "foxtrot"]],
[1],
[2]
])");
// decoded values are ["tango", "foxtrot"], but indices *are not*
// as if assembled with DictionaryBuilder from ["tango", "foxtrot"]
auto array_tf = array_wtf->Slice(1);
assert(ArrayFromJSON(dictionary(int8(), utf8()), R"([
[1, ["whiskey", "tango", "foxtrot"]],
[2]
])")->Equals(array_tf));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm glancing the string example and I can't intuitively understand what it maps to.
auto dict_type = dictionary(int8(), utf8());
// Obvious with a glance
auto dict_array = ArrayFromJSON(dict_type, R"(["whiskey", "tango", "foxtrot"])");
// Require cognitive overhead and reading the decoder to understand what's happening.
auto dict_array = ArrayFromJSON(dict_type, R"([
[0, ["whiskey", "tango", "foxtrot"]],
[1],
[2]
])");When precise physical layout is needed, I recommend adding a family of methods:
DictArrayFromJSON(type, std::shared_ptr<Array> values, std::string indices_string);
DictArrayFromJSON(type, std::string values_string, std::string indices_string);I'm suggesting this because we want tests to be trivial and succinct to read. The end result is that 80% of tests which don't need exact layout will be easy to read, while the edge case will have the special method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll extract DictArrayFromJSON
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: the ArrayFromJSON behavior you describe is highly non trivial to implement, so I'll leave that for a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a553390 to
a3f3aa0
Compare
This PR necessitated adding
Scalar::Parse,MakeArrayFromScalar, andArrayFromJSONcases for dictionary type.