ARROW-590: Add integration tests for Union types #987
Conversation
…nknown null count Closes apache#747
Author: Max Risuhin <risuhin.max@gmail.com> Closes apache#749 from MaxRis/ARROW-742 and squashes the following commits: 5479b3f [Max Risuhin] ARROW-742: [C++] std::wstring_convert exceptions handling
cc @BryanCutler; this ended up being rather tedious. As one uncertainty: I wasn't sure what to write for the `typeLayout` field for dictionary fields. We don't use this at all in C++, so let me know what you need in Java (if anything) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#750 from wesm/ARROW-460 and squashes the following commits: 36d6019 [Wes McKinney] Put dictionaries in top level of json object 92e2a95 [Wes McKinney] Some debugging help, get test suite passing 15b8fc6 [Wes McKinney] Schema read correct, but array reads incorrect 6581bd6 [Wes McKinney] Cleaning up JSON ArrayReader to use inline visitors. Progress 4411566 [Wes McKinney] Misc fixes, dictionary schema roundtrip not complete yet 5958022 [Wes McKinney] Draft JSON roundtrip with dictionaries, not yet tested
Closes apache#751 Change-Id: I7afa8a65d4ad1180ff7f8b85da635a68f04e62e3
Author: Kengo Seki <sekikn@apache.org> Closes apache#753 from sekikn/ARROW-1117 and squashes the following commits: 5411349 [Kengo Seki] ARROW-1117: [Docs] Minor issues in GLib README
This has been deployed preliminarily for the release. I have not yet updated the subproject documentations yet; help would be appreciated Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#755 from wesm/ARROW-1118 and squashes the following commits: 0bdac0e [Wes McKinney] Website updates for 0.4.1
Closes apache#757 Change-Id: Idb147f5681e725f8852108ae3a1b870afc14973c
Author: Max Risuhin <risuhin.max@gmail.com> Closes apache#756 from MaxRis/ARROW-1096 and squashes the following commits: a864fe7 [Max Risuhin] [C++] CreateFileMapping maximum size calculation issue
Change-Id: Ic221871d3fbd1c51af1544471d80b632e135c7f6
While we could still build with NumPy>=1.9 for Python 2, Python 3 builds require >= 1.10 due to a bug in the C-headers. Change-Id: I0f9e0ad72e4ce4b1c6b44883d5781347d33f7e5b Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#758 from xhochy/ARROW-1124 and squashes the following commits: 5fff1ea [Uwe L. Korn] ARROW-1124: Increase numpy dependency to >=1.10.x
…and FindGTest Windows issues. Author: Max Risuhin <risuhin.max@gmail.com> Closes apache#759 from MaxRis/ARROW-784 and squashes the following commits: 358a9cc [Max Risuhin] ARROW-742: [C++] Use gflags from toolchain; Resolve cmake FindGFlags and FindGTest Windows issues.
Change-Id: I103cc31e1400491713b160b429c5491e8c82f810 Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#760 from xhochy/ARROW-1081 and squashes the following commits: e012438 [Uwe L. Korn] ARROW-1081: Fill null_bitmap correctly in TestBase
Author: Kengo Seki <sekikn@apache.org> Closes apache#762 from sekikn/ARROW-1128 and squashes the following commits: 2179516 [Kengo Seki] ARROW-1128: [Docs] command to build a wheel is not properly rendered
Minor regression introduced in apache#759 Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#763 from wesm/ARROW-1129 and squashes the following commits: afc4bb6 [Wes McKinney] Fix gflags library name being searched for in toolchain builds on Linux/macOS
Will merge on green build to unbreak Travis but it would be very helpful if a Java expert can review this change. Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#765 from xhochy/fix-jdk and squashes the following commits: 262f6f9 [Uwe L. Korn] Switch to openjdk7 751444a [Uwe L. Korn] Add JDK 7 to package list
Change-Id: Ib03392431851773df3b59b2c9a4d9a7bd672d2cb Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#761 from xhochy/ARROW-1123 and squashes the following commits: 2c75f56 [Uwe L. Korn] Use shared jemalloc always if available 2205586 [Uwe L. Korn] Run parquet tests in manylinux1 build 97f77d0 [Uwe L. Korn] Add pthread to static dependencies 7d01e9e [Uwe L. Korn] Cpplint bf478f1 [Uwe L. Korn] Fix allocator 4fbc7ba [Uwe L. Korn] Correct small allocations c3bacc0 [Uwe L. Korn] Revert "Also link static libs to librt" 44f0cfc [Uwe L. Korn] Also link static libs to librt bac694d [Uwe L. Korn] Also link static libs to librt 35212bc [Uwe L. Korn] Remove obsolete import check 4b714a1 [Uwe L. Korn] Don't force optional that is no longer there e004150 [Uwe L. Korn] Only search the for pthread library, not the headers e65d0d1 [Uwe L. Korn] ARROW-1123: Make jemalloc the default allocator
This supersedes apache#467 This is ready for review. Next steps are - Integration with the arrow CI - Write docs on how to use the object store There is one remaining compilation error (it doesn't find Python.h for one of the Travis configurations, if anybody has an idea on what is going on, let me know). Author: Philipp Moritz <pcmoritz@gmail.com> Author: Robert Nishihara <robertnishihara@gmail.com> Closes apache#742 from pcmoritz/plasma-store-2 and squashes the following commits: c100a45 [Philipp Moritz] fixes d67160c [Philipp Moritz] build dlmalloc with -O3 16d1f71 [Philipp Moritz] fix test hanging 0f321e1 [Philipp Moritz] try to fix tests 80f9df4 [Philipp Moritz] make format 4c474d7 [Philipp Moritz] run plasma_store from the right directory 85aa171 [Philipp Moritz] fix mac tests 61d421b [Philipp Moritz] fix formatting 4497e33 [Philipp Moritz] fix tests 00f17f2 [Philipp Moritz] fix licenses 8143792 [Philipp Moritz] fix linting 5370ae0 [Philipp Moritz] fix plasma protocol a137e78 [Philipp Moritz] more fixes b36c6aa [Philipp Moritz] fix fling.cc 214c426 [Philipp Moritz] fix eviction policy e7badc4 [Philipp Moritz] fix python extension 6432d3f [Philipp Moritz] fix formatting b21f081 [Philipp Moritz] fix remaining comments about client 27f9c9e [Philipp Moritz] fix formatting 7b08fd2 [Philipp Moritz] replace ObjectID pass by value with pass by const reference and fix const correctness ca80e9a [Philipp Moritz] remove plain pointer in plasma client, part II 627b7c7 [Philipp Moritz] fix python extension name 30bd68b [Philipp Moritz] remove plain pointer in plasma client, part I 77d9822 [Philipp Moritz] put all the object code into a common library 0fdd4cd [Philipp Moritz] link libarrow.a and remove hardcoded optimization flags 8daea69 [Philipp Moritz] fix includes according to google styleguide 65ac743 [Philipp Moritz] remove offending c++ flag from c flags 7003a4a [Philipp Moritz] fix valgrind test by setting working directory 217ff3d [Philipp Moritz] add valgrind 
heuristic 9c703c2 [Philipp Moritz] integrate client tests 9e5ae0e [Philipp Moritz] port serialization tests to gtest 0b8593d [Robert Nishihara] Port change from Ray. Change listen backlog size from 5 to 128. b9a5a06 [Philipp Moritz] fix includes ed680f9 [Philipp Moritz] reformat the code f40f85b [Philipp Moritz] add clang-format exceptions d6e60d2 [Philipp Moritz] do not compile plasma on windows f936adb [Philipp Moritz] build plasma python client only if python is available e11b0e8 [Philipp Moritz] fix pthread 74ecb19 [Philipp Moritz] don't link against Python libraries b1e0335 [Philipp Moritz] fix linting 7f7e7e7 [Philipp Moritz] more linting 79ea0ca [Philipp Moritz] fix clang-tidy 99420e8 [Philipp Moritz] add rat exceptions 6cee1e2 [Philipp Moritz] fix c93034f [Philipp Moritz] add Apache 2.0 headers 6372913 [Philipp Moritz] fix malloc? 99537c9 [Philipp Moritz] fix compiler warnings cb3f3a3 [Philipp Moritz] compile C files with CMAKE_C_FLAGS e649c2a [Philipp Moritz] fix compilation 04c2edb [Philipp Moritz] add missing file 51ab963 [Philipp Moritz] fix compiler warnings 9ef7f41 [Philipp Moritz] make the plasma store compile e9f9bb4 [Philipp Moritz] Initial commit of the plasma store. Contributors: Philipp Moritz, Robert Nishihara, Richard Shin, Stephanie Wang, Alexey Tumanov, Ion Stoica @ RISElab, UC Berkeley (2017) [from ray-project/ray@b94b4a3]
Author: Phillip Cloud <cpcloud@gmail.com> Closes apache#767 from cpcloud/ARROW-1140 and squashes the following commits: b046215 [Phillip Cloud] ARROW-1140: [C++] Allow optional build of plasma
Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#723 from xhochy/ARROW-1073 and squashes the following commits: 5bab9c2 [Uwe L. Korn] ARROW-1073: C++: Adapative integer builder
Change-Id: Ib815d3fa42f0a0ae6c0d9850e9b0b435bad1c331 Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#764 from xhochy/ARROW-1137 and squashes the following commits: 59c0df8 [Uwe L. Korn] Remove unused variables 1d11513 [Uwe L. Korn] ARROW-1137: Python: Ensure Pandas roundtrip of all-None column
Note that part of this problem was related to the fix I made in https://github.com/apache/parquet-cpp/pull/358/files#diff-2f5ceebd1726b16db561185cc620d18e Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#773 from xhochy/ARROW-1039 and squashes the following commits: 44a002a [Uwe L. Korn] ARROW-1039: Python: Remove duplicate column
Change-Id: Ib18dc6b00c9806aaf541c61cb63673ac51b0525c Author: Uwe L. Korn <uwelk@xhochy.com> Closes apache#772 from xhochy/ARROW-1143 and squashes the following commits: eff4430 [Uwe L. Korn] ARROW-1143: C++: Fix comparison of NullArray
Author: Phillip Cloud <cpcloud@gmail.com> Closes apache#774 from cpcloud/ARROW-1144 and squashes the following commits: a79aeba [Phillip Cloud] ARROW-1144: [C++] Remove unused variable
An alternative would be to patch dlmalloc similarly to https://github.com/greg7mdp/dlmalloc Author: Philipp Moritz <pcmoritz@gmail.com> Closes apache#769 from pcmoritz/dlmalloc-fix and squashes the following commits: 6602c7f [Philipp Moritz] clean up 672ec2e [Philipp Moritz] fix on g++ fd91ccd [Philipp Moritz] silence dlmalloc warnings on clang-4.0
Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#770 from wesm/ARROW-1136 and squashes the following commits: 6ae5cd8 [Wes McKinney] Centralize null checking bc3ec20 [Wes McKinney] Add null checks for invalid streams
…ntaining duplicate values to parquet Author: Phillip Cloud <cpcloud@gmail.com> Closes apache#768 from cpcloud/ARROW-1132 and squashes the following commits: 4b42f64 [Phillip Cloud] Add test for parquet roundtripping with dups 49684fd [Phillip Cloud] ARROW-1132: [Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet
Author: Phillip Cloud <cpcloud@gmail.com> Closes apache#777 from cpcloud/ARROW-1146 and squashes the following commits: f97de96 [Phillip Cloud] ARROW-1146: Add .gitignore for *_generated.h files in src/plasma/format
cc @wesm The current state of this PR is that it passes all Java unit tests and integration tests. The rest of the work is in the TODO list at the top. Can you help get the C++ integration test to pass with the two example files (union.json, union_with_no_type_id.json)?

@wesm I also added you as a collaborator so you can modify this branch.

Cool, thanks for looking at this. I will look at the C++ failure tomorrow, with any luck.

Having a look at this tomorrow (Friday). Today got taken up by JupyterCon.

I added a generator for unions in integration_test.py.

Great; I wasn't able to get to this yesterday, but I will as soon as I can.
wesm left a comment:
@icexelloss I fixed the C++ integration test. The C++ -> Java fails, and it appears that what Java is parsing from JSON is malformed. The length of the UnionVector appears to be zero. I stepped into the failure in a Java debugger and the binary data from the C++-generated Arrow file appears well-formed (has length 10 in the union.json case) but the JSON-parsed vector has length 0.
```java
@Override
public void loadFieldBuffers(ArrowFieldNode fieldNode, List<ArrowBuf> ownBuffers) {
  // TODO: Check if need to truncate bits vector
```
This should be similar to the code in struct?
Struct doesn't do any buffer truncating:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java#L94
However, I am not sure about the original purpose of the buffer truncating here (and in other vectors); maybe @siddharthteotia can help explain?
```java
MapVector internalMap;
UInt1Vector typeVector;
final BitVector bits;
```
Is `bits` the name used in other vectors? If so, it might be worth renaming it to be clearer across the board.
We should also discuss whether a Union needs to be nullable. I think it probably does, for consistency. cc @jacques-n
bits is used in: https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/NullableValueVectors.java#L57
I can rename it, but that's probably cleaner in a separate PR?
The existing UnionVector already supports the nullable interface, so I kept it unchanged. There is another restriction in the existing UnionVector implementation: the child vectors must be nullable vectors.
Gotcha. I presume in existing Java use (e.g. in Dremio) that their unions are non-nullable, and nullability is addressed in the children. In general I think we need the ability to be nullable at the parent union level
Yeah, actually the existing UnionVector is already nullable at the parent level, but it uses `type[i] == 0` to check for null.
It seems sub-optimal to keep validity maps in both the parent vector and the child vectors, but the existing UnionVector Java interface already returns nullable child vectors, so changing it would break compatibility.
@jacques-n @siddharthteotia @StevenMPhillips we're going to need some eyes on these issues -- we should try to make the Union implementation consistent with the Arrow format spec without breaking existing downstream applications (or their binary format). I think that's possible but we will have to be careful
Thanks @wesm. I will investigate the JSON reader issue and why it didn't get caught by https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java#L93

@wesm, I fixed the Java JsonReader and the integration test is passing for unions with type ids. Unions without type ids are failing because of a small schema issue: the Arrow file produced by C++ seems to have typeIds = [0, 1] even though the source JSON file has typeIds = []. Can you take a look? Otherwise it looks good.

The empty list of type ids is a bit of an eyesore. Why does that break the Java side? In my opinion, it should always be there.

I don't like the empty list either. I only added it because it's optional. How about we always have type ids set? It makes things easier. I can remove the empty type id test case.
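For illustration, "always have type ids set" would mean a union field in the integration-test JSON always carries an explicit typeIds array, never an empty one. The exact field layout below is an assumption sketched from this discussion, not the authoritative Arrow JSON integration format:

```python
import json

# Hypothetical union field metadata with typeIds always populated.
# Layout is illustrative only; the real integration format may differ.
field = json.loads("""
{
  "name": "union",
  "type": {"name": "union", "mode": "SPARSE", "typeIds": [0, 1]},
  "children": []
}
""")
print(field["type"]["typeIds"])  # [0, 1]
```

With the list always present, a reader never has to special-case a missing or empty typeIds entry.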
All subtasks done. @siddharthteotia @StevenMPhillips @jacques-n This PR changes the Java UnionVector to match the Arrow spec. The integration test with C++ passes after the change. Please take a look, thank you.

Awesome, thanks @icexelloss! Since unions are used in Dremio, we'll have to wait for feedback on the changes to make sure this doesn't impact the downstream use.
```java
private final List<BufferBacked> innerVectors;

public UnionVector(String name, BufferAllocator allocator, CallBack callBack) {

private int[] typeIds;
```
@icexelloss, can you please provide insight into what you're trying to solve here?
I believe this is about the fixed type-code mapping that the format specification provides for, e.g.

type 0 -> 5
type 1 -> 10
type 2 -> 15

I think the original motivation was to be able to utilize fixed type codes for certain data types. In C++ you can see the arrow::UnionType constructor takes a vector of type codes: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L531
Wes is correct. Before the change, the Java class assumed the type id to be the int value of the enum MinorType, which breaks interoperability with C++.
I haven't spent the time to understand this yet, but to clarify, do you believe:
- the Java vector has an invalid representation?
- or: the Java vector assumes a certain possible subset of cases?
The fact that the Java user layer uses a specific type mapping (rather than supporting every type mapping) isn't in itself a violation of the spec. Or am I missing something?
The second. I believe the Java vector assumes a certain possible subset of the cases: specifically, it assumes the typeIds are the ordinal values of the enum MinorType. If the Java class consumes a union vector that happens to use the ordinal values of the enum MinorType for its typeIds (for instance, a union vector produced by Java), then it works. Otherwise it does not.
```java
List<FieldVector> children = internalMap.getChildren();
int[] typeIds = new int[children.size()];
for (ValueVector v : children) {
  typeIds[childFields.size()] = v.getMinorType().ordinal();
```
@jacques-n This is the code I am referring to that assumes the type id is the value of the enum MinorType rather than the typeIds in the schema.
Per my comment above, this isn't a violation of the spec as I read it, it simply doesn't support all variations of the spec.
The idea is that an implementation might send codes over the wire like
[5, 10, 15, 5, 10, 15]
and then the type codes tell you that 5 is the 1st type in the union, 10 the 2nd, etc.
It was my recollection during the early Arrow days that there was a desire for certain codes to have a fixed interpretation. e.g. "4" might always be "JSON_STRING" in some Arrow implementation
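The indirection described above can be sketched in a few lines of plain Python (illustrative only; function names are invented and this is not the actual Java or C++ API): the schema's typeIds list maps each on-the-wire code back to a child position, rather than the code itself being the child's ordinal.

```python
def child_indices(schema_type_ids, slot_type_codes):
    # Build a code -> child-position lookup from the schema's typeIds.
    lookup = {code: idx for idx, code in enumerate(schema_type_ids)}
    # Resolve each slot's wire code to the child vector it selects.
    return [lookup[code] for code in slot_type_codes]

# Schema declares typeIds = [5, 10, 15], so wire codes [5, 10, 15, ...]
# say "1st child, 2nd child, 3rd child, ..." even though none of the
# codes equals its child's ordinal position.
print(child_indices([5, 10, 15], [5, 10, 15, 5, 10, 15]))  # [0, 1, 2, 0, 1, 2]
```

The pre-change Java behavior corresponds to hard-coding `lookup` to the MinorType ordinals, which only works when the producer happened to use that same mapping.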
Yes, my perspective on Arrow types in the Java layer has been that one should collapse to leaf nodes (in general, I've always been against the more complex arbitrary approach the spec allows). In the situations we usually see, it doesn't make sense to ever have duplicate types within a union, and we wanted a simplified path for this scenario.
For example:

```
v1_record = {
  date
  record_id
  measure1
}

v2_record = {
  date
  record_id
  measure2
}
```

I would suggest using this representation:

```
record = {
  date
  record_id
  measure1
  measure2
}
```
This typically works well because initial union values usually come from separate streams (e.g. files), and processing is cleaner if we don't have indirection for common fields (a frequent case).
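That collapsed representation can be sketched in plain Python (a toy sketch at the row level; in Arrow this merging would happen at the schema/vector level, and the function and field names here are illustrative):

```python
def collapse(records):
    """Merge heterogeneous records into one superset schema,
    filling absent fields with None (null), instead of modeling
    them as a union of two struct types."""
    fields = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)
    # Every row gets every field; missing values become None.
    return [{name: record.get(name) for name in fields} for record in records]

rows = collapse([
    {"date": "2017-01-01", "record_id": 1, "measure1": 3.5},  # v1_record
    {"date": "2017-01-02", "record_id": 2, "measure2": 7.0},  # v2_record
])
print(sorted(rows[0]))  # ['date', 'measure1', 'measure2', 'record_id']
```

Both rows end up with the same four fields, so downstream processing needs no type-id indirection for the common columns.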
I'll take a look at the changes. I want to understand if you're introducing indirection or not.
Having the additional type codes metadata does add some complexity to using unions. If we're going to make a change to the format, we should do it for 0.8.0, since we are already breaking the metadata.
Still need to add Unions to the integration tests, but I validated this locally with the [JSON file](https://github.com/icexelloss/arrow/blob/2d6d41dbb041dbee06131b3f94a3fae4e0a90fd2/integration/data/union.json) from #987 and it seems to work:

```sh
$ ./bin/json-to-arrow.js -j ./test/data/json/union.json -a ./test/data/union-file.arrow
$ cat ./test/data/union-file.arrow | ./bin/file-to-stream.js > ./test/data/union-stream.arrow
$ cat ./test/data/union-file.arrow | node ./targets/apache-arrow/bin/arrow2csv.js
"row_id" | "union: Union<Int32 | Int64>"
0 | 0
1 | [1,0]
2 | 2
3 | [3,0]
4 | 4
5 | [5,0]
6 | 6
7 | [7,0]
8 | 8
9 | [9,0]
$ cat ./test/data/union-stream.arrow | node ./targets/apache-arrow/bin/arrow2csv.js
"row_id" | "union: Union<Int32 | Int64>"
0 | 0
1 | [1,0]
2 | 2
3 | [3,0]
4 | 4
5 | [5,0]
6 | 6
7 | [7,0]
8 | 8
9 | [9,0]
```

Author: ptaylor <paul.e.taylor@me.com>

Closes #2092 from trxcllnt/js-finish-unions and squashes the following commits:

9b199c3 <ptaylor> finish implementing unions in the JS reader
953232c <ptaylor> write schema metadata
@icexelloss, would you mind if I took this over to update the PR and make a proposal that keeps backwards compatibility with the existing Java Union implementation? Also, could you point me to the benchmark you ran? I'm having trouble finding it.
@emkornfield I think it's OK if you take over the project of reconciling the Union implementations, since this is one of the last remaining items we've wanted to tackle for a 1.0 release. I'm closing this PR for now |
Overview
This PR changes the Java UnionVector to match the Arrow spec. The integration test with C++ passes after the change.
TODO List:
Benchmark:
Write:
Read: