ORC-1116: [C++] Fix csv-import tool when exporting long bytes #1044
Conversation
Let each string-type column use a different data buffer in the fillStringValues() function, so that when one column calls buffer.resize(), the invalidated buffer.data() does not affect other columns.
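In outline, the per-column-buffer idea looks like this (a minimal sketch rather than the tool's exact code; the 4MB size matches the tool's initial buffer, and the column count is illustrative):

```cpp
#include <orc/MemoryPool.hh>

#include <memory>
#include <vector>

int main() {
  orc::MemoryPool* pool = orc::getDefaultPool();
  const size_t numStringColumns = 2;  // e.g. struct<a:string,b:binary>

  // One buffer per string-typed column instead of a single shared buffer:
  // a resize() forced by column "b" can no longer invalidate the char*
  // pointers that column "a"'s StringVectorBatch already stored.
  std::vector<std::unique_ptr<orc::DataBuffer<char>>> buffers;
  for (size_t i = 0; i < numStringColumns; ++i) {
    buffers.emplace_back(new orc::DataBuffer<char>(*pool, 4 * 1024 * 1024));
  }
  // fillStringValues() is then called with *buffers[col] for column col.
  return 0;
}
```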
dongjoon-hyun left a comment
Thank you for making a PR, @KyleGrains.
cc @wgtmac, @guiyanakuang, @stiga-huang too
Could you add test coverage in TestCSVFileImport.cc, @KyleGrains?
stiga-huang left a comment
Thanks for fixing this! The solution LGTM. I just have a minor comment.
tools/test/TestCSVFileImport.cc (outdated):
TEST (TestCSVFileImport, testLongBinary) {
  // create an ORC file from importing the CSV file
  const std::string pgm1 = findProgram("tools/src/csv-import");
  const std::string csvFile = findExample("TestCSVFileImport.testLongBinary.csv");
I think this test does not reveal the bug. We need a test that would fail without the fix, so we need a CSV file with strings larger than 4MB. It would be better to create the CSV file in the test instead of importing it statically into this project.
Yes, the CSV file does not cover this case. Sorry, I am not familiar enough with gtest; is there a recommended way of doing this, like generating data on the fly?
I added the CSV file creation to TestCSVFileImport.cc so the test creates a temp file on the fly.
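Something along these lines, a minimal sketch of the approach rather than the exact committed test (the 5MB payload just needs to exceed the tool's 4MB initial buffer):

```cpp
#include "gtest/gtest.h"

#include <fstream>
#include <string>

TEST(TestCSVFileImport, testLongString) {
  // generate the input on the fly: the second column exceeds the 4MB buffer
  const std::string csvFile = "/tmp/test_csv_import_test_long_string.csv";
  std::ofstream csv(csvFile.c_str());
  csv << "str1," << std::string(5 * 1024 * 1024, 'x') << "\n";
  csv << "str2," << std::string(5 * 1024 * 1024, 'y') << "\n";
  csv.close();
  // ... then run csv-import and orc-contents via runProgram() and compare
  // their output against the expected JSON, as the other cases here do
}
```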
Force-pushed from 70558f4 to 2ab2cae
  }
}

TEST (TestCSVFileImport, testLongString) {
Thank you for adding this.
| "{\"_a\": \"str2\", \"_c\": \"var2\"}\n"; | ||
| EXPECT_EQ(0, runProgram({pgm2, option, orcFile}, output, error)); | ||
| EXPECT_EQ(expected, output); | ||
| EXPECT_EQ("", error); |
Hmm. The test case still seems to pass without your patch.
$ make package test-out
...
Test project /Users/dongjoon/APACHE/orc-merge/build
Start 1: orc-test
1/2 Test #1: orc-test ......................... Passed 2.91 sec
Start 2: tool-test
2/2 Test #2: tool-test ........................ Passed 12.29 sec
100% tests passed, 0 tests failed out of 2
Total Test time (real) = 15.21 sec
Built target test-out
$ git diff main --stat
tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
A valid test case should fail on the main branch without your patch. Could you take another look and make sure of that?
I am not sure if this is a code problem or some other issue; the output looks like:
$make package test-out
...
Test project /home/kai/github/orc/build
Start 1: orc-test
1/2 Test #1: orc-test ......................... Passed 7.05 sec
Start 2: tool-test
2/2 Test #2: tool-test ........................***Failed 17.12 sec
ORC version: 1.8.0-SNAPSHOT
example dir = /home/kai/github/orc/examples
build dir = /home/kai/github/orc/build
[==========] Running 85 tests from 10 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TestCSVFileImport
[ RUN ] TestCSVFileImport.test10rows
[ OK ] TestCSVFileImport.test10rows (15 ms)
[ RUN ] TestCSVFileImport.testTimezoneOption
[ OK ] TestCSVFileImport.testTimezoneOption (29 ms)
[ RUN ] TestCSVFileImport.testLongString
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:124: Failure
Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
Which is: 1
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:125: Failure
Expected: expected
Which is: "{"_a": "str1", "_c": "var1"}\n{"_a": "str2", "_c": "var2"}\n"
To be equal to: output
Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{"_a": "str1", "_c": "var1"}
-{"_a": "str2", "_c": "var2"}\n
+""
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:126: Failure
Expected: ""
To be equal to: error
Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[ FAILED ] TestCSVFileImport.testLongString (58 ms)
$git diff main --stat
tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
$md5sum tools/test/TestCSVFileImport.cc
73b578e945183da5c3fdd8d16afda5c6 tools/test/TestCSVFileImport.cc
$cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
Without the fix, the test failed in my env too:
$ tools/test/tool-test --gtest_filter='*testLongString*'
ORC version: 1.8.0-SNAPSHOT
example dir = --gtest_filter=*testLongString*
build dir = .
Note: Google Test filter = *testLongString*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestCSVFileImport
[ RUN ] TestCSVFileImport.testLongString
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:113: Failure
Expected: 0
To be equal to: runProgram({pgm1, schema, csvFile, orcFile}, output, error)
Which is: 139
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:114: Failure
Expected: ""
To be equal to: error
Which is: "Segmentation fault (core dumped)\n"
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:122: Failure
Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
Which is: 1
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:123: Failure
Expected: expected
Which is: "{\"_a\": \"str1\", \"_c\": \"var1\"}\n{\"_a\": \"str2\", \"_c\": \"var2\"}\n"
To be equal to: output
Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{\"_a\": \"str1\", \"_c\": \"var1\"}
-{\"_a\": \"str2\", \"_c\": \"var2\"}\n
+""
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:124: Failure
Expected: ""
To be equal to: error
Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[ FAILED ] TestCSVFileImport.testLongString (687 ms)
[----------] 1 test from TestCSVFileImport (687 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (687 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] TestCSVFileImport.testLongString
1 FAILED TEST
I'm on Ubuntu 16.04.6. The invalid pointers trigger undefined behavior, so I think it's OK if the test catches the bug in some environments but not all.
Could you try it on Mac?
I tried it on macOS 10.13.6; the test passed both with and without the fix.
I can't reproduce the problem on macOS either.
Maybe the clang++-compiled code is not reusing the invalidated address, so the referenced data is still there; it's undefined behaviour either way.
Got it. Thank you for confirming, @stiga-huang and @KyleGrains.
Thank you too, @stiga-huang and @dongjoon-hyun.
stiga-huang left a comment
LGTM
| "{\"_a\": \"str2\", \"_c\": \"var2\"}\n"; | ||
| EXPECT_EQ(0, runProgram({pgm2, option, orcFile}, output, error)); | ||
| EXPECT_EQ(expected, output); | ||
| EXPECT_EQ("", error); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the fix, the test failed in my env too
$ tools/test/tool-test --gtest_filter='*testLongString*'
ORC version: 1.8.0-SNAPSHOT
example dir = --gtest_filter=*testLongString*
build dir = .
Note: Google Test filter = *testLongString*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestCSVFileImport
[ RUN ] TestCSVFileImport.testLongString
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:113: Failure
Expected: 0
To be equal to: runProgram({pgm1, schema, csvFile, orcFile}, output, error)
Which is: 139
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:114: Failure
Expected: ""
To be equal to: error
Which is: "Segmentation fault (core dumped)\n"
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:122: Failure
Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
Which is: 1
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:123: Failure
Expected: expected
Which is: "{\"_a\": \"str1\", \"_c\": \"var1\"}\n{\"_a\": \"str2\", \"_c\": \"var2\"}\n"
To be equal to: output
Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{\"_a\": \"str1\", \"_c\": \"var1\"}
-{\"_a\": \"str2\", \"_c\": \"var2\"}\n
+""
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:124: Failure
Expected: ""
To be equal to: error
Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[ FAILED ] TestCSVFileImport.testLongString (687 ms)
[----------] 1 test from TestCSVFileImport (687 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (687 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] TestCSVFileImport.testLongString
1 FAILED TEST
I'm on Ubuntu 16.04.6. The invalid pointers might have undefined behaviors. I think it's ok if this catch the bug in some env but not all.
dongjoon-hyun left a comment
+1, LGTM.
- Thank you for your first contribution, @KyleGrains.
- Thank you for your time during your review/collaboration/approval here, @stiga-huang.
Merged to master/1.7.
cc @williamhyun because he is the release manager of Apache ORC 1.7.3.
### What changes were proposed in this pull request?
Let each string-type column use its own data buffer when writing data into the ORC file, so that when one column calls buffer.resize(), the invalidated buffer.data() does not affect other columns that point into it.
### Why are the changes needed?
When using the csv-import tool to export an ORC file with a schema like "struct<a:string,b:binary>", if the data in column "b" is very long (over 4MB), the process can segfault, or the exported data in column "a" can become an empty string.
Following the code in CSVFileImport.cc: when writing an ORC file, all string-type columns share one data buffer inside the fillStringValues() function. If a value is longer than the remaining buffer space, the buffer is resized, and the resize() operation invalidates every pointer previously returned by buffer.data().
In this case, after field "a" has finished writing its data into the buffer, field "b" starts writing and resizes the buffer, invalidating the previous buffer.data(); field "a"'s stringBatch then no longer refers to a valid address.
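A minimal sketch of the invalidation (sizes are illustrative; the dangling read is the undefined behaviour discussed in the review above):

```cpp
#include <orc/MemoryPool.hh>

#include <cstring>

int main() {
  orc::DataBuffer<char> buffer(*orc::getDefaultPool(), 4 * 1024 * 1024);

  std::memcpy(buffer.data(), "str1", 4);
  char* columnA = buffer.data();   // column "a" records this address

  buffer.resize(8 * 1024 * 1024);  // column "b" needs more room

  // buffer.data() may now point at a different allocation, so columnA
  // dangles; reading through it is undefined behaviour, which is why the
  // old code segfaulted on Linux but happened to pass on macOS.
  return 0;
}
```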
A workaround is to use a different data buffer for each string-type column, although that requires allocating 4MB for each. The alternative is to keep one shared data buffer but re-point all previously filled stringBatch entries to the buffer's new address after a resize() happens.
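A sketch of that re-pointing alternative (the helper name growAndRebase is hypothetical, not code from this patch):

```cpp
#include <orc/MemoryPool.hh>
#include <orc/Vector.hh>

#include <cstdint>

// Grow the shared buffer to at least `needed` bytes, then rebase every
// char* the batch has stored so far onto the buffer's new base address.
void growAndRebase(orc::DataBuffer<char>& buffer, uint64_t needed,
                   orc::StringVectorBatch& batch, uint64_t rowsFilled) {
  char* oldBase = buffer.data();
  uint64_t newSize = buffer.size();
  while (newSize < needed) {
    newSize *= 2;
  }
  buffer.resize(newSize);  // may move the allocation
  char* newBase = buffer.data();
  if (newBase != oldBase) {
    for (uint64_t j = 0; j < rowsFilled; ++j) {
      batch.data[j] = batch.data[j] - oldBase + newBase;
    }
  }
}
```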
### How was this patch tested?
Prepared a CSV file with two columns, where the second column holds more than 4MB of data, e.g.:
str1,longbinaryabcd..........
Ran the following command, which finished successfully:
$ csv-import "struct<a:string,b:binary>" Testbinary.csv /tmp/test.orc
Ran the following command, which shows the data correctly:
$ orc-contents /tmp/test.orc --columns=0
{"a": "str1"}
(cherry picked from commit 51d1a4d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@KyleGrains I added you to the Apache ORC contributor group and assigned ORC-1116 to you.
Thanks, do I need to click any button on the issue page?
No need from your side, @KyleGrains. :)