Conversation

@KyleGrains
Contributor

@KyleGrains KyleGrains commented Feb 12, 2022

What changes were proposed in this pull request?

Let each string-type column use its own data buffer before writing data into the ORC file, so that when one column calls buffer.resize(), the invalidated buffer.data() pointer does not affect other columns that point into it.

Why are the changes needed?

When using the csv-import tool to produce an ORC file with a schema like "struct<a:string,b:binary>", if the data in column "b" is very long (over 4 MB), the process can crash with a segmentation fault, or the exported data in column "a" can come out as an empty string.

Following the code in CSVFileImport.cc: when writing an ORC file, all string-type columns share a single data buffer inside fillStringValues(). If a value is longer than the buffer, the buffer is resized, and the resize() operation invalidates every pointer previously obtained from buffer.data().

In this case, after field "a" has written its data into the buffer, field "b" starts writing and triggers a resize, invalidating the earlier buffer.data() pointers, so field "a"'s stringBatch no longer refers to valid memory.

One option is to use a separate data buffer for each string-type column, though that requires allocating 4 MB per column. The alternative is to keep one shared data buffer but re-point every previously filled stringBatch entry to the buffer's new address after a resize() happens.

How was this patch tested?

Prepare a CSV file with two columns, where the second column holds more than 4 MB of data, e.g.:
str1,longbinaryabcd..........
Run the following command, which finishes successfully:
$ csv-import "struct<a:string,b:binary>" Testbinary.csv /tmp/test.orc
Then the following command shows the data correctly:
$ orc-contents /tmp/test.orc --columns=0
{"a": "str1"}

Let each string-type column use a different data buffer in fillStringValues(), so that when one column calls buffer.resize(), the invalidated buffer.data() does not affect the other columns.
@github-actions github-actions bot added the CPP label Feb 12, 2022
Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @KyleGrains .

@dongjoon-hyun
Member

cc @wgtmac , @guiyanakuang , @stiga-huang too

@dongjoon-hyun dongjoon-hyun added this to the 1.7.4 milestone Feb 14, 2022
Member

@dongjoon-hyun dongjoon-hyun left a comment


Could you add test coverage in TestCSVFileImport.cc, @KyleGrains?

Contributor

@stiga-huang stiga-huang left a comment


Thanks for fixing this! The solution LGTM. Just have a minor comment.

@KyleGrains
Contributor Author

KyleGrains commented Feb 16, 2022

Could you add test coverage in TestCSVFileImport.cc, @KyleGrains?
I added a 'simple' CSV file for testing string data. I'm not quite sure how to generate 4 MB of column data on the fly during the test, as opposed to adding a genuinely large test file.

TEST (TestCSVFileImport, testLongBinary) {
// create an ORC file from importing the CSV file
const std::string pgm1 = findProgram("tools/src/csv-import");
const std::string csvFile = findExample("TestCSVFileImport.testLongBinary.csv");
Contributor


I think this test does not reveal the bug; we need a test that would fail without the fix. So we need a CSV file with strings larger than 4 MB. It would be better to create the CSV file inside the test instead of checking it into this project statically.

Contributor Author


Yes, the CSV file does not cover this case. Sorry, I am not familiar enough with gtest; is there a recommended way of doing this, like generating the data on the fly?

Contributor Author


I added the CSV file creation in TestCSVFileImport.cc so the test creates a temp file.

@KyleGrains KyleGrains force-pushed the csv-import-fix branch 2 times, most recently from 70558f4 to 2ab2cae Compare February 16, 2022 08:48
}
}

TEST (TestCSVFileImport, testLongString) {
Member


Thank you for adding this.

"{\"_a\": \"str2\", \"_c\": \"var2\"}\n";
EXPECT_EQ(0, runProgram({pgm2, option, orcFile}, output, error));
EXPECT_EQ(expected, output);
EXPECT_EQ("", error);
Member


Hmm. The test case still seems to pass without your patch.

$ make package test-out
...
Test project /Users/dongjoon/APACHE/orc-merge/build
    Start 1: orc-test
1/2 Test #1: orc-test .........................   Passed    2.91 sec
    Start 2: tool-test
2/2 Test #2: tool-test ........................   Passed   12.29 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =  15.21 sec
Built target test-out

$ git diff main --stat
 tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

A valid test case should fail on the main branch without your patch. Could you take another look and make sure?

Contributor Author

@KyleGrains KyleGrains Feb 18, 2022


I am not sure whether this is a code problem or some other issue; the output looks like:
$make package test-out
...
Test project /home/kai/github/orc/build
Start 1: orc-test
1/2 Test #1: orc-test ......................... Passed 7.05 sec
Start 2: tool-test
2/2 Test #2: tool-test ........................***Failed 17.12 sec
ORC version: 1.8.0-SNAPSHOT
example dir = /home/kai/github/orc/examples
build dir = /home/kai/github/orc/build
[==========] Running 85 tests from 10 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TestCSVFileImport
[ RUN ] TestCSVFileImport.test10rows
[ OK ] TestCSVFileImport.test10rows (15 ms)
[ RUN ] TestCSVFileImport.testTimezoneOption
[ OK ] TestCSVFileImport.testTimezoneOption (29 ms)
[ RUN ] TestCSVFileImport.testLongString
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:124: Failure
Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
Which is: 1
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:125: Failure
Expected: expected
Which is: "{"_a": "str1", "_c": "var1"}\n{"_a": "str2", "_c": "var2"}\n"
To be equal to: output
Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{"_a": "str1", "_c": "var1"}
-{"_a": "str2", "_c": "var2"}\n
+""

/home/kai/github/orc/tools/test/TestCSVFileImport.cc:126: Failure
Expected: ""
To be equal to: error
Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[ FAILED ] TestCSVFileImport.testLongString (58 ms)

$git diff main --stat
tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

$md5sum tools/test/TestCSVFileImport.cc
73b578e945183da5c3fdd8d16afda5c6 tools/test/TestCSVFileImport.cc

$cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

Contributor


Without the fix, the test failed in my env too

$ tools/test/tool-test --gtest_filter='*testLongString*'
ORC version: 1.8.0-SNAPSHOT
example dir = --gtest_filter=*testLongString*
build dir = .
Note: Google Test filter = *testLongString*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestCSVFileImport
[ RUN      ] TestCSVFileImport.testLongString
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:113: Failure
      Expected: 0
To be equal to: runProgram({pgm1, schema, csvFile, orcFile}, output, error)
      Which is: 139
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:114: Failure
      Expected: ""
To be equal to: error
      Which is: "Segmentation fault (core dumped)\n"
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:122: Failure
      Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
      Which is: 1
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:123: Failure
      Expected: expected
      Which is: "{\"_a\": \"str1\", \"_c\": \"var1\"}\n{\"_a\": \"str2\", \"_c\": \"var2\"}\n"
To be equal to: output
      Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{\"_a\": \"str1\", \"_c\": \"var1\"}
-{\"_a\": \"str2\", \"_c\": \"var2\"}\n
+""

/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:124: Failure
      Expected: ""
To be equal to: error
      Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[  FAILED  ] TestCSVFileImport.testLongString (687 ms)
[----------] 1 test from TestCSVFileImport (687 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (687 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestCSVFileImport.testLongString

 1 FAILED TEST

I'm on Ubuntu 16.04.6. Reads through the invalidated pointers are undefined behavior, so I think it's okay if this catches the bug in some environments but not all.

Member


Could you try it on Mac?

Contributor Author


I tried it on macOS 10.13.6, the test passed with and without the fix.

Contributor


I can't reproduce the problem on MacOS either.

Contributor Author

@KyleGrains KyleGrains Feb 21, 2022


Maybe the clang++-compiled code does not reuse the invalidated address, so the referenced data happens to still be there; it is undefined behaviour either way.

Member


Got it. Thank you for confirming, @stiga-huang and @KyleGrains .

Contributor Author

@KyleGrains KyleGrains Feb 21, 2022


Thank you too, @stiga-huang and @dongjoon-hyun .

Contributor

@stiga-huang stiga-huang left a comment


LGTM


Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

  • Thank you for your first contribution, @KyleGrains .
  • Thank you for your time during your review/collaboration/approval here, @stiga-huang .

Merged to master/1.7.

cc @williamhyun because he is the release manager of Apache ORC 1.7.3.

@dongjoon-hyun dongjoon-hyun merged commit 51d1a4d into apache:main Feb 21, 2022
dongjoon-hyun pushed a commit that referenced this pull request Feb 21, 2022
(cherry picked from commit 51d1a4d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

@KyleGrains I added you to the Apache ORC contributor group and assigned ORC-1116 to you.
Welcome to the Apache ORC community again.

@KyleGrains
Contributor Author

@KyleGrains I added you to the Apache ORC contributor group and assigned ORC-1116 to you. Welcome to the Apache ORC community again.

Thanks, do I need to click any button on the issue page?

@dongjoon-hyun
Member

No need from your side, @KyleGrains . :)

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…#1044)
