Conversation

@KyleGrains
Contributor

@KyleGrains KyleGrains commented Feb 12, 2022

What changes were proposed in this pull request?

Let each string-type column use its own data buffer before writing data into the ORC file, so that when one column calls buffer.resize(), the invalidated buffer.data() pointer does not affect other columns that point into it.

Why are the changes needed?

When using the csv-import tool to produce an ORC file with a schema like "struct<a:string,b:binary>", if the data in column "b" is very long (over 4 MB), the process can crash with a segmentation fault, or the exported data in column "a" can come out as an empty string.

Following the code in CSVFileImport.cc: when writing an ORC file, all string-type columns share a single data buffer inside fillStringValues(). If a value is longer than the buffer, the buffer is resized, and the resize() operation invalidates every pointer previously obtained from buffer.data().

In this case, after field "a" has written its data into the buffer, field "b" starts writing and triggers a resize, invalidating the earlier buffer.data() pointers, so field "a"'s stringBatch no longer refers to valid memory.

One option is to use a separate data buffer for each string-type column, though that requires allocating 4 MB per column. The alternative is to keep one shared data buffer but re-point every previously filled stringBatch entry to the buffer's new address after a resize() happens.

How was this patch tested?

Prepare a CSV file with two columns, where the second column holds more than 4 MB of data, e.g.:
str1,longbinaryabcd..........
Run the following command, which finishes successfully:
$ csv-import "struct<a:string,b:binary>" Testbinary.csv /tmp/test.orc
Then the following command shows the data correctly:
$ orc-contents /tmp/test.orc --columns=0
{"a": "str1"}

Let each string-type column use a different data buffer in fillStringValues(), so that when one column calls buffer.resize(), the invalidated buffer.data() does not affect the other columns.
@github-actions github-actions bot added the CPP label Feb 12, 2022
Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a PR, @KyleGrains .

@dongjoon-hyun
Member

cc @wgtmac , @guiyanakuang , @stiga-huang too

@dongjoon-hyun dongjoon-hyun added this to the 1.7.4 milestone Feb 14, 2022
Member

@dongjoon-hyun dongjoon-hyun left a comment


Could you add test coverage in TestCSVFileImport.cc, @KyleGrains?

Contributor

@stiga-huang stiga-huang left a comment


Thanks for fixing this! The solution LGTM. Just have a minor comment.

@KyleGrains
Contributor Author

KyleGrains commented Feb 16, 2022

Could you add test coverage in TestCSVFileImport.cc, @KyleGrains?
I added a 'simple' CSV file for testing string data. I'm not quite sure how to generate 4 MB of column data on the fly during the test, as opposed to adding a genuinely large test file.

TEST (TestCSVFileImport, testLongBinary) {
// create an ORC file from importing the CSV file
const std::string pgm1 = findProgram("tools/src/csv-import");
const std::string csvFile = findExample("TestCSVFileImport.testLongBinary.csv");
Contributor


I think this test does not reveal the bug; we need a test that would fail without the fix. So we need a CSV file with strings larger than 4 MB. It would be better to create the CSV file inside the test instead of checking it into this project statically.

Contributor Author


Yes, the CSV file does not cover this case. Sorry, I am not familiar enough with gtest; is there a recommended way of doing this, like generating the data on the fly?

Contributor Author


I added the CSV file creation in TestCSVFileImport.cc so the test creates a temp file.

@KyleGrains KyleGrains force-pushed the csv-import-fix branch 2 times, most recently from 70558f4 to 2ab2cae Compare February 16, 2022 08:48
}
}

TEST (TestCSVFileImport, testLongString) {
Member


Thank you for adding this.

"{\"_a\": \"str2\", \"_c\": \"var2\"}\n";
EXPECT_EQ(0, runProgram({pgm2, option, orcFile}, output, error));
EXPECT_EQ(expected, output);
EXPECT_EQ("", error);
Member


Hmm. The test case still seems to pass without your patch.

$ make package test-out
...
Test project /Users/dongjoon/APACHE/orc-merge/build
    Start 1: orc-test
1/2 Test #1: orc-test .........................   Passed    2.91 sec
    Start 2: tool-test
2/2 Test #2: tool-test ........................   Passed   12.29 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =  15.21 sec
Built target test-out

$ git diff main --stat
 tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

A valid test case should fail on the main branch without your patch. Could you take another look and make sure?

Contributor Author

@KyleGrains KyleGrains Feb 18, 2022


I am not sure whether this is a code problem or some other issue; the output looks like:
$make package test-out
...
Test project /home/kai/github/orc/build
Start 1: orc-test
1/2 Test #1: orc-test ......................... Passed 7.05 sec
Start 2: tool-test
2/2 Test #2: tool-test ........................***Failed 17.12 sec
ORC version: 1.8.0-SNAPSHOT
example dir = /home/kai/github/orc/examples
build dir = /home/kai/github/orc/build
[==========] Running 85 tests from 10 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TestCSVFileImport
[ RUN ] TestCSVFileImport.test10rows
[ OK ] TestCSVFileImport.test10rows (15 ms)
[ RUN ] TestCSVFileImport.testTimezoneOption
[ OK ] TestCSVFileImport.testTimezoneOption (29 ms)
[ RUN ] TestCSVFileImport.testLongString
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:124: Failure
Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
Which is: 1
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:125: Failure
Expected: expected
Which is: "{"_a": "str1", "_c": "var1"}\n{"_a": "str2", "_c": "var2"}\n"
To be equal to: output
Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{"_a": "str1", "_c": "var1"}
-{"_a": "str2", "_c": "var2"}\n
+""

/home/kai/github/orc/tools/test/TestCSVFileImport.cc:126: Failure
Expected: ""
To be equal to: error
Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[ FAILED ] TestCSVFileImport.testLongString (58 ms)

$git diff main --stat
tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

$md5sum tools/test/TestCSVFileImport.cc
73b578e945183da5c3fdd8d16afda5c6 tools/test/TestCSVFileImport.cc

$cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

Contributor


Without the fix, the test failed in my env too

$ tools/test/tool-test --gtest_filter='*testLongString*'
ORC version: 1.8.0-SNAPSHOT
example dir = --gtest_filter=*testLongString*
build dir = .
Note: Google Test filter = *testLongString*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestCSVFileImport
[ RUN      ] TestCSVFileImport.testLongString
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:113: Failure
      Expected: 0
To be equal to: runProgram({pgm1, schema, csvFile, orcFile}, output, error)
      Which is: 139
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:114: Failure
      Expected: ""
To be equal to: error
      Which is: "Segmentation fault (core dumped)\n"
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:122: Failure
      Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
      Which is: 1
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:123: Failure
      Expected: expected
      Which is: "{\"_a\": \"str1\", \"_c\": \"var1\"}\n{\"_a\": \"str2\", \"_c\": \"var2\"}\n"
To be equal to: output
      Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{\"_a\": \"str1\", \"_c\": \"var1\"}
-{\"_a\": \"str2\", \"_c\": \"var2\"}\n
+""

/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:124: Failure
      Expected: ""
To be equal to: error
      Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[  FAILED  ] TestCSVFileImport.testLongString (687 ms)
[----------] 1 test from TestCSVFileImport (687 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (687 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestCSVFileImport.testLongString

 1 FAILED TEST

I'm on Ubuntu 16.04.6. Reads through the invalidated pointers are undefined behavior, so I think it's okay if this catches the bug in some environments but not all.

Member


Could you try it on Mac?

Contributor Author


I tried it on macOS 10.13.6, the test passed with and without the fix.

Contributor


I can't reproduce the problem on MacOS either.

Contributor Author

@KyleGrains KyleGrains Feb 21, 2022


Maybe the clang++-compiled code does not reuse the invalidated address, so the referenced data happens to still be there; it is undefined behaviour either way.

Member


Got it. Thank you for confirming, @stiga-huang and @KyleGrains .

Contributor Author

@KyleGrains KyleGrains Feb 21, 2022


Thank you too, @stiga-huang and @dongjoon-hyun .

Contributor

@stiga-huang stiga-huang left a comment


LGTM


Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

  • Thank you for your first contribution, @KyleGrains .
  • Thank you for your time during your review/collaboration/approval here, @stiga-huang .

Merged to master/1.7.

cc @williamhyun because he is the release manager of Apache ORC 1.7.3.

@dongjoon-hyun dongjoon-hyun merged commit 51d1a4d into apache:main Feb 21, 2022
dongjoon-hyun pushed a commit that referenced this pull request Feb 21, 2022
(cherry picked from commit 51d1a4d)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

@KyleGrains I added you to the Apache ORC contributor group and assigned ORC-1116 to you.
Welcome to the Apache ORC community again.

@KyleGrains
Contributor Author

@KyleGrains I added you to the Apache ORC contributor group and assigned ORC-1116 to you. Welcome to the Apache ORC community again.

Thanks, do I need to click any button on the issue page?

@dongjoon-hyun
Member

No need from your side, @KyleGrains . :)

cxzl25 pushed a commit to cxzl25/orc that referenced this pull request Jan 11, 2024
…#1044)
