Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 8 additions & 7 deletions tools/src/CSVFileImport.cc
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
#include <algorithm>
#include <fstream>
#include <iostream>
#include <list>
#include <memory>
#include <getopt.h>
#include <string>
Expand Down Expand Up @@ -77,8 +78,8 @@ void fillStringValues(const std::vector<std::string>& data,
orc::ColumnVectorBatch* batch,
uint64_t numValues,
uint64_t colIndex,
orc::DataBuffer<char>& buffer,
uint64_t& offset) {
orc::DataBuffer<char>& buffer) {
uint64_t offset = 0;
orc::StringVectorBatch* stringBatch =
dynamic_cast<orc::StringVectorBatch*>(batch);
bool hasNull = false;
Expand Down Expand Up @@ -365,7 +366,8 @@ int main(int argc, char* argv[]) {
double totalElapsedTime = 0.0;
clock_t totalCPUTime = 0;

orc::DataBuffer<char> buffer(*orc::getDefaultPool(), 4 * 1024 * 1024);
typedef std::list<orc::DataBuffer<char>> DataBufferList;
DataBufferList bufferList;

orc::WriterOptions options;
options.setStripeSize(stripeSize);
Expand All @@ -385,7 +387,6 @@ int main(int argc, char* argv[]) {
std::ifstream finput(input.c_str());
while (!eof) {
uint64_t numValues = 0; // num of lines read in a batch
uint64_t bufferOffset = 0; // current offset in the string buffer

data.clear();
memset(rowBatch->notNull.data(), 1, batchSize);
Expand Down Expand Up @@ -420,13 +421,13 @@ int main(int argc, char* argv[]) {
case orc::STRING:
case orc::CHAR:
case orc::VARCHAR:
case orc::BINARY:
case orc::BINARY:
bufferList.emplace_back(*orc::getDefaultPool(), 1 * 1024 * 1024);
fillStringValues(data,
structBatch->fields[i],
numValues,
i,
buffer,
bufferOffset);
bufferList.back());
break;
case orc::FLOAT:
case orc::DOUBLE:
Expand Down
35 changes: 35 additions & 0 deletions tools/test/TestCSVFileImport.cc
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,8 @@
#include "wrap/gmock.h"
#include "wrap/gtest-wrapper.h"

#include <fstream>

TEST (TestCSVFileImport, test10rows) {
// create an ORC file from importing the CSV file
const std::string pgm1 = findProgram("tools/src/csv-import");
Expand Down Expand Up @@ -90,3 +92,36 @@ TEST (TestCSVFileImport, testTimezoneOption) {
EXPECT_EQ("", error);
}
}

TEST (TestCSVFileImport, testLongString) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this.

// create an ORC file from importing the CSV file
const std::string pgm1 = findProgram("tools/src/csv-import");
const std::string csvFile = "/tmp/test_csv_import_test_long_string.csv";
const std::string orcFile = "/tmp/test_csv_import_test_long_string.orc";
const std::string schema = "'struct<_a:string,b_:binary,_c:varchar(10)>'";
std::string output;
std::string error;

std::ofstream csvFileStream(csvFile, std::ios::binary | std::ios::out | std::ios::trunc);
if(csvFileStream.is_open())
{
std::string longStr;
longStr.resize(4 * 1024 * 1024 + 1, 'x');
csvFileStream << "str1," << longStr << ",var1\n";
csvFileStream << "str2," << longStr << ",var2\n";
csvFileStream.close();
}

EXPECT_EQ(0, runProgram({pgm1, schema, csvFile, orcFile}, output, error));
EXPECT_EQ("", error);

// verify the ORC file content
const std::string pgm2 = findProgram("tools/src/orc-contents");
std::string option = "--columns=0,2";
const std::string expected =
"{\"_a\": \"str1\", \"_c\": \"var1\"}\n"
"{\"_a\": \"str2\", \"_c\": \"var2\"}\n";
EXPECT_EQ(0, runProgram({pgm2, option, orcFile}, output, error));
EXPECT_EQ(expected, output);
EXPECT_EQ("", error);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. The test case seems to pass without your patch still.

$ make package test-out
...
Test project /Users/dongjoon/APACHE/orc-merge/build
    Start 1: orc-test
1/2 Test #1: orc-test .........................   Passed    2.91 sec
    Start 2: tool-test
2/2 Test #2: tool-test ........................   Passed   12.29 sec

100% tests passed, 0 tests failed out of 2

Total Test time (real) =  15.21 sec
Built target test-out

$ git diff main --stat
 tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

The valid test case should fail on main branch without your patch. Could you take a look at this once more and make it sure?

Copy link
Contributor Author

@KyleGrains KyleGrains Feb 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is a code problem or some other issues, the output like:
$make package test-out
...
Test project /home/kai/github/orc/build
Start 1: orc-test
1/2 Test #1: orc-test ......................... Passed 7.05 sec
Start 2: tool-test
2/2 Test #2: tool-test ........................***Failed 17.12 sec
ORC version: 1.8.0-SNAPSHOT
example dir = /home/kai/github/orc/examples
build dir = /home/kai/github/orc/build
[==========] Running 85 tests from 10 test cases.
[----------] Global test environment set-up.
[----------] 3 tests from TestCSVFileImport
[ RUN ] TestCSVFileImport.test10rows
[ OK ] TestCSVFileImport.test10rows (15 ms)
[ RUN ] TestCSVFileImport.testTimezoneOption
[ OK ] TestCSVFileImport.testTimezoneOption (29 ms)
[ RUN ] TestCSVFileImport.testLongString
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:124: Failure
Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
Which is: 1
/home/kai/github/orc/tools/test/TestCSVFileImport.cc:125: Failure
Expected: expected
Which is: "{"_a": "str1", "_c": "var1"}\n{"_a": "str2", "_c": "var2"}\n"
To be equal to: output
Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{"_a": "str1", "_c": "var1"}
-{"_a": "str2", "_c": "var2"}\n
+""

/home/kai/github/orc/tools/test/TestCSVFileImport.cc:126: Failure
Expected: ""
To be equal to: error
Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[ FAILED ] TestCSVFileImport.testLongString (58 ms)

$git diff main --stat
tools/test/TestCSVFileImport.cc | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

$md5sum tools/test/TestCSVFileImport.cc
73b578e945183da5c3fdd8d16afda5c6 tools/test/TestCSVFileImport.cc

$cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix, the test failed in my env too

$ tools/test/tool-test --gtest_filter='*testLongString*'
ORC version: 1.8.0-SNAPSHOT
example dir = --gtest_filter=*testLongString*
build dir = .
Note: Google Test filter = *testLongString*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TestCSVFileImport
[ RUN      ] TestCSVFileImport.testLongString
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:113: Failure
      Expected: 0
To be equal to: runProgram({pgm1, schema, csvFile, orcFile}, output, error)
      Which is: 139
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:114: Failure
      Expected: ""
To be equal to: error
      Which is: "Segmentation fault (core dumped)\n"
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:122: Failure
      Expected: 0
To be equal to: runProgram({pgm2, option, orcFile}, output, error)
      Which is: 1
/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:123: Failure
      Expected: expected
      Which is: "{\"_a\": \"str1\", \"_c\": \"var1\"}\n{\"_a\": \"str2\", \"_c\": \"var2\"}\n"
To be equal to: output
      Which is: ""
With diff:
@@ -1,2 +1,1 @@
-{\"_a\": \"str1\", \"_c\": \"var1\"}
-{\"_a\": \"str2\", \"_c\": \"var2\"}\n
+""

/home/quanlong/workspace/orc/tools/test/TestCSVFileImport.cc:124: Failure
      Expected: ""
To be equal to: error
      Which is: "Caught exception in /tmp/test_csv_import_test_long_string.orc: File size too small\n"
[  FAILED  ] TestCSVFileImport.testLongString (687 ms)
[----------] 1 test from TestCSVFileImport (687 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (687 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestCSVFileImport.testLongString

 1 FAILED TEST

I'm on Ubuntu 16.04.6. The invalid pointers might have undefined behaviors. I think it's ok if this catch the bug in some env but not all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you try it on Mac?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it on macOS 10.13.6, the test passed with and without the fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't reproduce the problem on MacOS either.

Copy link
Contributor Author

@KyleGrains KyleGrains Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the clang++ compiled code is not reusing the invalidated address, so the referenced data is still there, undefined behaviour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thank you for confirming, @stiga-huang and @KyleGrains .

Copy link
Contributor Author

@KyleGrains KyleGrains Feb 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you too, @stiga-huang and @dongjoon-hyun .

}