ORC-29. Enable ColumnPrinter to print only selected columns. #9
Changes from all commits
0b429fa
b41b8f1
9e7f280
2773e15
eba3b24
fb0acce
4f78e54
@@ -31,8 +31,9 @@ void printContents(const char* filename, const orc::ReaderOptions opts) {

   std::unique_ptr<orc::ColumnVectorBatch> batch = reader->createRowBatch(1000);
   std::string line;
+  const std::vector<bool> selectedColumns = reader->getSelectedColumns();
   std::unique_ptr<orc::ColumnPrinter> printer =
-    createColumnPrinter(line, reader->getType());
+    createColumnPrinter(line, reader->getType(), &selectedColumns);

   while (reader->next(*batch)) {
     printer->reset(*batch);
@@ -48,12 +49,36 @@ void printContents(const char* filename, const orc::ReaderOptions opts) {

 int main(int argc, char* argv[]) {
   if (argc < 2) {
-    std::cout << "Usage: file-contents <filename>\n";
+    std::cout << "Usage: file-contents <filename> [--columns=1,2,...]\n"
+              << "Print contents of <filename>.\n"
+              << "If columns are specified, only these top-level (logical) columns are printed.\n" ;
     return 1;
   }
   try {
+    const std::string COLUMNS_PREFIX = "--columns=";
+    std::list<int64_t> cols;
+    char* filename = ORC_NULLPTR;
+
+    // Read command-line options
+    char *param, *value;
+    for (int i = 1; i < argc; i++) {
+      if ( (param = std::strstr(argv[i], COLUMNS_PREFIX.c_str())) ) {
Contributor
What are the semantics? Are the fields above the selected ones automatically included? What about the types below the selected one?
Contributor
It would be much more user-friendly to select by column name rather than column id, which is hard to know when the schema has complex types. At that point, you might start by selecting only top-level columns with something like "--columns=field1,field12", which would mean all of the types under those columns. Eventually, it would be nice to support virtual column names like "length" and "value" for lists, and "length", "key", and "value" for maps. Nested structures would look like "outer12.inner3" or "outer12.key".
Author
Yes, originally I wanted to use column names, too. Unfortunately, they are optional. If column names are missing, how would a user select specific columns?
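One way to reconcile the two positions above is to accept either form in the same option: an all-digit entry is read as a position, and anything else is looked up as a top-level field name. The sketch below is only an illustration of that idea, not what this PR implements. `resolveTopLevelColumn` and its `fieldNames` argument are hypothetical, and the 1-based numbering is an assumption taken from the `--columns=2` example later in this discussion.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper; not part of this PR or of the ORC API.
// `fieldNames` holds the top-level field names in file order, or is empty
// when the file stores no names. `token` is one entry from --columns.
// Returns the 1-based top-level position (1-based is an assumption).
int64_t resolveTopLevelColumn(const std::vector<std::string>& fieldNames,
                              const std::string& token) {
  if (token.empty()) {
    throw std::invalid_argument("empty column selector");
  }
  // All-digit entries are positions, so selection still works without names.
  if (token.find_first_not_of("0123456789") == std::string::npos) {
    return static_cast<int64_t>(std::stoll(token));
  }
  // Otherwise treat the entry as a top-level field name.
  for (std::size_t i = 0; i < fieldNames.size(); ++i) {
    if (fieldNames[i] == token) {
      return static_cast<int64_t>(i) + 1;
    }
  }
  throw std::invalid_argument("unknown column name: '" + token + "'");
}
```

A field whose name consists only of digits could not be selected by name under this rule, which is one reason to treat it as a sketch rather than a proposal.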
+        value = std::strtok(param+COLUMNS_PREFIX.length(), "," );
+        while (value) {
+          cols.push_back(std::atoi(value));
+          value = std::strtok(nullptr, "," );
+        }
+      } else {
+        filename = argv[i];
+      }
+    }
     orc::ReaderOptions opts;
-    printContents(argv[1], opts);
+    if (cols.size() > 0) {
+      opts.include(cols);
+    }
+    if (filename != ORC_NULLPTR) {
+      printContents(filename, opts);
+    }
   } catch (std::exception& ex) {
     std::cerr << "Caught exception: " << ex.what() << "\n";
     return 1;
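A few details of the option parsing in this diff are worth noting: `std::strstr` matches `--columns=` anywhere in an argument (so a filename containing that text would be taken as the option), `std::strtok` writes into `argv`, and `std::atoi` returns 0 for non-numeric input instead of reporting an error. A minimal sketch of a stricter parser follows, assuming the same comma-separated syntax. `parseColumnsOption` is a hypothetical helper and is not part of the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <stdexcept>
#include <string>

// Hypothetical helper; not part of this PR or of the ORC API.
// Returns false if `arg` is not a --columns option.
// Returns true and appends the parsed ids to `cols` otherwise.
// Throws std::invalid_argument on an empty, negative, or non-numeric entry.
bool parseColumnsOption(const std::string& arg, std::list<int64_t>& cols) {
  const std::string prefix = "--columns=";
  // Require the prefix at position 0 so other arguments stay filenames.
  if (arg.compare(0, prefix.size(), prefix) != 0) {
    return false;
  }
  std::size_t start = prefix.size();
  while (start <= arg.size()) {
    std::size_t end = arg.find(',', start);
    if (end == std::string::npos) {
      end = arg.size();
    }
    const std::string token = arg.substr(start, end - start);
    std::size_t used = 0;
    long long value = 0;
    try {
      value = std::stoll(token, &used);
    } catch (const std::exception&) {
      throw std::invalid_argument("bad column id: '" + token + "'");
    }
    // Reject trailing garbage such as "1x" and negative ids.
    if (used != token.size() || value < 0) {
      throw std::invalid_argument("bad column id: '" + token + "'");
    }
    cols.push_back(static_cast<int64_t>(value));
    start = end + 1;
  }
  return true;
}
```

Because the helper throws a `std::exception` subclass, the existing `catch` block in `main` would report a malformed option without further changes.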
You also need to push the selected columns down through the list, map, and union types. Otherwise, you won't be able to select columns below them.
Actually, this is the intended implementation: only specify top-level (logical) columns. Otherwise, there is no way to distinguish between logical and physical columns.
For example, if an ORC file contains columns INT, STRUCT<STRING, BOOLEAN>, running
./file-contents --columns=2 file.orc
will select the STRUCT column. If selection of subcolumns were allowed, it would be unclear which column the above command selects: STRUCT or STRING.
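The ambiguity can be made concrete with a small sketch. It numbers the example schema in two ways: by top-level column only, and by leaf column only. The second scheme is an assumed reading of "physical columns" and is used here only for illustration. `TypeNode` and the functions below are hypothetical and do not correspond to `orc::Type` or any ORC API.

```cpp
#include <string>
#include <vector>

// Hypothetical miniature type tree; this is not orc::Type.
struct TypeNode {
  std::string kind;
  std::vector<TypeNode> children;
};

// Scheme A: number only the direct children of the root, starting at 1.
std::vector<std::string> topLevelNumbering(const TypeNode& root) {
  std::vector<std::string> byIndex{"<unused>"};  // slot 0 keeps indices 1-based
  for (const TypeNode& child : root.children) {
    byIndex.push_back(child.kind);
  }
  return byIndex;
}

// Depth-first collection of leaf (primitive) types below `node`.
void collectLeaves(const TypeNode& node, std::vector<std::string>& out) {
  if (node.children.empty()) {
    out.push_back(node.kind);
    return;
  }
  for (const TypeNode& child : node.children) {
    collectLeaves(child, out);
  }
}

// Scheme B (assumed alternative): number only the leaves, starting at 1.
std::vector<std::string> leafNumbering(const TypeNode& root) {
  std::vector<std::string> byIndex{"<unused>"};  // slot 0 keeps indices 1-based
  for (const TypeNode& child : root.children) {
    collectLeaves(child, byIndex);
  }
  return byIndex;
}

// The schema from the comment above: INT, STRUCT<STRING, BOOLEAN>.
TypeNode exampleSchema() {
  return TypeNode{"STRUCT", {
      TypeNode{"INT", {}},
      TypeNode{"STRUCT", {TypeNode{"STRING", {}}, TypeNode{"BOOLEAN", {}}}}}};
}
```

With this schema, index 2 is STRUCT under the top-level scheme and STRING under the leaf scheme, which is the conflict described above.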