Support headers for TSV/CSV files #4198
fjy wants to merge 9 commits into apache:master from fjy:supportheader
Conversation
jihoonson
left a comment
Useful feature! I left a single comment. And please check the test failure.
```java
}

lineIterator = lineIterators.next();
parser.reset();
```
Please reset the parser here too. Also, `lineIterator` must be closed before assigning a new one.
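A minimal sketch of the suggested fix, using the names from the snippet above (`lineIterators`, `lineIterator`, `parser`); the iterator and parser types here are simplified stand-ins, not Druid's actual classes:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class FileSwitchingReader {
    // Simplified stand-in for per-file parser state.
    static class StatefulParser {
        boolean hasParsedHeader = false;
        void reset() { hasParsedHeader = false; }
    }

    // Simplified stand-in for a closeable per-file line iterator.
    static class LineIterator implements Iterator<String>, AutoCloseable {
        private final Iterator<String> delegate;
        boolean closed = false;
        LineIterator(List<String> lines) { this.delegate = lines.iterator(); }
        public boolean hasNext() { return delegate.hasNext(); }
        public String next() { return delegate.next(); }
        public void close() { closed = true; }
    }

    LineIterator lineIterator;
    final StatefulParser parser = new StatefulParser();

    // Advance to the next file: close the exhausted iterator first,
    // then reset per-file parser state, as the review asks.
    void advance(Iterator<LineIterator> lineIterators) {
        if (lineIterator != null) {
            lineIterator.close();   // close before assigning a new one
        }
        lineIterator = lineIterators.next();
        parser.reset();             // reset per-file state (e.g. header flag)
    }
}
```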
gianm
left a comment
Other than the inline comments, a couple of other things:
- Please include unit tests for this feature in `CSVParserTest` and `DelimitedParserTest`.
- The Travis test failures look legitimate.
```java
/**
 * Resets state within a parser.
 */
```
This should be a default method, since it's an interface used by druid-api. (It will make a lot of the implementations simpler too.)
The javadoc for this method should say when it's going to be called: something like "at the start of each new file after the first one". It should also specify if it's going to be called before the very first call to parse or not. IMO, we shouldn't promise anything in particular, and say something like: "may or may not be called before the very first file"
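Putting those two suggestions together, the default method plus the proposed javadoc wording might look like this sketch (the `Parser` interface shown is a simplified stand-in for the druid-api one, and `HeaderAwareParser` is a hypothetical implementation):

```java
import java.util.Collections;
import java.util.Map;

// Simplified stand-in for the druid-api Parser interface.
interface Parser<K, V> {
    Map<K, V> parseToMap(String input);

    /**
     * Resets state within a parser, e.g. at the start of each new file
     * after the first one. May or may not be called before the very
     * first file, so implementations should not rely on either behavior.
     */
    default void reset() {
        // no-op by default; stateless parsers need not override
    }
}

// A stateful implementation overrides reset(); stateless ones get the no-op.
class HeaderAwareParser implements Parser<String, String> {
    boolean hasParsedHeader = false;

    public Map<String, String> parseToMap(String input) {
        hasParsedHeader = true;
        return Collections.singletonMap("line", input);
    }

    @Override
    public void reset() { hasParsedHeader = false; }
}
```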
```diff
  @JsonProperty("listDelimiter") String listDelimiter,
- @JsonProperty("columns") List<String> columns
+ @JsonProperty("columns") List<String> columns,
+ @JsonProperty("firstRowIsHeader") boolean firstRowIsHeader
```
How about calling this `hasHeaderRow`? (It might not be the first row… maybe we'll have some "skip N rows" configs at some point.)
```diff
  public Parser<String, Object> makeParser()
  {
-   return new CSVParser(Optional.fromNullable(listDelimiter), columns);
+   if (firstRowIsHeader) {
```
This could just be `return new CSVParser(Optional.fromNullable(listDelimiter), columns, firstRowIsHeader)`.
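In other words, the factory method can hand the flag straight to the constructor instead of branching. A sketch of that shape, with simplified stand-ins for Druid's `CSVParser` and parse-spec classes (and `java.util.Optional` in place of Guava's):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified stand-in for the CSVParser with the extra constructor arg.
class CSVParser {
    final Optional<String> listDelimiter;
    final List<String> columns;
    final boolean hasHeaderRow;

    CSVParser(Optional<String> listDelimiter, List<String> columns, boolean hasHeaderRow) {
        this.listDelimiter = listDelimiter;
        this.columns = columns;
        this.hasHeaderRow = hasHeaderRow;
    }
}

// Simplified stand-in for the parse spec.
class CSVParseSpec {
    final String listDelimiter;
    final List<String> columns;
    final boolean hasHeaderRow;

    CSVParseSpec(String listDelimiter, List<String> columns, boolean hasHeaderRow) {
        this.listDelimiter = listDelimiter;
        this.columns = columns;
        this.hasHeaderRow = hasHeaderRow;
    }

    // No if/else needed: pass the flag through directly.
    CSVParser makeParser() {
        return new CSVParser(Optional.ofNullable(listDelimiter), columns, hasHeaderRow);
    }
}
```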
```diff
  @JsonProperty("listDelimiter") String listDelimiter,
- @JsonProperty("columns") List<String> columns
+ @JsonProperty("columns") List<String> columns,
+ @JsonProperty("firstRowIsHeader") boolean firstRowIsHeader
  );
  retVal.setFieldNames(columns);
  return retVal;
+ if (firstRowIsHeader) {
```
```java
private final Function<String, Object> valueFunction;

private ArrayList<String> fieldNames = null;
private boolean firstRowIsHeader = false;
```
```java
private final au.com.bytecode.opencsv.CSVParser parser = new au.com.bytecode.opencsv.CSVParser();

private ArrayList<String> fieldNames = null;
private boolean firstRowIsHeader = false;
```
```java
{
  ParserUtils.validateFields(fieldNames);
  this.fieldNames = Lists.newArrayList(fieldNames);
  if (fieldNames != null) {
```
Why ignore null `fieldNames` here?
```java
@Override
public void reset()
{
  hasParsedHeader = !firstRowIsHeader;
```
This could just be `hasParsedHeader = false`, right? Seems simpler.
```java
InputRow row = delegateFirehose.nextRow();

if (row == null) {
  continue;
```
Please include a comment about why.
I think this should be removed, because ReplayableFirehose is able to simply replay the original data without filtering any rows. Index tasks should decide whether the given row should be skipped or not.
@gianm I realized that this is to avoid marshalling nulls in ReplayableFirehose, because Jackson doesn't seem to support it. I think we have two options:
- Remove ReplayableFirehose. It will be removed in #4193 (Add PrefetchableTextFilesFirehoseFactory for cloud storage types) anyway.
- Skip null rows in ReplayableFirehose as well.
I think the first option is better. What do you think?
I think since we're planning to remove ReplayableFirehose in #4193, and it's currently just an implementation detail of the IndexTask, we could skip nulls here for now, and expect the class to be removed completely later.
Agree. I'll add a comment here.
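The agreed interim behavior (skip null rows, with a comment explaining why) might look roughly like this sketch; the method and the string-based row type are simplified stand-ins for the firehose code, not Druid's actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NullSkippingReplay {
    // Collects the non-null rows a replaying firehose would record.
    static List<String> replayableRows(Iterator<String> delegateFirehose) {
        List<String> rows = new ArrayList<>();
        while (delegateFirehose.hasNext()) {
            String row = delegateFirehose.next();
            if (row == null) {
                // Null rows are skipped here because Jackson cannot
                // marshal them for replay; #4193 is expected to remove
                // this class entirely, so this is an interim measure.
                continue;
            }
            rows.add(row);
        }
        return rows;
    }
}
```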
@gianm @jihoonson updated
@fjy The unit tests are still failing. Please run them locally.
@gianm passing now
gianm
left a comment
Code LGTM. I have some comments on the docs, which I think will be confusing as written.
> The `columns` field must match the columns of your input data in the same order.
>
> #### CSV Index Tasks
People aren't going to understand what this means. I'd suggest being much more explicit. A sentence like this should appear:

> The `hasHeaderRow` parameter is only supported for tasks of type "index", and is ignored for any other type, including any realtime or Hadoop-based tasks.
> If your file does not have a header as the first line of the file, you must set the `columns` field and ensure that the order of the fields matches the columns of your input data in the same order.
> If your file does have a header, you can set a field called `hasHeaderRow` to true, and do not include the `columns` key.
>
> Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.
This delimiter note applies to all kinds of tasks, but being under this header suggests it only applies to "index" tasks.
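For context, an ingestion-spec fragment using the parameter under discussion might look like this (a sketch only: `hasHeaderRow` is the name proposed in this review, the surrounding field names follow Druid's CSV parseSpec conventions, and all values are illustrative):

```json
{
  "parseSpec": {
    "format": "csv",
    "hasHeaderRow": true,
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```

With `hasHeaderRow` set to true, the `columns` key is omitted and the column names are taken from the first line of each file.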
Tagged Design Review as we're altering an interface and adding a config parameter.
@jihoonson let me know when I should close this PR
Closed in favor of #4254
No description provided.