Skip to content

Support headers for TSV/CSV files#4198

Closed
fjy wants to merge 9 commits intoapache:masterfrom
fjy:supportheader
Closed

Support headers for TSV/CSV files#4198
fjy wants to merge 9 commits intoapache:masterfrom
fjy:supportheader

Conversation

@fjy
Copy link
Copy Markdown
Contributor

@fjy fjy commented Apr 23, 2017

No description provided.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful feature! I left a single comment. And please check the test failure.

}

lineIterator = lineIterators.next();
parser.reset();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reset the parser at here too. Also, lineIterator must be closed before assigning new one there.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the comments, couple other things:

  • Please include unit tests for this feature in CSVParserTest & DelimitedParserTest
  • The travis test failures look legit


/**
* Resets state within a parser.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a default method since it's an interface used by druid-api. (will make a lot of the implementations simpler too)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The javadoc for this method should say when it's going to be called: something like "at the start of each new file after the first one". It should also specify if it's going to be called before the very first call to parse or not. IMO, we shouldn't promise anything in particular, and say something like: "may or may not be called before the very first file"

@JsonProperty("listDelimiter") String listDelimiter,
@JsonProperty("columns") List<String> columns
@JsonProperty("columns") List<String> columns,
@JsonProperty("firstRowIsHeader") boolean firstRowIsHeader
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling this hasHeaderRow? (It might not be the first row… maybe we'll have some "skip N rows" configs at some point)

public Parser<String, Object> makeParser()
{
return new CSVParser(Optional.fromNullable(listDelimiter), columns);
if (firstRowIsHeader) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be return new CSVParser(Optional.fromNullable(listDelimiter), columns, firstRowIsHeader)

@JsonProperty("listDelimiter") String listDelimiter,
@JsonProperty("columns") List<String> columns
@JsonProperty("columns") List<String> columns,
@JsonProperty("firstRowIsHeader") boolean firstRowIsHeader
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comments to CSV

);
retVal.setFieldNames(columns);
return retVal;
if (firstRowIsHeader) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if isn't needed

private final Function<String, Object> valueFunction;

private ArrayList<String> fieldNames = null;
private boolean firstRowIsHeader = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be final.

private final au.com.bytecode.opencsv.CSVParser parser = new au.com.bytecode.opencsv.CSVParser();

private ArrayList<String> fieldNames = null;
private boolean firstRowIsHeader = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be final.

{
ParserUtils.validateFields(fieldNames);
this.fieldNames = Lists.newArrayList(fieldNames);
if (fieldNames != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why ignore null fieldNames here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm @fjy should null fieldNames be ignored? Other Parsers seem not to do.

@Override
public void reset()
{
hasParsedHeader = !firstRowIsHeader;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could just be hasParsedHeader = false right? Seems simpler

InputRow row = delegateFirehose.nextRow();

if (row == null) {
continue;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include a comment about why.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be removed because ReplayableFirehose is able to simply replay the original data without filtering any rows. Index tasks should decide that the given row will be skipped or not.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm I realized that this is to avoid marshalling nulls in ReplayableFirehose because jackson seems to not support it. I think we have two options.

I think the first option is better. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think since we're planning to remove ReplayableFirehose in #4193, and it's currently just an implementation detail of the IndexTask, we could skip nulls here for now, and expect the class to be removed completely later.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I'll add a comment here.

@gianm gianm added the Feature label Apr 25, 2017
@fjy
Copy link
Copy Markdown
Contributor Author

fjy commented Apr 26, 2017

@gianm @jihoonson updated

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Apr 26, 2017

@fjy The unit tests are still failing. Please run them locally.

@fjy
Copy link
Copy Markdown
Contributor Author

fjy commented Apr 27, 2017

@gianm passing now

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM. Have some comments on the docs, which I feel will be confusing.

```

The `columns` field must match the columns of your input data in the same order.
#### CSV Index Tasks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People aren't going to understand what this means. I'd suggest being much more explicit. A sentence like this should appear:

The hasHeaderRow parameter is only supported for tasks of type "index", and is ignored for any other type, including any realtime or Hadoop-based tasks.

If your file does not have a header as the first line of the file, you must set the `columns` field and ensure that the order of the fields matches the columns of your input data in the same order.
If your file does have a header, you can set a field called `hasHeaderRow` to true, and do not include the `columns` key.

Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This delimiter note applies to all kinds of tasks, but being under this header suggests it only applies to "index" tasks.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented May 4, 2017

Tagged Design Review as we're altering an interface and adding a config parameter.

@fjy
Copy link
Copy Markdown
Contributor Author

fjy commented May 10, 2017

@jihoonson let me know when i should close this PR

@jihoonson
Copy link
Copy Markdown
Contributor

@fjy I've just merged this patch into #4254.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented May 23, 2017

Closed in favor of #4254

@gianm gianm closed this May 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants