Support headers for TSV/CSV files #4198
fjy wants to merge 9 commits into apache:master from fjy:supportheader
Conversation
jihoonson
left a comment
Useful feature! I left a single comment. And please check the test failure.
```java
}

lineIterator = lineIterators.next();
parser.reset();
```
Please reset the parser here too. Also, `lineIterator` must be closed before assigning a new one.
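A minimal sketch of the suggested fix, using the names from the snippet above (`lineIterators`, `lineIterator`, `parser`); the iterator and parser types here are simplified stand-ins, not Druid's actual classes:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class FileSwitchingReader {
    // Simplified stand-in for per-file parser state.
    static class StatefulParser {
        boolean hasParsedHeader = false;
        void reset() { hasParsedHeader = false; }
    }

    // Simplified stand-in for a closeable per-file line iterator.
    static class LineIterator implements Iterator<String>, AutoCloseable {
        private final Iterator<String> delegate;
        boolean closed = false;
        LineIterator(List<String> lines) { this.delegate = lines.iterator(); }
        public boolean hasNext() { return delegate.hasNext(); }
        public String next() { return delegate.next(); }
        public void close() { closed = true; }
    }

    LineIterator lineIterator;
    final StatefulParser parser = new StatefulParser();

    // Advance to the next file: close the exhausted iterator first,
    // then reset per-file parser state, as the review asks.
    void advance(Iterator<LineIterator> lineIterators) {
        if (lineIterator != null) {
            lineIterator.close();   // close before assigning a new one
        }
        lineIterator = lineIterators.next();
        parser.reset();             // reset per-file state (e.g. header flag)
    }
}
```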
gianm
left a comment
Other than the inline comments, a couple of other things:
- Please include unit tests for this feature in `CSVParserTest` and `DelimitedParserTest`.
- The Travis test failures look legitimate.
```java
/**
 * Resets state within a parser.
 */
```
This should be a default method, since it's an interface used by druid-api. (It will make a lot of the implementations simpler too.)
The javadoc for this method should say when it's going to be called: something like "at the start of each new file after the first one". It should also specify if it's going to be called before the very first call to parse or not. IMO, we shouldn't promise anything in particular, and say something like: "may or may not be called before the very first file"
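Putting those two suggestions together, the default method plus the proposed javadoc wording might look like this sketch (the `Parser` interface shown is a simplified stand-in for the druid-api one, and `HeaderAwareParser` is a hypothetical implementation):

```java
import java.util.Collections;
import java.util.Map;

// Simplified stand-in for the druid-api Parser interface.
interface Parser<K, V> {
    Map<K, V> parseToMap(String input);

    /**
     * Resets state within a parser, e.g. at the start of each new file
     * after the first one. May or may not be called before the very
     * first file, so implementations should not rely on either behavior.
     */
    default void reset() {
        // no-op by default; stateless parsers need not override
    }
}

// A stateful implementation overrides reset(); stateless ones get the no-op.
class HeaderAwareParser implements Parser<String, String> {
    boolean hasParsedHeader = false;

    public Map<String, String> parseToMap(String input) {
        hasParsedHeader = true;
        return Collections.singletonMap("line", input);
    }

    @Override
    public void reset() { hasParsedHeader = false; }
}
```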
```diff
  @JsonProperty("listDelimiter") String listDelimiter,
- @JsonProperty("columns") List<String> columns
+ @JsonProperty("columns") List<String> columns,
+ @JsonProperty("firstRowIsHeader") boolean firstRowIsHeader
```
How about calling this `hasHeaderRow`? (It might not be the first row… maybe we'll have some "skip N rows" configs at some point.)
```diff
  public Parser<String, Object> makeParser()
  {
-   return new CSVParser(Optional.fromNullable(listDelimiter), columns);
+   if (firstRowIsHeader) {
```
This could just be `return new CSVParser(Optional.fromNullable(listDelimiter), columns, firstRowIsHeader)`.
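In other words, the factory method can hand the flag straight to the constructor instead of branching. A sketch of that shape, with simplified stand-ins for Druid's `CSVParser` and parse-spec classes (and `java.util.Optional` in place of Guava's):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Simplified stand-in for the CSVParser with the extra constructor arg.
class CSVParser {
    final Optional<String> listDelimiter;
    final List<String> columns;
    final boolean hasHeaderRow;

    CSVParser(Optional<String> listDelimiter, List<String> columns, boolean hasHeaderRow) {
        this.listDelimiter = listDelimiter;
        this.columns = columns;
        this.hasHeaderRow = hasHeaderRow;
    }
}

// Simplified stand-in for the parse spec.
class CSVParseSpec {
    final String listDelimiter;
    final List<String> columns;
    final boolean hasHeaderRow;

    CSVParseSpec(String listDelimiter, List<String> columns, boolean hasHeaderRow) {
        this.listDelimiter = listDelimiter;
        this.columns = columns;
        this.hasHeaderRow = hasHeaderRow;
    }

    // No if/else needed: pass the flag through directly.
    CSVParser makeParser() {
        return new CSVParser(Optional.ofNullable(listDelimiter), columns, hasHeaderRow);
    }
}
```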
```diff
  @JsonProperty("listDelimiter") String listDelimiter,
- @JsonProperty("columns") List<String> columns
+ @JsonProperty("columns") List<String> columns,
+ @JsonProperty("firstRowIsHeader") boolean firstRowIsHeader
  );
  retVal.setFieldNames(columns);
  return retVal;
+ if (firstRowIsHeader) {
```
```java
private final Function<String, Object> valueFunction;

private ArrayList<String> fieldNames = null;
private boolean firstRowIsHeader = false;
```
```java
private final au.com.bytecode.opencsv.CSVParser parser = new au.com.bytecode.opencsv.CSVParser();

private ArrayList<String> fieldNames = null;
private boolean firstRowIsHeader = false;
```
```java
{
  ParserUtils.validateFields(fieldNames);
  this.fieldNames = Lists.newArrayList(fieldNames);
  if (fieldNames != null) {
```
Why ignore null `fieldNames` here?
```java
@Override
public void reset()
{
  hasParsedHeader = !firstRowIsHeader;
```
This could just be `hasParsedHeader = false`, right? Seems simpler.
```java
InputRow row = delegateFirehose.nextRow();

if (row == null) {
  continue;
```
Please include a comment about why.
I think this should be removed, because ReplayableFirehose is able to simply replay the original data without filtering any rows. Index tasks should decide whether the given row should be skipped or not.
@gianm I realized that this is to avoid marshalling nulls in ReplayableFirehose, because Jackson doesn't seem to support it. I think we have two options:
- Remove ReplayableFirehose. It will be removed in #4193 (Add PrefetchableTextFilesFirehoseFactory for cloud storage types) anyway.
- Skip null rows in ReplayableFirehose as well.
I think the first option is better. What do you think?
I think since we're planning to remove ReplayableFirehose in #4193, and it's currently just an implementation detail of the IndexTask, we could skip nulls here for now, and expect the class to be removed completely later.
Agree. I'll add a comment here.
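The agreed interim behavior (skip null rows, with a comment explaining why) might look roughly like this sketch; the method and the string-based row type are simplified stand-ins for the firehose code, not Druid's actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NullSkippingReplay {
    // Collects the non-null rows a replaying firehose would record.
    static List<String> replayableRows(Iterator<String> delegateFirehose) {
        List<String> rows = new ArrayList<>();
        while (delegateFirehose.hasNext()) {
            String row = delegateFirehose.next();
            if (row == null) {
                // Null rows are skipped here because Jackson cannot
                // marshal them for replay; #4193 is expected to remove
                // this class entirely, so this is an interim measure.
                continue;
            }
            rows.add(row);
        }
        return rows;
    }
}
```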
@gianm @jihoonson updated
@fjy The unit tests are still failing. Please run them locally.
@gianm passing now
gianm
left a comment
Code LGTM. I have some comments on the docs, which I think will be confusing as written.
> The `columns` field must match the columns of your input data in the same order.
>
> #### CSV Index Tasks
People aren't going to understand what this means. I'd suggest being much more explicit. A sentence like this should appear:

> The `hasHeaderRow` parameter is only supported for tasks of type "index", and is ignored for any other type, including any realtime or Hadoop-based tasks.
> If your file does not have a header as the first line of the file, you must set the `columns` field and ensure that the order of the fields matches the columns of your input data in the same order.
> If your file does have a header, you can set a field called `hasHeaderRow` to true, and do not include the `columns` key.
>
> Be sure to change the `delimiter` to the appropriate delimiter for your data. Like CSV, you must specify the columns and which subset of the columns you want indexed.
This delimiter note applies to all kinds of tasks, but being under this header suggests it only applies to "index" tasks.
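For context, an ingestion-spec fragment using the parameter under discussion might look like this (a sketch only: `hasHeaderRow` is the name proposed in this review, the surrounding field names follow Druid's CSV parseSpec conventions, and all values are illustrative):

```json
{
  "parseSpec": {
    "format": "csv",
    "hasHeaderRow": true,
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": [] }
  }
}
```

With `hasHeaderRow` set to true, the `columns` key is omitted and the column names are taken from the first line of each file.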
Tagged Design Review as we're altering an interface and adding a config parameter.
@jihoonson let me know when I should close this PR
Closed in favor of #4254
No description provided.