Add support for headers and skipping thereof for CSV and TSV#4254
jon-wei merged 22 commits into apache:master
Conversation
gianm
left a comment
@jihoonson, in addition to the comments, could you please merge the changes from #4198 into this PR as well? They seem related and there'd be conflicts between the two patches otherwise. I spoke with @fjy offline and he would close #4198 if you did merge that patch into this one.
```java
{
  private final String listDelimiter;
  private final List<String> columns;
  private final Integer maxNumSkipHeadRows;
```
Let's make this an `int` and default to 0 (which Jackson would do anyway).

Similar comment on the other parse specs.
```diff
 @JsonProperty("listDelimiter") String listDelimiter,
-@JsonProperty("columns") List<String> columns
+@JsonProperty("columns") List<String> columns,
+@JsonProperty("maxNumSkipHeadRows") Integer maxNumSkipHeadRows
```
How about skipHeaderRows? (Similar to the naming of hasHeaderRow of #4198).
```java
this.lineIterators = lineIterators;
this.parser = parser;
final ParseSpec parseSpec = parser.getParseSpec();
if (parseSpec instanceof CSVParseSpec) {
```
Hmm, let's think of a way to do this without instanceof.
In #4198 the strategy is for the parser to return null for rows we want to skip. Would that work here too?
Hmm. Returning nulls looks like it would affect query results in Druid. If so, I think returning nulls and skipping rows are different things. For example, a query counting all rows would return different results.
Ah, I see.. But I'm still not sure that's a better way. I prefer not to use nulls to represent something meaningful, because it's difficult to tell whether a null value is an expected result or an error. Is it for sharing some common code paths?
I think the rationale was that we want to avoid instanceof (since it's brittle) and we want to avoid changing the signature of the Parser.parse method (since it's in druid-api and those interfaces are meant to be stable). So returning nulls to signal "skip me" accomplishes those.
I think it's ok for a Parser to behave that way, since it's supposed to throw ParseException if there was really a parse error. Returning null to signal "skip me" seems reasonable.
I'm definitely open to other suggestions though. If you think it's best to change Parser.parse we could perhaps deprecate it and add a new method that returns something like ParseResult.
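To make the null-as-skip protocol concrete, here is a minimal sketch of the idea under discussion. The interface and class names (`LineParser`, `SkippingCsvParser`, `parseToMap`) are simplified stand-ins, not the actual druid-api signatures: the parser returns null for rows it wants skipped, and the caller filters those out rather than treating them as data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for druid-api's Parser interface; names are illustrative.
interface LineParser
{
  // Returns null to signal "skip this row"; a real parse error would throw instead.
  Map<String, Object> parseToMap(String line);
}

class SkippingCsvParser implements LineParser
{
  private final List<String> columns;
  private final int skipHeaderRows;
  private int rowsSeen = 0;

  SkippingCsvParser(List<String> columns, int skipHeaderRows)
  {
    this.columns = columns;
    this.skipHeaderRows = skipHeaderRows;
  }

  @Override
  public Map<String, Object> parseToMap(String line)
  {
    if (rowsSeen++ < skipHeaderRows) {
      return null; // header row: signal "skip me" rather than throwing
    }
    final String[] values = line.split(",");
    final Map<String, Object> row = new HashMap<>();
    for (int i = 0; i < columns.size() && i < values.length; i++) {
      row.put(columns.get(i), values[i]);
    }
    return row;
  }
}

public class NullSkipDemo
{
  public static List<Map<String, Object>> parseAll(LineParser parser, List<String> lines)
  {
    final List<Map<String, Object>> rows = new ArrayList<>();
    for (String line : lines) {
      final Map<String, Object> row = parser.parseToMap(line);
      if (row != null) { // the caller treats null as "skip", not as data
        rows.add(row);
      }
    }
    return rows;
  }

  public static void main(String[] args)
  {
    final List<Map<String, Object>> rows = parseAll(
        new SkippingCsvParser(Arrays.asList("ts", "value"), 1),
        Arrays.asList("ts,value", "2017-05-01,10", "2017-05-02,20")
    );
    System.out.println(rows.size()); // 2: the header row was skipped, not counted
  }
}
```

This illustrates jihoonson's counting concern too: only the caller's filtering decides whether skipped rows disappear before aggregation, which is why the two sides debate whether null means "skip" or "error".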
Ah, yeah. I was thinking that ParseSpec should describe the behavior of the parser, which should be consistent no matter what type of index task is used. That would be more intuitive and reduce user mistakes. Even though ParseSpec is in druid-api, I think we need to support consistent behavior if it is provided by default. What do you think?
Is your idea that an indexing mechanism that doesn't support hasHeaderRow / skipHeaderRows would just ignore them completely? Maybe we could achieve that by having a method like parser.startFile() that FileIteratingFirehose calls, but none of the other Parser callers call. And if it's not called, the Parsers would ignore those header row related fields.
Or were you thinking of a different kind of smarter behavior?
Well, I thought we could throw an error, or at least log that hasHeaderRow / skipHeaderRows will be ignored, so that users can figure out what happened by reading their logs.
If we do the parser.startFile() thing, then I suppose a Parser could throw an error if it has hasHeaderRow / skipHeaderRows set, and has its parse() method called without an earlier startFile() call. Since that would mean it's being used in "stream mode" with parameters that only make sense for files.
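The guard described here can be sketched as follows. This is illustrative only: `FileAwareParser` and `HeaderAwareParser` are hypothetical names, not the actual druid-api `Parser` interface. A file-based caller invokes `startFileFromBeginning()` before parsing; if `parse()` is called without it while a file-only option is set, the parser fails loudly instead of silently ignoring the option.

```java
// Sketch of the startFileFromBeginning() guard idea; names are illustrative,
// not the actual druid-api Parser interface.
interface FileAwareParser
{
  // Called by file-based callers (e.g. FileIteratingFirehose) when a new file starts.
  default void startFileFromBeginning()
  {
  }

  String parse(String input);
}

class HeaderAwareParser implements FileAwareParser
{
  private final boolean hasHeaderRow;
  private boolean fileStarted = false;

  HeaderAwareParser(boolean hasHeaderRow)
  {
    this.hasHeaderRow = hasHeaderRow;
  }

  @Override
  public void startFileFromBeginning()
  {
    fileStarted = true;
  }

  @Override
  public String parse(String input)
  {
    if (hasHeaderRow && !fileStarted) {
      // "Stream mode" with a file-only option: fail loudly instead of
      // silently ignoring hasHeaderRow / skipHeaderRows.
      throw new UnsupportedOperationException("hasHeaderRow requires a file-based firehose");
    }
    return input;
  }
}

public class StartFileDemo
{
  public static void main(String[] args)
  {
    final HeaderAwareParser fileParser = new HeaderAwareParser(true);
    fileParser.startFileFromBeginning(); // file mode: ok
    System.out.println(fileParser.parse("a,b"));

    final HeaderAwareParser streamParser = new HeaderAwareParser(true);
    try {
      streamParser.parse("a,b"); // stream mode: never saw startFileFromBeginning()
    }
    catch (UnsupportedOperationException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The default no-op method keeps stream-oriented callers compiling unchanged, while parsers that care about file boundaries can opt into the check.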
Ah I see. It sounds good. I'll update this patch soon.
Thanks!
```java
 */
public class FileIteratingFirehose implements Firehose
{
  private static final int DEFAULT_NUM_SKIP_HEAD_ROWS = 0;
```
This shouldn't be necessary if the parsers default to skip 0.
The `columns` field must match the columns of your input data in the same order.
Additionally, you can set the number of head rows to be skipped to `maxNumSkipHeadRows`. Note that this option is effective
I would go further and say it's only effective for the non-Hadoop batch "index" task. Because it's also not useful for non-Hadoop realtime tasks; they wouldn't use FileIteratingFirehose.
Right. I updated the comment.
@jihoonson, yeah, code reviews are OK and just some doc changes were needed.
Instead, you can set the `hasHeaderRow` field to true, which makes Druid automatically extract the column information from the header.
Otherwise, you must set the `columns` field and ensure that it matches the columns of your input data in the same order.

Also, you can skip some header rows by setting `skipHeaderRows` in your parseSpec. Note that `hasHeaderRow` and `skipHeaderRows` are effective
Please include a couple clarifications.
- What happens if you provide these on a task where they aren't supported?
- If you provide both skipHeaderRows and hasHeaderRow, which is applied first? Skip first, then read a header row? Or read a header row and then skip?
Docs look good to me other than that.
Good point. I updated the doc.
```java
}

@JsonProperty("skipHeaderRows")
public Integer getSkipHeaderRows()
```
```diff
- * Parse a String into a Map.
+ * Initialize this parser for centralized batch processing of files like IndexTask.
   */
 default void startFileFromBeginning()
```
Do we really need both this and reset()? I would think that we can get rid of reset() and instead, call startFileFromBeginning() every time a new file starts. imo, reset isn't used for anything else so it's needless to keep it.
```java
this.parseSpec = parseSpec;
this.mapParser = new MapInputRowParser(parseSpec);
this.parser = parseSpec.makeParser();
parser.startFileFromBeginning();
```
StringInputRowParser is used by parser options that aren't file-oriented (you can use it on streams etc) so this isn't a good place to put this. imo, this should replace reset() and be called in places that reset() is currently called (like FileIteratingFirehose). With one addition: it needs to be called before the very first file too.
Right. I moved it to FileIteratingFirehose.
thx @jihoonson, just re-reviewed.
@gianm thanks. I addressed your comments.
leventov
left a comment
👍 for design, textual comments
```diff
 public Parser<String, Object> makeParser()
 {
-  return new CSVParser(Optional.fromNullable(listDelimiter), columns);
+  return new CSVParser(Optional.fromNullable(listDelimiter), columns, hasHeaderRow, skipHeaderRows);
```
(Optional comment) Note that `Optional` is not recommended as a method/constructor parameter: http://stackoverflow.com/a/26328555/648955.
I found some more methods that receive an `Optional` as a parameter, like `TaskLockBox.tryLock()` or `OverlordResource.asLeaderWith()`. I think it would be better to fix these all together. I'll open an issue for it.
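For reference, one common way to keep `Optional` out of constructor signatures is to accept a nullable argument at the boundary and normalize it internally. This is only a sketch of the pattern, not the actual CSVParser code; `DelimiterHolder` and the `\u0001` default (taken from the doc table below) are illustrative.

```java
import java.util.Optional;

class DelimiterHolder
{
  private static final String DEFAULT_LIST_DELIMITER = "\u0001";

  private final String listDelimiter;

  // Nullable parameter at the boundary; Optional stays an implementation detail.
  DelimiterHolder(String listDelimiter)
  {
    this.listDelimiter = Optional.ofNullable(listDelimiter).orElse(DEFAULT_LIST_DELIMITER);
  }

  String getListDelimiter()
  {
    return listDelimiter;
  }
}

public class OptionalParamDemo
{
  public static void main(String[] args)
  {
    System.out.println(new DelimiterHolder("|").getListDelimiter()); // |
    // When null is passed, the default \u0001 delimiter is used.
    System.out.println((int) new DelimiterHolder(null).getListDelimiter().charAt(0));
  }
}
```

Callers never construct an `Optional` just to pass an argument, which is the gist of the linked Stack Overflow advice.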
|Parameter|Description|Required|Default|
|---------|-----------|--------|-------|
|`keyColumn`|The name of the column containing the key|no|The first column|
|`valueColumn`|The name of the column containing the value|no|The second column|
|`hasHeaderRow`|A flag to indicate that column information can be extracted from the input files' header row|no|false|
|`skipHeaderRows`|Number of header rows to be skipped|no|0|
Just reading this doc, it's not clear how `hasHeaderRow` interacts with `skipHeaderRows`: are rows skipped first, with the (skipHeaderRows+1)-th row used as the header row, or is the very first row used as the header row, with skipHeaderRows rows skipped afterwards?
OK, it's explained in another place, but I think it should be explained here as well.
|Parameter|Description|Required|Default|
|---------|-----------|--------|-------|
|`columns`|The list of columns in the csv file|yes|`null`|
|`delimiter`|The delimiter in the file|no|tab (`\t`)|
|`listDelimiter`|The list delimiter in the file|no|(`\u0001`)|
|`hasHeaderRow`|A flag to indicate that column information can be extracted from the input files' header row|no|false|
|`skipHeaderRows`|Number of header rows to be skipped|no|0|
Otherwise, you must set the `columns` field and ensure that it matches the columns of your input data in the same order.

Also, you can skip some header rows by setting `skipHeaderRows` in your parseSpec. If both `skipHeaderRows` and `hasHeaderRow` options are set,
`skipHeaderRows` is first applied. For example, if you set `skipHeaderRows` to 2 and `hasHeaderRow` to true, Druid will
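A tiny worked example of that ordering (the data lines are illustrative, modeled on a CloudFront-style log): with `skipHeaderRows` = 2 and `hasHeaderRow` = true, the first two rows are skipped and the third row becomes the header.

```java
import java.util.Arrays;
import java.util.List;

public class HeaderOrderDemo
{
  // Skip first, then read the header: the (skipHeaderRows+1)-th line is the header row.
  public static String headerLine(List<String> lines, int skipHeaderRows, boolean hasHeaderRow)
  {
    return hasHeaderRow ? lines.get(skipHeaderRows) : null;
  }

  public static void main(String[] args)
  {
    final List<String> lines = Arrays.asList(
        "#Version: 1.0",      // skipped (comment line)
        "#Fields: date time", // skipped (comment line)
        "ts,value",           // read as the header row
        "2017-05-01,10"       // first data row
    );
    System.out.println(headerLine(lines, 2, true)); // ts,value
  }
}
```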
@leventov thanks for your review. I addressed your comments.
jon-wei
left a comment
Looks good, +1 on design review
```diff
 @JsonProperty("dimensionsSpec") DimensionsSpec dimensionsSpec,
 @JsonProperty("listDelimiter") String listDelimiter,
-@JsonProperty("columns") List<String> columns
+@JsonProperty("columns") List<String> columns,
```
Is it backwards compatible to add fields to the JSON form? Maybe we need to add a legacy constructor without those fields? Also it is used in tranquility: https://github.com/druid-io/tranquility/blob/63eb64e4cf96e62abdb9e784ce7b07c90f83ebb4/server/src/test/scala/com/metamx/tranquility/server/TranquilityServletTest.scala#L277
Just double checked this, an old-format JSON spec that doesn't have the new fields deserializes fine (gets hasHeaderRow=false, skipHeaderRows=0).
Looks like this does need a legacy constructor though.
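A minimal sketch of the legacy-constructor pattern being suggested. It is simplified: the real CSVParseSpec constructors take more arguments, and the primary constructor would carry the Jackson `@JsonCreator`/`@JsonProperty` annotations; `CsvSpecSketch` here is a hypothetical stand-in.

```java
import java.util.Arrays;
import java.util.List;

class CsvSpecSketch
{
  private final List<String> columns;
  private final boolean hasHeaderRow;
  private final int skipHeaderRows;

  // New primary constructor; in the real class this would be the @JsonCreator,
  // so an old-format JSON spec with the new properties absent still deserializes.
  CsvSpecSketch(List<String> columns, Boolean hasHeaderRow, Integer skipHeaderRows)
  {
    this.columns = columns;
    this.hasHeaderRow = hasHeaderRow != null && hasHeaderRow;
    this.skipHeaderRows = skipHeaderRows == null ? 0 : skipHeaderRows;
  }

  // Legacy constructor: keeps old call sites (e.g. tranquility's tests)
  // compiling, with the old defaults (hasHeaderRow=false, skipHeaderRows=0).
  @Deprecated
  CsvSpecSketch(List<String> columns)
  {
    this(columns, null, null);
  }

  boolean isHasHeaderRow()
  {
    return hasHeaderRow;
  }

  int getSkipHeaderRows()
  {
    return skipHeaderRows;
  }
}

public class LegacyCtorDemo
{
  public static void main(String[] args)
  {
    @SuppressWarnings("deprecation")
    final CsvSpecSketch legacy = new CsvSpecSketch(Arrays.asList("ts", "value"));
    System.out.println(legacy.isHasHeaderRow());    // false
    System.out.println(legacy.getSkipHeaderRows()); // 0
  }
}
```

The deprecated overload preserves binary and source compatibility for existing callers while steering new code toward the full constructor.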
This PR is similar to #4198, but different. With this PR, users can simply skip head rows rather than guessing the schema from the head rows. It's still useful when some log formats start with comment lines, like https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/AccessLogs.html#LogFileFormat. The `maxNumSkipHeadRows` option is effective only for non-Hadoop index tasks.