Skip to content

add TsvInputFormat#8915

Merged
jon-wei merged 8 commits intoapache:masterfrom
SEKIRO-J:tsvInputFormat
Nov 23, 2019
Merged

add TsvInputFormat#8915
jon-wei merged 8 commits intoapache:masterfrom
SEKIRO-J:tsvInputFormat

Conversation

@SEKIRO-J
Copy link
Copy Markdown
Contributor

@SEKIRO-J SEKIRO-J commented Nov 21, 2019

Description

add support for TSV format file.
add tsvInputFormat, tsvReader, and unit tests for them.

tsvInputFormat and csvInputFormat are pretty similar, I'm thinking have a parent class called seperateValueInputFormat implements InputFormat
and
tsvInputFormat extends seperateValueInputFormat, csvInputFormat extends seperateValueInputFormat

Configuration Interface Design

 "inputFormat": {
  "type": "tsv",
  "columns": [ "col1", "col2", "col3" ],
  "listDelimiter": "|",
  "findColumnsFromHeader" : true,
  "skipHeaderRows" : 3
}

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • tsvInputFormat
  • tsvReader

Copy link
Copy Markdown
Contributor

@a2l007 a2l007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being the grammar police here, but it would be better if we correct Seperate to Separate

@SEKIRO-J
Copy link
Copy Markdown
Contributor Author

Sorry for being the grammar police here, but it would be better if we correct Seperate to Separate

thank you for pointing out

if (!this.columns.isEmpty()) {
for (String column : this.columns) {
Preconditions.checkArgument(
!column.contains("tab".equals(seperator) ? "\t" : ","),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of using something like the AbstractFlatTextFormatParser.FlatTextFormat enum for this instead of hard-coding strings?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a good idea, working on that.

public int hashCode()
{
return Objects.hash(listDelimiter, columns, findColumnsFromHeader, skipHeaderRows);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like format is missing in equalsAndHashCode

this.multiValueFunction = ParserUtils.getMultiValueFunction(finalListDelimeter, Splitter.on(finalListDelimeter));
this.columns = findColumnsFromHeader ? null : columns; // columns will be overriden by header row
this.format = format;
this.parser = createOpenCsvParser(format.getDefaultDelimiter().charAt(0));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be cleaner for the format to have a method that returns a char for this

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! I read through it and have some overall thoughts. I'm just asking a bunch of questions to get a better understanding.

Overall comments:

  • Please add javadocs to the new files you've created. I think that will help clear up a lot of my confusion.
  • Could you explain why you chose to have the CSV* and TSV* classes extend the SeparateValue* classes. I like your idea of sharing the logic between them in a common class, but I think composition might be an easier approach to follow (ie CSV* and TSV* classes have a delegate SeparateValue* object that is instantiated differently based on which class is calling it)

CSV(","),
TSV("\t");

private final String defaultDelimiter;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why defaultDelimiter Should this just be delimiter?


public String getLiteral()
{
return ",".equals(defaultDelimiter) ? "comma" : "tab";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: IMO it'd be easier to read if the enum had to define both the delimiter and the literal

CSV(",", "comma")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point

private final FlatTextFormat format;

@JsonCreator
public SeparateValueInputFormat(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class ever instantiated via json? I think it's just a base class if I'm reading the PR correctly - so I think we can remove all the Json annotations. And I think the constructor should only be package private

);
}

public InputEntityReader createReader(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I find this a little confusing. The function to create a reader above creates a SeparateValueReader, but if you pass in a format it constructs either a CSVReader or a TSVReader. I think I'm finding it hard to wrap my head around the differences

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I found that confusing too, will refactor that code

}

@VisibleForTesting
public TsvInputFormat(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need this public constructor to be visible for testing? If it was package private, that might have been ok, but it's hard to enforce public constructors are not used elsewhere in the main source code.

Maybe it's better if the tests are explicit about passing hasHeaderRow as null when instantiating this object

* It implements the common logic between {@link CsvInputFormat} and {@link TsvInputFormat}
* Should never be instantiated
*/
public class SeparateValueInputFormat implements InputFormat
Copy link
Copy Markdown
Contributor

@jon-wei jon-wei Nov 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be an abstract class and you can leave the "Should never be instantiated" comment out

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after CI

@jon-wei jon-wei merged commit 0514e56 into apache:master Nov 23, 2019
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
@jihoonson jihoonson mentioned this pull request Jan 17, 2020
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants