Skip to content

add customize separator for TSV inputFormat#8993

Merged
jon-wei merged 8 commits intoapache:masterfrom
SEKIRO-J:TSVInputFormatUpdate
Dec 9, 2019
Merged

add customize separator for TSV inputFormat#8993
jon-wei merged 8 commits intoapache:masterfrom
SEKIRO-J:TSVInputFormatUpdate

Conversation

@SEKIRO-J
Copy link
Copy Markdown
Contributor

@SEKIRO-J SEKIRO-J commented Dec 5, 2019

Description

Fixed the bug ...

quick fix for #8915 to add support for delimiter field for TSV InputFormat
described here:
https://druid.apache.org/docs/latest/ingestion/data-formats.html#tsv-delimited-parsespec

for example,
user could set delimiter: "|" to use | as the delimiter for TSV input format.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.

public TsvInputFormat(
@JsonProperty("columns") @Nullable List<String> columns,
@JsonProperty("listDelimiter") @Nullable String listDelimiter,
@JsonProperty("delimiter") @Nullable String delimiter,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modifying TsvInputFormat to support a custom delimiter is awkward, since "TSV" implies that the delimiter is a tab. How about using an approach similar to DelimitedParser and AbstractFlatTextFormatParser instead? For example, the current implementation here is introducing regressions not present in DelimitedParser (e.g., delimeter and listDelimeter should be checked to not have the same value).

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson Dec 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I would suggest to rename the type of this input format. There should be no compatibility issue since this input format hasn't been released yet. How about dsv or delimited?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://druid.apache.org/docs/latest/ingestion/data-formats.html#tsv-delimited-parsespec, However in this documentation, format: tsv implies user could use any delimiter, and default is tab.
How about this design: rename SeparateValueInputFormat to DelimitedInputFormat, when user say tsv, instantiate a DelimitedInputFormat object, when user say csv, instantiate CSVInputFormat which extends DelimitedInputFormat

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its original name was not good. We don't necessarily stick to the existing name, but can change it if there's a better one.

How about this design: rename SeparateValueInputFormat to DelimitedInputFormat, when user say tsv, instantiate a DelimitedInputFormat object, when user say csv, instantiate CSVInputFormat which extends DelimitedInputFormat

Do you mean you want to make DelimitedInputFormat as a concrete class? What is difference between DelimitedInputFormat and CSVInputFormat in that case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DelimitedInputFormat would be used to support any delimiter (default is tab), ie: when user say tsv as format.
CSVInputFormat would be used to support strictly comma as delimiter, ie: when user say csv as format

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it sounds good to me. But I think its type should be something else instead of tsv.

{
if (",".equals(delimiter)) {
return Format.CSV;
} else if (delimiter != null && delimiter.length() > 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since only single character delimiters are supported, suggest adding a check for this to the DelimitedInputFormat constructor that throws an exception if a multichar delimiter is provided, otherwise the behavior might be mysterious to users (specifying a delimiter but it's silently ignored)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solved.


this.format = getFormat(delimiter);
Preconditions.checkArgument(
delimiter == null || delimiter.length() == 1,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is created with a "tsv" JSON type and delimiter is annotated with @Nullable, delimiter == null shouldn't be in this check.

If delimiter is null here, suggest defining a default delimiter (the tab) and setting delimiter to that here, and removing the delimiter == null clause in the else if in getFormat

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry, I misread the change there in the Preconditions check (it would actually allow null) but I would still suggest clearing the null early and setting it to a default value

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI

@jon-wei jon-wei merged commit ca77d57 into apache:master Dec 9, 2019
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants