add customize separator for TSV inputFormat#8993
| public TsvInputFormat( | ||
| @JsonProperty("columns") @Nullable List<String> columns, | ||
| @JsonProperty("listDelimiter") @Nullable String listDelimiter, | ||
| @JsonProperty("delimiter") @Nullable String delimiter, |
There was a problem hiding this comment.
Modifying TsvInputFormat to support a custom delimiter is awkward, since "TSV" implies that the delimiter is a tab. How about using an approach similar to DelimitedParser and AbstractFlatTextFormatParser instead? For example, the current implementation here introduces regressions not present in DelimitedParser (e.g., delimiter and listDelimiter should be checked to not have the same value).
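A minimal sketch of the kind of check mentioned above, assuming both values have already been resolved to non-null strings. The class name `DelimiterChecks`, the method name `checkDistinct`, and the error message are hypothetical, chosen for illustration; this is not the code in DelimitedParser.

```java
import java.util.Objects;

// Hypothetical helper for illustration only; not taken from the Druid code base.
final class DelimiterChecks
{
  private DelimiterChecks()
  {
  }

  // Rejects a configuration in which the field delimiter and the
  // multi-value (list) delimiter are the same string.
  static void checkDistinct(String delimiter, String listDelimiter)
  {
    Objects.requireNonNull(delimiter, "delimiter");
    Objects.requireNonNull(listDelimiter, "listDelimiter");
    if (delimiter.equals(listDelimiter)) {
      throw new IllegalArgumentException(
          "delimiter and listDelimiter must differ, but both were [" + delimiter + "]"
      );
    }
  }
}
```

If the two values were equal, a row could not be split unambiguously into fields and list elements, which is why failing fast in the constructor seems preferable to a confusing parse result later.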
There was a problem hiding this comment.
Good point. I would suggest renaming the type of this input format. There should be no compatibility issue since this input format hasn't been released yet. How about dsv or delimited?
There was a problem hiding this comment.
https://druid.apache.org/docs/latest/ingestion/data-formats.html#tsv-delimited-parsespec
However, in this documentation, format: tsv implies that the user can use any delimiter, with tab as the default.
How about this design: rename SeparateValueInputFormat to DelimitedInputFormat. When the user specifies tsv, instantiate a DelimitedInputFormat object; when the user specifies csv, instantiate CSVInputFormat, which extends DelimitedInputFormat.
There was a problem hiding this comment.
I think its original name was not good. We don't necessarily have to stick to the existing name; we can change it if there's a better one.
> How about this design: rename SeparateValueInputFormat to DelimitedInputFormat, when user say tsv, instantiate a DelimitedInputFormat object, when user say csv, instantiate CSVInputFormat which extends DelimitedInputFormat
Do you mean you want to make DelimitedInputFormat a concrete class? What is the difference between DelimitedInputFormat and CSVInputFormat in that case?
There was a problem hiding this comment.
DelimitedInputFormat would be used to support any delimiter (default is tab), i.e., when the user specifies tsv as the format.
CSVInputFormat would be used to support strictly a comma as the delimiter, i.e., when the user specifies csv as the format.
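The proposed split could be sketched roughly as follows. The class names carry a `Sketch` suffix to make clear they are hypothetical stand-ins, not the classes in this PR; the naive `String.split` call is also an assumption for illustration and ignores quoting, headers, and list delimiters.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the proposed DelimitedInputFormat:
// accepts any delimiter and falls back to a tab when none is given.
class DelimitedInputFormatSketch
{
  static final String DEFAULT_DELIMITER = "\t";

  private final String delimiter;

  DelimitedInputFormatSketch(String delimiter)
  {
    this.delimiter = delimiter == null ? DEFAULT_DELIMITER : delimiter;
  }

  String getDelimiter()
  {
    return delimiter;
  }

  // Naive field split. Pattern.quote keeps characters such as "|" from being
  // read as regex syntax; the -1 limit preserves trailing empty fields.
  String[] splitLine(String line)
  {
    return line.split(Pattern.quote(delimiter), -1);
  }
}

// Hypothetical stand-in for the proposed CSVInputFormat:
// the same behavior, with the delimiter pinned to a comma.
class CsvInputFormatSketch extends DelimitedInputFormatSketch
{
  CsvInputFormatSketch()
  {
    super(",");
  }
}
```

Under this shape, the only difference between the two classes is whether the delimiter is configurable, which is the distinction described above.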
There was a problem hiding this comment.
Thanks, it sounds good to me. But I think its type should be something else instead of tsv.
| { | ||
| if (",".equals(delimiter)) { | ||
| return Format.CSV; | ||
| } else if (delimiter != null && delimiter.length() > 0) { |
There was a problem hiding this comment.
Since only single-character delimiters are supported, I suggest adding a check for this to the DelimitedInputFormat constructor that throws an exception if a multi-character delimiter is provided. Otherwise the behavior might be mysterious to users (a delimiter is specified but silently ignored).
| this.format = getFormat(delimiter); | ||
| Preconditions.checkArgument( | ||
| delimiter == null || delimiter.length() == 1, |
There was a problem hiding this comment.
Since this is created with a "tsv" JSON type and delimiter is annotated with @Nullable, delimiter == null shouldn't be in this check.
If delimiter is null here, I suggest defining a default delimiter (the tab) and setting delimiter to that here, and removing the delimiter == null clause in the else if in getFormat.
There was a problem hiding this comment.
Ah, sorry, I misread the change there in the Preconditions check (it would actually allow null), but I would still suggest clearing the null early and setting it to a default value.
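One way to read this suggestion, as a hedged sketch: resolve a null delimiter to the tab default first, then apply the single-character check to the resolved value, so later code never has to handle null. The class `DelimiterDefaults` and method `resolveDelimiter` are hypothetical names, and plain Java exceptions are used here in place of the Preconditions call in the diff.

```java
// Hypothetical helper for illustration only; names are not from the PR.
final class DelimiterDefaults
{
  static final String DEFAULT_DELIMITER = "\t";

  private DelimiterDefaults()
  {
  }

  // Clears the null early by substituting the default, then validates length.
  static String resolveDelimiter(String delimiter)
  {
    String resolved = delimiter == null ? DEFAULT_DELIMITER : delimiter;
    if (resolved.length() != 1) {
      throw new IllegalArgumentException(
          "The delimiter should be a single character, but was [" + resolved + "]"
      );
    }
    return resolved;
  }
}
```

With the null handled up front, a method like getFormat would only need to compare the resolved value against "," and would not need its own null clause.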
Description
Fixed the bug ...
Quick fix for #8915 to add support for the delimiter field for the TSV InputFormat described here:
https://druid.apache.org/docs/latest/ingestion/data-formats.html#tsv-delimited-parsespec
For example, a user could set delimiter: "|" to use | as the delimiter for the TSV input format.
This PR has: