Fix a bug for CSVParser/DelimitedParser when empty column exists in the header row #4504
fjy merged 2 commits into apache:master
{
  return (input) -> {
    if (input.contains(listDelimiter)) {
      return Lists.newArrayList(
private int skippedHeaderRows;
private boolean supportSkipHeaderRows;

static Function<String, Object> getValueFunction(
Static should go above instance fields
}

public AbstractFlatTextFormatParser(
    final Optional<String> listDelimiter,
Per #4275, maybe make it accept a plain String? I know the subclasses also take Optional, but in a PR you can at least keep the extent of this problem from growing.
I had almost forgotten that issue. Yeah, this is ugly. I removed Optional from the constructor.
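For readers following the thread, here is a minimal sketch of what a constructor taking a plain nullable String instead of Optional might look like. The class name `FlatTextParserSketch` and the value of the default delimiter are assumptions for illustration, not the code merged in this PR; the real default lives in `Parsers.DEFAULT_LIST_DELIMITER`.

```java
// Hypothetical stand-in for the abstract parser discussed above; not the merged code.
class FlatTextParserSketch {
  // Assumed placeholder value; the real constant is Parsers.DEFAULT_LIST_DELIMITER.
  static final String DEFAULT_LIST_DELIMITER = "\u0001";

  private final String listDelimiter;

  // Takes a nullable String rather than Optional<String>; null means "use the default".
  FlatTextParserSketch(final String listDelimiter) {
    this.listDelimiter = listDelimiter != null ? listDelimiter : DEFAULT_LIST_DELIMITER;
  }

  String getListDelimiter() {
    return listDelimiter;
  }
}
```

Callers that previously wrapped the argument in Optional would pass the value or null directly, which keeps Optional out of the parameter list.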
    final int maxSkipHeaderRows
)
{
  this.listDelimiter = listDelimiter.isPresent() ? listDelimiter.get() : Parsers.DEFAULT_LIST_DELIMITER;

 *
 * @return column name generating function
 */
public static IntFunction<String> getDefaultColumnNameGenerator()
IMO it's overengineering to prepare for different "strategies" of column name generation; it could just be a static method accepting an int. Both generateFieldNames() and the abstract parser can use this method.
Changed. BTW, the name means "defaultColumnName" generator rather than default "columnNameGenerator".
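As a rough illustration of the suggestion above, a single static method can replace the generator-returning factory while still being usable wherever an `IntFunction<String>` is expected. The class name `ColumnNames`, the method name `getDefaultColumnName`, and the `column_` prefix with a 1-based ordinal are assumptions taken from the PR description's example, not necessarily the merged signatures.

```java
import java.util.function.IntFunction;

// Hypothetical holder class; the names here are illustrative only.
class ColumnNames {
  // Builds the default name for the column at the given 1-based ordinal.
  static String getDefaultColumnName(final int ordinal) {
    return "column_" + ordinal;
  }

  // If a caller still needs a function object, a method reference is enough.
  static IntFunction<String> asFunction() {
    return ColumnNames::getDefaultColumnName;
  }
}
```

With this shape there is no separate "strategy" to configure; code that wants a function object can take the method reference at the call site.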
@leventov thanks. I addressed your comments.
    ParserUtils.nullEmptyStringFunction
  )
);
final List retVal = StreamSupport.stream(listSplitter.split(input).spliterator(), false)
If you don't care about perf, you could use splitToList() for shorter code without StreamSupport.
👍
…he header row (apache#4504)
* Fix a bug when empty column exists in header row
* Address comments
If some column names are empty in the header row, the corresponding field names become empty strings, which causes unexpected behavior in index tasks. With this patch, empty columns get an auto-generated name based on their ordinal position in the header row. For example, given a header row of timestamp,foo,,bar, the field names will be "timestamp", "foo", "column_3", and "bar".

Also, I did some refactoring of CSVParser and DelimitedParser to reduce code duplication.
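The naming rule described above can be sketched as follows. This is a minimal illustration under assumptions, not the patch itself: the class `HeaderNames` and method `fillEmptyNames` are hypothetical, and the `column_` prefix with a 1-based ordinal is inferred from the example in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the actual parser classes.
class HeaderNames {
  // Returns a copy of the header in which null or empty entries are replaced
  // by "column_<ordinal>", where the ordinal is the 1-based position in the row.
  static List<String> fillEmptyNames(final List<String> header) {
    final List<String> names = new ArrayList<>(header.size());
    for (int i = 0; i < header.size(); i++) {
      final String name = header.get(i);
      names.add(name == null || name.isEmpty() ? "column_" + (i + 1) : name);
    }
    return names;
  }
}
```

Calling `HeaderNames.fillEmptyNames(Arrays.asList("timestamp", "foo", "", "bar"))` yields the four names from the example. The sketch does not guard against a header that already contains a literal `column_3`.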