Skip to content

Fix a bug for CSVParser/DelimitedParser when empty column exists in the header row#4504

Merged
fjy merged 2 commits intoapache:masterfrom
jihoonson:fix-empty-column
Jul 7, 2017
Merged

Fix a bug for CSVParser/DelimitedParser when empty column exists in the header row#4504
fjy merged 2 commits intoapache:masterfrom
jihoonson:fix-empty-column

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson commented Jul 5, 2017

If some column names are empty in the header row, the field name becomes an empty string, thereby causing unexpected behaviors in index tasks. With this patch, empty columns have an auto-generated name based on their ordinal number in the header row. For example, given a header row of timestamp,foo,,bar, the field names will be "timestamp", "foo", "column_3", and "bar".
Also, I did some refactoring for CSVParser and DelimitedParser to reduce code duplication.


This change is Reviewable

{
return (input) -> {
if (input.contains(listDelimiter)) {
return Lists.newArrayList(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use Streams

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

private int skippedHeaderRows;
private boolean supportSkipHeaderRows;

static Function<String, Object> getValueFunction(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static should go above instance fields

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved.

}

public AbstractFlatTextFormatParser(
final Optional<String> listDelimiter,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to #4275, maybe make it to accept simple String? I know subclasses also have Optional, but in a PR you can keep the extend of this problem, not increasing it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost have forgotten that issue.. Yeah, this is ugly. I removed Optional from the constructor.

final int maxSkipHeaderRows
)
{
this.listDelimiter = listDelimiter.isPresent() ? listDelimiter.get() : Parsers.DEFAULT_LIST_DELIMITER;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's orElse()

*
* @return column name generating function
*/
public static IntFunction<String> getDefaultColumnNameGenerator()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's overengineering to to prepare for different "strategies" of column name generation, it could be just a static method accepting int. generateFieldNames() and abstract parser use this method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. BTW, the name means "defaultColumnName" generator rather than default "columnNameGenerator".

@jihoonson
Copy link
Copy Markdown
Contributor Author

@leventov thanks. I addressed your comments.

ParserUtils.nullEmptyStringFunction
)
);
final List retVal = StreamSupport.stream(listSplitter.split(input).spliterator(), false)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't care about perf you could use splitAsList() for shorter code without StreamSupport

@fjy fjy added the Bug label Jul 7, 2017
@fjy
Copy link
Copy Markdown
Contributor

fjy commented Jul 7, 2017

👍

@fjy fjy merged commit 8ed25ac into apache:master Jul 7, 2017
@fjy fjy added this to the 0.10.1 milestone Jul 7, 2017
jihoonson added a commit to jihoonson/druid that referenced this pull request Jul 7, 2017
…he header row (apache#4504)

* Fix a bug when empty column exists in header row

* Address comments
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants