Skip empty files for local, hdfs, and cloud input sources #9450

Merged
jihoonson merged 10 commits into apache:master from jihoonson:index-non-empty-only
Mar 4, 2020

@jihoonson
Contributor

Description

This PR modifies every input source except the HTTP input source to skip empty files. It also fixes two bugs:


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • been tested in a test Druid cluster.
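
For readers unfamiliar with the change, the core idea of skipping empty files can be sketched roughly as follows. This is a minimal illustration for local files only, not the code from this PR: the class `NonEmptyFileFilter` and the method `skipEmptyFiles` are hypothetical names, and the real input sources may apply the check differently (for example, on object sizes reported by a cloud listing).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not a class from the Druid code base.
final class NonEmptyFileFilter
{
  private NonEmptyFileFilter() {}

  // Returns the regular files in `candidates` whose size is greater than zero,
  // preserving the input order. Empty files and non-regular paths are dropped.
  static List<Path> skipEmptyFiles(List<Path> candidates)
  {
    final List<Path> nonEmpty = new ArrayList<>();
    for (Path path : candidates) {
      try {
        if (Files.isRegularFile(path) && Files.size(path) > 0) {
          nonEmpty.add(path);
        }
      }
      catch (IOException e) {
        // Surface I/O problems instead of silently treating the file as empty.
        throw new UncheckedIOException(e);
      }
    }
    return nonEmpty;
  }
}
```

Filtering before the files are handed out as splits means a task never receives a split that contains nothing to read, which is presumably the motivation for doing it at the input-source level.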

public class MaxSizeSplitHintSpec implements SplitHintSpec
{
  public static final String TYPE = "maxSize";
  private static final Logger LOG = new Logger(MaxSizeSplitHintSpec.class);

Contributor

Unused?

Contributor Author

Oops. Removed.


import com.fasterxml.jackson.databind.ObjectMapper;
import nl.jqno.equalsverifier.EqualsVerifier;
import org.apache.commons.compress.utils.Lists;

Contributor

Should this be the one from guava instead? (same for MaxSizeSplitHintSpecTest)

Contributor Author

Oops. Fixed.

Comment on lines +121 to +122
prepareNextRequest();
fetchNextBatch();

Contributor

This branch is not covered by unit tests

Contributor Author

Added a test.

for (int i = 0; i < 10; i++) {
filesInBaseDir.add(File.createTempFile("local-input-source", ".data", baseDir));
final File file = File.createTempFile("local-input-source", ".data", baseDir);
try (FileWriter writer = new FileWriter(file)) {

Contributor

The forbidden-apis check is flagging this: java.io.FileWriter [Uses default charset]

Contributor Author

Fixed.
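
For context, one common way to satisfy a "Uses default charset" check is to open the writer with an explicit charset. The sketch below shows that approach; it is not necessarily what the PR ended up doing, and `Utf8TestFiles` and `createTempFileWithContent` are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical test helper for illustration; not code from this PR.
final class Utf8TestFiles
{
  private Utf8TestFiles() {}

  // Creates a temp file under `baseDir` and writes `content` to it as UTF-8,
  // rather than relying on the default charset as new FileWriter(File) does.
  static Path createTempFileWithContent(Path baseDir, String content) throws IOException
  {
    final Path file = Files.createTempFile(baseDir, "local-input-source", ".data");
    try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
      writer.write(content);
    }
    return file;
  }
}
```

Pinning the charset keeps the bytes written by the test identical across machines with different locale settings.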


package org.apache.druid.storage.azure;

import com.google.api.client.util.Lists;

Contributor

You'll need to update the pom to add this dependency:

<dependency>
  <groupId>com.google.http-client</groupId>
  <artifactId>google-http-client</artifactId>
  <scope>test</scope>
</dependency>

https://travis-ci.org/apache/druid/jobs/657595721#L2090

Contributor Author

Ah, this was a mistake. I'm not sure why IntelliJ keeps adding the wrong one. Fixed it now.

@jihoonson jihoonson merged commit 9466ac7 into apache:master Mar 4, 2020
jihoonson added a commit to jihoonson/druid that referenced this pull request Mar 4, 2020
* Skip empty files for local, hdfs, and cloud input sources

* split hint spec doc

* doc for skipping empty files

* fix typo; adjust tests

* unnecessary fluent iterable

* address comments

* fix test

* use the right lists

* fix test

* fix test
@jihoonson jihoonson added this to the 0.18.0 milestone Mar 26, 2020