Skip to content

Make intermediate store for shuffle tasks an extension point#11492

Merged
maytasm merged 6 commits intoapache:masterfrom
maytasm:IMPLY-8622
Jul 27, 2021
Merged

Make intermediate store for shuffle tasks an extension point#11492
maytasm merged 6 commits intoapache:masterfrom
maytasm:IMPLY-8622

Conversation

@maytasm
Copy link
Copy Markdown
Contributor

@maytasm maytasm commented Jul 23, 2021

Make intermediate store for shuffle tasks an extension point

Description

This PR makes IntermediaryDataManager and ShuffleClient an interface with method that can be implements by extension to customize intermediate storage location for shuffle tasks. For example, implementation can be added via extensions to support different cloud storages to store intermediate data for shuffle tasks.

More details: #11297

Note that IntermediaryDataManager has been renamed to LocalIntermediaryDataManager with a few additional changes:

  • Update Javadoc
  • Add @Override for methods that are now interfaced by (new) IntermediaryDataManager interface class.
  • findPartitionFile changed to return Optional instead of File

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jul 23, 2021

This pull request introduces 1 alert when merging caabe48 into c98e7c3 - view on LGTM.com

new alerts:

  • 1 for Cross-site scripting

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jul 23, 2021

This pull request introduces 1 alert when merging 3a7a2d6 into c98e7c3 - view on LGTM.com

new alerts:

  • 1 for Cross-site scripting

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jul 26, 2021

This pull request introduces 1 alert when merging 6086f65 into fcb908d - view on LGTM.com

new alerts:

  • 1 for Cross-site scripting

}
).build();
try {
long size = partitionFile.get().size();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will potentially read the entire file to get the size. I believe partitionFile.length() is just a metadata lookup on the file to get the length.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partitionFile is of type Optional<ByteSource>. The get is to get the ByteSource from the wrapping Optional class.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the default implementation of ByteSource#size() which opens the stream, reads it and counts the bytes. It looks like FileByteSource has overridden that implementation to return the length of the File without reading the bytes.

At this point, I'm mostly concerned about whether someone can trip over this by changing the implementation and not realizing that this operation should be fast. Anyways, that's a problem for another time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LocalIntermediaryDataManager uses Files.asByteSource() which returns FileByteSource which do file.length() which is the same as current behavior. The extension (that implements IntermediaryDataManager) can extends ByteSource and provide a method of returning size efficiently. I added javadoc to indicate this

JsonConfigProvider.bind(binder, "druid", DruidNode.class, Parent.class);
JsonConfigProvider.bind(binder, "druid.worker", WorkerConfig.class);

CliPeon.configureIntermediaryData(binder);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any integration tests that verify shuffle with indexers continue to work after this change? I haven't looked at the existing integration tests closely

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many ITs that run indexers with the new config unset (which basically fallback to using "local" storage for storing intermediary segments via LocalIntermediaryDataManager). Specifically, input source integration test with Indexer runs ingestion with Hashed partitioning and maxNumConcurrentSubTasks=10, which would run the ingestion in two phases (first phase which persist to local using LocalIntermediaryDataManager and second phase which reads segments from first phase). Similarly, there are also some other ITs in compaction/auto compaction that uses Hashed partitioning.

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Jul 26, 2021

The LGTM is a false positive. We would only return the supervisorTaskId (from the request) in the response if the request succeeded. For the request to succeeded, we would have verified that the supervisorTaskId is valid via IdUtils.validateId("supervisorTaskId", supervisorTaskId);

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jul 26, 2021

This pull request introduces 1 alert when merging aa2a2c2 into fcb908d - view on LGTM.com

new alerts:

  • 1 for Cross-site scripting

@maytasm
Copy link
Copy Markdown
Contributor Author

maytasm commented Jul 27, 2021

Merging this in as the design proposal (#11297), proposed by @pjain1, also have two +1s by myself and @suneet-s

@maytasm maytasm merged commit c068906 into apache:master Jul 27, 2021
@maytasm maytasm deleted the IMPLY-8622 branch July 27, 2021 04:29
@pjain1
Copy link
Copy Markdown
Member

pjain1 commented Jul 27, 2021

missed this, @maytasm are you already working on support for deep storage as intermediate data store as I am also in middle of implementing it.

@pjain1
Copy link
Copy Markdown
Member

pjain1 commented Jul 28, 2021

Deep store support feature - #11507

@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants