Skip to content

Fix task bootstrapping & simplify segment load/drop flows#16475

Merged
abhishekrb19 merged 48 commits intoapache:masterfrom
abhishekrb19:fixup_task_bootstrap_inject
Jun 4, 2024
Merged

Fix task bootstrapping & simplify segment load/drop flows#16475
abhishekrb19 merged 48 commits intoapache:masterfrom
abhishekrb19:fixup_task_bootstrap_inject

Conversation

@abhishekrb19
Copy link
Copy Markdown
Contributor

@abhishekrb19 abhishekrb19 commented May 20, 2024

Problems

  1. The segment bootstrapping logic relies on the presence of configured storage locations. That is, only servers that have storage location configured will act like a "segment data server". Tasks on the other hand don't have this static configuration because tasks can dynamically run on any slot. The CliPeon injects a list of dynamic StorageLocations. This List<StorageLocation> is not currently present in SegmentLoadDropHandler, which does the bootstrapping of segments as part of its start() lifecycle. Instead, this class directly uses config.getLocations(), which isn't set by the TaskToolboxFactory when it instantiates SegmentCacheManager.

  2. One of the overloaded constructors marked @VisibleForTesting in SegmentLocalCacheManager is also being used by tasks via SegmentCacheManagerFactory. This "test constructor" also does not initialize the thread pool executors, so segments won't be loaded into the page cache even if loader config has the appropriate settings.

Core changes

The core part of the fix in SegmentLoadDropHandler is to rely on the source of truth for the caching layer: SegmentCacheManager. Currently, the various interactions of the classes are complex and have abstraction leaks in its implementation:

  • SegmentManager -> SegmentCacheLoader -> SegmentCacheManager
  • SegmentLoadDropHandler -> SegmentManager -> SegmentCacheLoader -> SegmentCacheManager
  • SegmentLoadDropHandler -> SegmentCacheManager

The SegmentLoadDropHandler directly handles the segment info file creation and deletion. These low-level file operations should really be owned and handled by the SegmentCacheManager and the classes that interact with should just operate at the granularity of Segment objects.

With this patch, the updated interactions are as follows:

  • SegmentManager -> SegmentCacheManager
  • SegmentLoadDropHandler -> SegmentManager -> SegmentCacheManager

The addition and deletion of segment info files are handled by the SegmentLocalCacheManager class itself. The SegmentLoadDropHandler gets the set of cached segments from the SegmentManager. To that effect, the dependencies in SegmentLoadDropHandler have been cleaned up, and this class doesn't fiddle with the segment files directly.

In order to simplify this, we did the following:

  1. Removed the SegmentLoader interface and its concrete implementation SegmentLocalCacheLoader. The implementation was acting as a simple factory of Segment objects from the underlying segment files, which has now been moved to SegmentLocalCacheManager, still providing convenience for any test overrides for the factory.

  2. Added the following methods to SegmentCacheManager for callers to interact with:

    • boolean canHandleSegments(): A high-level method that informs callers whether the caching layer can handle segments or not.
    • List getCachedSegments()
    • void storeInfoFile(DataSegment segment)
    • void removeInfoFile(DataSegment segment)
    • ReferenceCountingSegment getSegment(DataSegment segment)
    • ReferenceCountingSegment getBootstrapSegment(DataSegment segment, SegmentLazyLoadFailCallback loadFailed)
    • void loadSegmentIntoPageCache(DataSegment segment)
    • void loadSegmentIntoPageCacheOnBootstrap(DataSegment segment)
  • SegmentManager is now responsible for calling storeInfoFile(segment) once the segment bookkeeping stuff is successful as part of the load segment flow.
  • The bootstrap executor was initialized by SegmentLoadDropHandler, which is now moved to the SegmentLocalCacheManager., along with the initialization of the segment download executor.
  • Methods in SegmentManager and SegmentLocalCacheManager have bene broken down into loadSegment() and loadBootstrapSegment() variants, because segment lazy loading and the fail callback are only applicable for the bootstrap segment flow. This cleanup ensures that callers don't pass unnecessary null arguments anymore and makes the intent clearer.
  • Remove some VisibleForTesting hooks, including the problematic "test constructor" in SegmentLocalCacheManager noted in the problems section. Where possible, the test code uses the same initialization logic as the core code.

Test changes

  • Added new tests in SegmentLocalCacheManagerTest and SegmentManagerTest.
  • Moved the existing tests from SegmentLoaderTest to SegmentLocalCacheManagerTest.
  • Added TestSegmentUtils for commonly used test utilities that has test implementation support for loading DataSegment, Segment factory, etc.

Future work

  1. We can further simplify and cleanup these interactions by moving the functionality of SegmentLoadDropHandler to SegmentManager such that the announcements moves to the manager layer itself. That'll entail a slightly larger refactor.
  2. Cleanup the test code in SegmentLocalCacheManagerTest

Both of these require moderate to large refactoring changes, which can be tackled separately.

Release note

Fixed an issue in task bootstrapping that prevented tasks from accepting any segment assignments, including broadcast segments.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels May 20, 2024
- The load drop handler code talks to the local cache manager via
SegmentManager.
public void storeInfoFile(DataSegment segment) throws IOException
{
final File segmentInfoCacheFile = new File(getInfoDir(), segment.getId().toString());
if (!segmentInfoCacheFile.exists()) {

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression

This path depends on a [user-provided value](1).
final File factoryJson = new File(segmentFiles, "factory.json");
final SegmentizerFactory factory;

if (factoryJson.exists()) {

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression

This path depends on a [user-provided value](1).

if (factoryJson.exists()) {
try {
factory = jsonMapper.readValue(factoryJson, SegmentizerFactory.class);

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression

This path depends on a [user-provided value](1).
@abhishekrb19 abhishekrb19 force-pushed the fixup_task_bootstrap_inject branch from be28082 to aece4e6 Compare May 20, 2024 21:51
@abhishekrb19 abhishekrb19 force-pushed the fixup_task_bootstrap_inject branch from aece4e6 to 6e563aa Compare May 21, 2024 00:26
@abhishekrb19 abhishekrb19 requested review from cheddar and kfaraz May 21, 2024 01:31
Comment thread server/src/test/java/org/apache/druid/server/TestSegmentUtils.java Fixed
Comment thread server/src/test/java/org/apache/druid/server/TestSegmentUtils.java Fixed
Comment thread server/src/test/java/org/apache/druid/server/TestSegmentUtils.java Fixed
Comment thread server/src/test/java/org/apache/druid/server/TestSegmentUtils.java Fixed
try {
byte[] bytes = new byte[size];
ThreadLocalRandom.current().nextBytes(bytes);
Files.write(bytes, segmentFile);

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression

This path depends on a [user-provided value](1).
byte[] bytes = new byte[size];
ThreadLocalRandom.current().nextBytes(bytes);
Files.write(bytes, segmentFile);
Files.write("{\"type\":\"testSegmentFactory\"}".getBytes(StandardCharsets.UTF_8), factoryJson);

Check failure

Code scanning / CodeQL

Uncontrolled data used in path expression

This path depends on a [user-provided value](1).
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented May 21, 2024

@abhishekrb19 , thanks for the detailed description.

IIUC, this PR is basically doing the following:

  • Inject the correct value of storage locations into SegmentLoadDropHandler
  • Refactor and cleanup classes

Is there any other bug fix or feature included in this PR too?
Just wanted to confirm before I start the actual review.

@abhishekrb19
Copy link
Copy Markdown
Contributor Author

abhishekrb19 commented May 21, 2024

IIUC, this PR is basically doing the following:

  • Inject the correct value of storage locations into SegmentLoadDropHandler
  • Refactor and cleanup classes

That's about right, except that we're not injecting storage locations into SegmentLoadDropHandler to fix the bug. Instead, SegmentLoadDropHandler relies on SegmentManager and SegmentCacheManager to provide the needed information. With this patch, SegmentCacheManager essentially serves as the source of truth for low-level segment info file operations, and this class already has the list of storage locations injected correctly.

Is there any other bug fix or feature included in this PR too? Just wanted to confirm before I start the actual review.

There's no other bug fix or feature in this PR. I do have another related change that touches some of this code, but given the size of this PR, I will add it separately to make reviewing easier. 🙂

@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented May 21, 2024

Makes sense. Thanks for the clarification, @abhishekrb19 !

@abhishekrb19 abhishekrb19 force-pushed the fixup_task_bootstrap_inject branch from 19c00e4 to 16658e1 Compare May 29, 2024 06:08
@abhishekrb19 abhishekrb19 force-pushed the fixup_task_bootstrap_inject branch from 1e028cb to e02f5cb Compare May 29, 2024 19:49
The intellij-inspect tool doesn't seem to correctly inspect
lambda usages. See ScheduledExecutors.
@abhishekrb19 abhishekrb19 requested a review from jon-wei May 30, 2024 21:37
@abhishekrb19 abhishekrb19 force-pushed the fixup_task_bootstrap_inject branch from d2d1c2d to f11da12 Compare June 3, 2024 16:37
@abhishekrb19 abhishekrb19 changed the title Fix task bootstrapping. Fix task bootstrapping & simplify segment load/drop flows Jun 3, 2024
@abhishekrb19 abhishekrb19 force-pushed the fixup_task_bootstrap_inject branch from c262ed4 to f97d10f Compare June 4, 2024 14:38
@abhishekrb19 abhishekrb19 merged commit b9ba286 into apache:master Jun 4, 2024
@abhishekrb19 abhishekrb19 mentioned this pull request Jun 21, 2024
10 tasks
@abhishekrb19 abhishekrb19 deleted the fixup_task_bootstrap_inject branch June 24, 2024 15:39
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Area - Querying Release Notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants