Faster batch segment allocation by reducing metadata IO #17420
AmatyaAvadhanula wants to merge 4 commits into apache:master
Conversation
private long batchAllocationWaitTime = 0L;

@JsonProperty
private boolean segmentAllocationReduceMetadataIO = false;
This is used only for batch segment allocation, IIUC. Let's rename it to batchAllocationReduceMetadataIO.
Once this has been tested thoroughly, we can remove the flag altogether and use the new approach for both regular and batch allocation.
Are you against its usage in the normal flow with the same feature flag?
It's behind a flag anyway, so I don't see a reason not to make the change there as well.
  boolean skipSegmentLineageCheck,
- Collection<SegmentAllocationHolder> holders
+ Collection<SegmentAllocationHolder> holders,
+ boolean skipSegmentPayloadFetchForAllocation
Please rename this argument everywhere.
You could call it reduceMetadataIO (same as the config), fetchRequiredSegmentsOnly, or something similar.
Suggested change:
- boolean skipSegmentPayloadFetchForAllocation
+ boolean reduceMetadataIO
private final ConcurrentHashMap<AllocateRequestKey, AllocateRequestBatch> keyToBatch = new ConcurrentHashMap<>();
private final BlockingDeque<AllocateRequestBatch> processingQueue = new LinkedBlockingDeque<>(MAX_QUEUE_SIZE);

private final boolean skipSegmentPayloadFetchForAllocation;
Please rename as suggested.
}

@Test
@Ignore
Let's remove this test if we are not running it as a UT. If it is meant to be a benchmark, please put it in a Benchmark class and share the details of a sample run in the PR description.
Understood, will remove it.
@Override
public boolean isSegmentAllocationReduceMetadataIO()
{
  return true;
Would be nice to run the tests in this class for both true and false and not just true.
Will do this
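For illustration, a minimal sketch of how a test class could be run for both flag values, assuming JUnit 4's Parameterized runner (the class and test names here are hypothetical, not taken from this PR):

```java
import java.util.Arrays;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)
public class SegmentAllocationFlagTest
{
  @Parameterized.Parameters(name = "reduceMetadataIO = {0}")
  public static Iterable<Object[]> flagValues()
  {
    // Run every test in this class once with the flag enabled and once with it disabled.
    return Arrays.asList(new Object[][]{{true}, {false}});
  }

  private final boolean reduceMetadataIO;

  public SegmentAllocationFlagTest(boolean reduceMetadataIO)
  {
    this.reduceMetadataIO = reduceMetadataIO;
  }

  @Test
  public void testAllocation()
  {
    // ... exercise batch segment allocation with the flag set to reduceMetadataIO ...
  }
}
```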
  );
}

// Populate the required segment info
Suggested change:
- // Populate the required segment info
+ // Create dummy segments for each segmentId with only the shard spec populated
    Segments.ONLY_VISIBLE
  )
);
return metadataStorage.getSegmentTimelineForAllocation(
Is it inefficient here if skipSegmentPayloadFetchForAllocation is true? We are getting segments from retrieveUsedSegmentsForAllocation, then creating a timeline via SegmentTimeline.forSegments, and then getting segments back again via findNonOvershadowedObjectsInInterval. Why do we even need to create a timeline?
Do you mean when it is false?
If it is true, we are simply getting all the used ids in the retrieval call, so we'd have to create a SegmentTimeline since we're interested only in the visible segment set.
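For context, a rough sketch of that step using the timeline APIs mentioned above (the wrapper class and method names are illustrative, not the PR's code). The timeline is what resolves overshadowing across versions, so even when only ids were fetched, the dummy segments are pushed through a timeline to keep just the visible set:

```java
import java.util.Collection;
import java.util.Set;

import org.apache.druid.timeline.DataSegment;
import org.apache.druid.timeline.Partitions;
import org.apache.druid.timeline.SegmentTimeline;
import org.joda.time.Interval;

public class VisibleSegmentsSketch
{
  /**
   * Reduces used segments (or dummy segments built from ids) to the segments that are
   * actually visible, i.e. not overshadowed by newer versions, in the allocation interval.
   */
  public static Set<DataSegment> findVisibleSegments(
      Collection<DataSegment> usedSegments,
      Interval allocationInterval
  )
  {
    // Building the timeline is what resolves overshadowing across versions.
    final SegmentTimeline timeline = SegmentTimeline.forSegments(usedSegments);
    return timeline.findNonOvershadowedObjectsInInterval(allocationInterval, Partitions.ONLY_COMPLETE);
  }
}
```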
// Retrieve the segments for the ids stored in the map to get the numCorePartitions
final Set<String> segmentIdsToRetrieve = new HashSet<>();
for (Map<Interval, SegmentId> itvlMap : versionIntervalToSmallestSegmentId.values())
Why do we want/use the Smallest here?
The idea is to get a consistent result irrespective of the order in which the metadata store returns results.
Can you add a comment on the use of the smallest id? Thanks.
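To illustrate the determinism point, here is a small sketch of keeping the smallest id per (version, interval); the class and method names are hypothetical, and only the map name mirrors the diff above:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.druid.timeline.SegmentId;
import org.joda.time.Interval;

public class SmallestSegmentIdTracker
{
  // version -> interval -> smallest SegmentId seen so far
  private final Map<String, Map<Interval, SegmentId>> versionIntervalToSmallestSegmentId = new HashMap<>();

  /**
   * Keep the smallest SegmentId per (version, interval) so that the single segment later
   * fetched for its numCorePartitions is deterministic, regardless of the order in which
   * the metadata store returns rows.
   */
  public void track(SegmentId segmentId)
  {
    versionIntervalToSmallestSegmentId
        .computeIfAbsent(segmentId.getVersion(), v -> new HashMap<>())
        .merge(
            segmentId.getInterval(),
            segmentId,
            (a, b) -> a.compareTo(b) <= 0 ? a : b
        );
  }
}
```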
  return retrieveSegmentIds(dataSource, Collections.singletonList(interval));
}

private Set<SegmentId> retrieveSegmentIds(
should this also be called retrieveUsedSegmentIds?
Yes, it should. Thanks, I'll make the change.
I've added a TaskLockPosse-level cache for segment allocations, based on the idea that the used segment set and pending segments are covered by an appending lock and cannot change unless altered by tasks holding these locks. There are 24 intervals, each with 2000 data segments, each with 1000 dimensions. Here are the results using a simple unit test:
@AmatyaAvadhanula, caching the segment state seems like a good idea. But it would be better to limit this PR to just the reduced metadata IO changes. The caching changes can be done as a follow-up to this PR.
Do we have a plan to implement a TaskLockPosse-level cache for segment allocations too? @AmatyaAvadhanula @kfaraz
@maytasm, I am currently working on caching segment metadata on the Overlord to speed up segment allocations.
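For context only, a very rough sketch of the lock-scoped caching idea discussed above, with entirely hypothetical names (neither this PR nor the planned follow-up necessarily looks like this): segment state loaded for a locked interval can be reused for as long as the appending lock is held, since only tasks holding that lock can change it.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

import org.apache.druid.timeline.SegmentId;
import org.joda.time.Interval;

// Hypothetical sketch: cache used-segment ids per locked interval while the lock is held.
public class LockScopedSegmentCache
{
  private final Map<Interval, Set<SegmentId>> usedSegmentIdsByInterval = new ConcurrentHashMap<>();

  /** Returns cached ids, or loads and caches them on the first allocation under this lock. */
  public Set<SegmentId> getOrLoad(Interval interval, Function<Interval, Set<SegmentId>> loader)
  {
    return usedSegmentIdsByInterval.computeIfAbsent(interval, loader);
  }

  /** Must be called when the lock covering this interval is released or revoked. */
  public void invalidate(Interval interval)
  {
    usedSegmentIdsByInterval.remove(interval);
  }
}
```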
Problem
Metadata IO is the main bottleneck for segment allocation times, which can contribute to lag significantly.
Currently, the metadata call to fetch used segments limits the allocation rate and contributes to allocation wait time, which can grow quite quickly. The sum of allocation times is a direct contributor to lag.
Idea
Segment allocation with time chunk locking requires only the segment ids and the number of core partitions to determine the next id. We can issue a much less expensive call to fetch segmentIds instead of segments when the segment payloads are large due to many dimensions and metrics.
The number of core partitions does not change for a given (datasource, interval, version), so we can fetch exactly one segment by id for each such combination. This reduces the load on the metadata store and also reduces the time taken to build DataSegment objects on the Overlord.
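A rough sketch of this idea in plain Java (the class and method names are hypothetical, not the PR's code): the next partition number can be derived from the existing segment ids alone, and the core-partition count is read once from a single segment fetched by id for that (datasource, interval, version).

```java
import java.util.Set;

import org.apache.druid.timeline.SegmentId;
import org.apache.druid.timeline.partition.NumberedShardSpec;

public class NextSegmentIdSketch
{
  /**
   * Derives the shard spec of the next segment for an (interval, version) from the ids of
   * existing segments plus a core-partition count fetched from a single segment payload.
   */
  public static NumberedShardSpec nextShardSpec(Set<SegmentId> existingIds, int numCorePartitions)
  {
    // Only the partition numbers of the existing ids are needed to pick the next id;
    // the full segment payloads (dimensions, metrics, etc.) are not required.
    final int maxPartitionNum =
        existingIds.stream().mapToInt(SegmentId::getPartitionNum).max().orElse(-1);
    return new NumberedShardSpec(maxPartitionNum + 1, numCorePartitions);
  }
}
```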
Usage
Adds a new feature flag druid.indexer.tasklock.segmentAllocationReduceMetadataIO with default value false.
Setting this flag to true allows segment allocation to fetch only the required segmentIds and fewer segment payloads from the metadata store.
At present, this flag is only applicable for TimeChunk locking with batch segment allocation.
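For example, assuming the flag keeps its current name (the review above suggests renaming it to batchAllocationReduceMetadataIO), it would be enabled in the Overlord's runtime.properties:

```properties
# Fetch only segment ids (plus one payload per (datasource, interval, version))
# during batch segment allocation. Defaults to false.
druid.indexer.tasklock.segmentAllocationReduceMetadataIO=true
```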