Optimize unused segment query for segment allocation#16623
Optimize unused segment query for segment allocation#16623cryptoe merged 4 commits intoapache:masterfrom
Conversation
kfaraz
left a comment
There was a problem hiding this comment.
Changes look good, left a few non-blocking suggestions.
| log.debug("Found [%,d] unused segments for datasource[%s] for interval[%s] and version[%s].", | ||
| matchingSegments.size(), dataSource, interval, version); |
There was a problem hiding this comment.
Nit:
| log.debug("Found [%,d] unused segments for datasource[%s] for interval[%s] and version[%s].", | |
| matchingSegments.size(), dataSource, interval, version); | |
| log.debug( | |
| "Found [%,d] unused segments for datasource[%s], interval[%s] and version[%s].", | |
| matchingSegments.size(), dataSource, interval, version | |
| ); |
| new NumberedShardSpec(0, 0) | ||
| ); | ||
| DataSegment unusedSegmentForDifferentInterval = createSegment( | ||
| Intervals.of("2023/2024"), |
There was a problem hiding this comment.
Rather than a disjoint interval, a better test would be to verify that a segment in an overlapping (but not identical) interval is not returned.
| @Test | ||
| public void testRetrieveUnusedSegmentsForExactIntervalAndVersion() throws Exception | ||
| { | ||
| DataSegment unusedForDifferentVersion = createSegment( |
There was a problem hiding this comment.
| DataSegment unusedForDifferentVersion = createSegment( | |
| final DataSegment unusedSegmentMay2024V0 = createSegment( |
| public void testRetrieveUnusedSegmentsForExactIntervalAndVersion() throws Exception | ||
| { | ||
| DataSegment unusedForDifferentVersion = createSegment( | ||
| Intervals.of("2024/2025"), |
There was a problem hiding this comment.
Nit: use an interval which is easier to use in a name. You may even assign this interval value to a field named Interval may2024 so that you can reuse it in multiple places.
| Intervals.of("2024/2025"), | |
| Intervals.of("2024-05/P1M"), |
| "v0", | ||
| new NumberedShardSpec(0, 0) | ||
| ); | ||
| DataSegment unusedSegmentForExactIntervalAndVersion = createSegment( |
There was a problem hiding this comment.
| DataSegment unusedSegmentForExactIntervalAndVersion = createSegment( | |
| final DataSegment unusedSegmentMay2024V1 = createSegment( |
| new NumberedShardSpec(0, 0) | ||
| ); | ||
| DataSegment unusedSegmentForExactIntervalAndVersion = createSegment( | ||
| Intervals.of("2024/2025"), |
There was a problem hiding this comment.
| Intervals.of("2024/2025"), | |
| Intervals.of("2024-05/P1M"), |
| "v1", | ||
| new NumberedShardSpec(0, 0) | ||
| ); | ||
| DataSegment unusedSegmentForDifferentInterval = createSegment( |
There was a problem hiding this comment.
| DataSegment unusedSegmentForDifferentInterval = createSegment( | |
| final DataSegment unusedSegmentYear2024V1 = createSegment( |
| new NumberedShardSpec(0, 0) | ||
| ); | ||
| DataSegment unusedSegmentForDifferentInterval = createSegment( | ||
| Intervals.of("2023/2024"), |
There was a problem hiding this comment.
| Intervals.of("2023/2024"), | |
| Intervals.of("2024/P1Y"), |
| ); | ||
| coordinator.markSegmentsAsUnusedWithinInterval(DS.WIKI, Intervals.ETERNITY); | ||
|
|
||
| DataSegment usedSegmentForExactIntervalAndVersion = createSegment( |
There was a problem hiding this comment.
| DataSegment usedSegmentForExactIntervalAndVersion = createSegment( | |
| final DataSegment usedSegmentMay2024V1 = createSegment( |
| coordinator.markSegmentsAsUnusedWithinInterval(DS.WIKI, Intervals.ETERNITY); | ||
|
|
||
| DataSegment usedSegmentForExactIntervalAndVersion = createSegment( | ||
| Intervals.of("2024/2025"), |
There was a problem hiding this comment.
| Intervals.of("2024/2025"), | |
| Intervals.of("2024-05/P1M"), |
#16380 utilized an existing metadata query to fetch unused segments for a given datasource, interval and version but this appeared to take a long time despite the indexes, and could have potential overlord stability implications.
This PR optimizes the query by using an equality check on the interval start and end as it is a special case for segment allocation, instead of using the OVERLAPS or CONTAINS match modes.
On a cluster with 1.8M unused segments for a given datasource, the query which relied on the existing method took over 30s on average, while the new query takes less than a second.
This PR has: