Kill segments by versions#15994
Conversation
Kill tasks by default kill all versions of unused segments in the specified interval. Users wanting to delete specific versions (for example, data compliance reasons) and keep rest of the versions can specify the optional version in the kill task payload.
5c22821 to
b49d7f9
Compare
Sort of like method-level parameterized tests.
kfaraz
left a comment
There was a problem hiding this comment.
Left some comments. On the whole, I would prefer that we override the methods where possible to add a new flavour that accepts versions. Otherwise, we have to pass a lot of nulls around which is a little confusing.
Retain the old interface method and make it default and route it to the method with nullable versions variant. Update usages to use the default method where versions doesn't matter.
d364931 to
4341955
Compare
kfaraz
left a comment
There was a problem hiding this comment.
Left a couple of final comments, otherwise all looks good.
Haven't gone through the tests in detail. But should we also add a test to verify that we don't kill a segment whose load spec is currently being used by some other version segment?
Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>
…sed load spec. Checkpoint.
| final String versionsStr = versions.stream() | ||
| .map(version -> "'" + version + "'") | ||
| .collect(Collectors.joining(",")); | ||
| sb.append(StringUtils.format(" AND version IN (%s)", versionsStr)); |
There was a problem hiding this comment.
should we enforce some limit on the number of versions here? I think in practice most users wouldnt be specifying too many versions at once.
There was a problem hiding this comment.
Good point. However, we don't enforce an upper bound on some related parameters like limit and batchSize, so I'm unsure if we want to do it for versions size. Also, the maximum size for IN would also depend on the underlying metadata store itself, so my suggestion would be to roll this out without any size restriction right out of the bat, and revisit this later if needed
| final DateTime maxUsedStatusLastUpdatedTime1 = DateTimes.nowUtc(); | ||
|
|
||
| // Delay for 1s, mark the segments as unused and then capture the last updated time cutoff again | ||
| Thread.sleep(1000); |
There was a problem hiding this comment.
I think we can actually remove the sleep here and in the other existing tests by directly marking the segments as unused using the test connector. That way, the test would have control over the last updated time and we can set it to whatever time. I can clean up this pattern along with other miscellaneous testing stuff in a follow-up patch
kfaraz
left a comment
There was a problem hiding this comment.
Thanks for the changes, @abhishekrb19 !
The test improvements can be done in a follow up.
Out of curiosity, do you plan to make similar changes for markUsed / markUnused APIs too?
| null, | ||
| null, | ||
| null | ||
| ); |
There was a problem hiding this comment.
Indentation seems off when looking at the args in the previous lines.
| final DateTime now = DateTimes.nowUtc(); | ||
| final String v1 = now.toString(); | ||
| final String v2 = now.minusHours(2).toString(); | ||
| final String v3 = now.minusHours(3).toString(); | ||
|
|
||
| final DataSegment segment1 = newSegment(Intervals.of("2019-01-01/2019-02-01"), v1, ImmutableMap.of("foo", "1")); | ||
| final DataSegment segment2 = newSegment(Intervals.of("2019-02-01/2019-03-01"), v2, ImmutableMap.of("foo", "1")); | ||
| final DataSegment segment3 = newSegment(Intervals.of("2019-03-01/2019-04-01"), v3, ImmutableMap.of("foo", "1")); |
There was a problem hiding this comment.
Many of the tests seem to be using similar segments, versions, etc. Do you think some of this can go into the setup method?
| umbrellaInterval, | ||
| null, | ||
| null | ||
| ); |
There was a problem hiding this comment.
Indentation seems off. Closing brace should have smaller indentation than the preceding args.
|
Thanks for the thorough reviews, @kfaraz!
Yes, I'm going to do similar changes to the markUsed / markUnused APIs next. I will also follow up on the test improvements in a separate patch. |
Summary:
Prior to this patch, kill tasks will delete all versions of unused segments within the specified interval. With this patch, users can now delete specific versions of unused data, while retaining the rest by specifying an optional list of
versionsin the kill task payload. If left unspecified, the default behavior remains unchanged, i.e., delete all versions of unused segments in the interval.Motivation:
Note that adding an optional list of
versionssupport to/markAsUsedand/markAsUnusedsegment management APIs would be a complementary addition. I didn't make those changes in this PR to keep it simple to review - I will follow up on that later.Release note
Kill task accepts an optional list of unused segment
versionsto delete.Key changed/added classes in this PR
KillUnusedSegmentsTask.javaIndexerSQLMetadataStorageCoordinator.javaRetrieveUnusedSegmentsAction.javaKillUnusedSegmentsTaskTest.javaIndexerSQLMetadataStorageCoordinatorTest.javaRetrieveSegmentsActionsTest.javaThis PR has: