Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments #8564
…e() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()
| }) | ||
| .distinct() | ||
| .collect(Collectors.toList()); | ||
| return new ArrayList<>(timeline.iterateAllObjects()); |
This can result in a different set of holders. lookup() can adjust the first and last holders so that they are aligned with the given interval.
It doesn't matter because the result of this method is a collection of objects, not holders.
…method; Propagate the decision about whether only visible segments or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmentListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever
The checkstyle issue looks legitimate; I'm not sure about the strict compilation one.
…sonUtils.readValue() to reduce boilerplate code
…egmentsForInterval
…skTest and KafkaIndexTaskTest into SeekableStreamIndexTaskTestBase
…egmentsForInterval
|
A much larger issue (ambiguity of |
Could somebody please review this PR?
|
|
||
| /** | ||
| * This enum is used as a parameter for several methods in {@link VersionedIntervalTimeline}, specifying whether only | ||
| * complete partitions should be considered, or incomplete partitions as well. |
Should this explain what "complete partition" means? Also it can link PartitionHolder.isComplete().
Since VersionedIntervalTimeline treats completion status mechanically rather than semantically (that is, the implementation of VersionedIntervalTimeline would be the same regardless of what isComplete() semantically means), I just added a reference to the method. I think Javadoc should be added to the isComplete() method to explain, with examples, what it means, but I don't feel confident about writing it. I've also opened issue #8788, which is related.
| .list(); | ||
| } | ||
|
|
||
| private Query<Map<String, Object>> createUsedSegmentsSqlQueryForIntervals( |
Would you please add a javadoc for this method?
|
|
||
| /** | ||
| * This enum is used as a parameter for several methods in {@link IndexerMetadataStorageCoordinator}, specifying whether | ||
| * only visible segments, or visible as well as overshadowed segments should be included in results. |
Since this enum is used only in IndexerMetadataStorageCoordinator, it will be more accurate if it says all segments in the result are published segments.
I think this would be overspecification for the enum. Instead, I've added "published" adjective to descriptions of all related methods in IndexerMetadataStorageCoordinator.
|
|
||
| @Test | ||
| public void testSegmentListUsedAction() | ||
| public void testRetrieveUsedSegmentsAction() |
Please change the name of this class as well.
| import java.util.Set; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| public class SeekableStreamIndexTaskTestBase extends EasyMockSupport |
I didn't check all tests in Kafka/KinesisIndexTaskTest, but I assume they are basically the same as before, except that the common parts are extracted into this base class. Is this correct?
| * @param interval The interval for which all applicable and used datasources are requested. Start is inclusive, | ||
| * end is exclusive | ||
| * @param visibility Whether only visible or visible as well as overshadowed segments should be returned. The | ||
| * visibility is considered within the specified interval: that is, a segment which is globally |
What does it mean by "globally visible"?
Reworded to "visible outside of the specified interval(s)"
It's still unclear to me what "visible outside of an interval" means. Does this imply that a segment can be overshadowed only when we want to find the most recent one among segments in the same interval?
I didn't understand your question, but I added a more detailed description of visibility to the Javadoc for Segments and linked to it from the methods. See this commit.
| for (int i = 0; i < intervals.size(); i++) { | ||
| sb.append( | ||
| StringUtils.format("(start <= ? AND %1$send%1$s >= ?)", connector.getQuoteString()) | ||
| StringUtils.format("(start < ? AND %1$send%1$s > ?)", connector.getQuoteString()) |
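For readers following the switch from `<=`/`>=` to `<`/`>` in this condition, here is a minimal sketch of why strict comparisons stop matching abutting intervals. It assumes half-open intervals (start inclusive, end exclusive, as the Javadoc in this PR describes) and assumes the first placeholder is bound to the end of the queried interval and the second to its start. The `IntervalOverlap` class and its `Interval` type are hypothetical and are not part of Druid.

```java
public class IntervalOverlap {
    /** Hypothetical half-open interval [start, end); endpoints are plain longs (e.g. epoch millis). */
    public static final class Interval {
        final long start;
        final long end;

        public Interval(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Former condition: also matches segments that merely touch the queried interval. */
    public static boolean matchesInclusive(Interval segment, Interval query) {
        return segment.start <= query.end && segment.end >= query.start;
    }

    /** New condition: matches only segments that share at least one instant with the query. */
    public static boolean matchesStrict(Interval segment, Interval query) {
        return segment.start < query.end && segment.end > query.start;
    }

    public static void main(String[] args) {
        Interval query = new Interval(100, 200);
        Interval abutting = new Interval(0, 100);      // ends exactly where the query starts
        Interval overlapping = new Interval(150, 250); // shares [150, 200) with the query

        System.out.println(matchesInclusive(abutting, query)); // true
        System.out.println(matchesStrict(abutting, query));    // false
        System.out.println(matchesStrict(overlapping, query)); // true
    }
}
```

With an exclusive end, a segment ending at 100 contains no instant of a query starting at 100, so only the strict form agrees with the half-open convention.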
| * the intervals in the series. | ||
| * | ||
| * If not specified otherwise, visibility (or overshadowness) should be assumed on the interval (-inf, +inf). This | ||
| * visibility may also be called "global" or "general" visibility. |
This new doc looks nice. Just one question: do we need the concept of "global visibility"? It seems like it's enough to say "If not specified otherwise, visibility (or overshadowness) should be assumed on the interval (-inf, +inf)."
Ok, removed the last sentence.
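To make "visibility within a specified interval" concrete, here is a toy sketch. It is not Druid's VersionedIntervalTimeline: it assumes integer time units, lexicographically compared version strings, and that a segment counts as visible within an interval if at least one time unit it covers there has no higher-versioned segment on top of it. Only the enum values ONLY_VISIBLE and INCLUDING_OVERSHADOWED come from this PR; `VisibilitySketch`, `Segment`, `isVisibleWithin()` and `retrieve()` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VisibilitySketch {
    /** Mirrors the two enum values named in this PR. */
    public enum Segments { ONLY_VISIBLE, INCLUDING_OVERSHADOWED }

    /** Toy segment: half-open interval [start, end) in integer time units, plus a version. */
    public static final class Segment {
        final String id;
        final int start;
        final int end;
        final String version;

        public Segment(String id, int start, int end, String version) {
            this.id = id;
            this.start = start;
            this.end = end;
            this.version = version;
        }
    }

    /** True if some time unit in both the segment and [from, to) is not covered by a higher version. */
    static boolean isVisibleWithin(Segment s, List<Segment> all, int from, int to) {
        for (int t = Math.max(s.start, from); t < Math.min(s.end, to); t++) {
            boolean shadowed = false;
            for (Segment o : all) {
                if (o != s && o.start <= t && t < o.end && o.version.compareTo(s.version) > 0) {
                    shadowed = true;
                    break;
                }
            }
            if (!shadowed) {
                return true;
            }
        }
        return false;
    }

    /** Returns ids of segments overlapping [from, to), filtered by the requested visibility. */
    public static List<String> retrieve(List<Segment> all, int from, int to, Segments visibility) {
        List<String> result = new ArrayList<>();
        for (Segment s : all) {
            boolean overlaps = s.start < to && s.end > from;
            if (overlaps
                && (visibility == Segments.INCLUDING_OVERSHADOWED || isVisibleWithin(s, all, from, to))) {
                result.add(s.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> used = Arrays.asList(
            new Segment("A", 0, 10, "v1"),
            new Segment("B", 0, 5, "v2")
        );
        System.out.println(retrieve(used, 0, 5, Segments.ONLY_VISIBLE));           // [B]
        System.out.println(retrieve(used, 0, 5, Segments.INCLUDING_OVERSHADOWED)); // [A, B]
        System.out.println(retrieve(used, 0, 10, Segments.ONLY_VISIBLE));          // [A, B]
    }
}
```

Segment A is visible on (-inf, +inf) because nothing overshadows it on [5, 10), yet it is overshadowed within [0, 5). That is the distinction the Javadoc draws between visibility on the whole timeline and visibility within the queried interval.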
jihoonson
left a comment
LGTM. Thanks for the quick update.
A PR that helps with failing tests in #7306.
Description
Semantic changes, API changes, corrections
- IndexerMetadataStorageCoordinator.getUsedSegmentsForInterval() and getUsedSegmentsForIntervals() methods used to say nothing about whether they return only visible used segments in the specified interval(s) or all used segments. The implementation used to return only visible segments. I added a Segments visibility parameter to these methods, where Segments is an enum with values ONLY_VISIBLE and INCLUDING_OVERSHADOWED, to force users of these methods to always make an explicit choice. I'm not sure all existing users of this API actually want only visible segments. These include:
  - MaterializedViewSupervisor. FYI @sekingme, @zhangxinyu1
  - HadoopIngestionSpec. FYI @nishantmonu51
  - ActionBasedUsedSegmentChecker. FYI @jihoonson, @gianm
  - SegmentAllocateAction. FYI @jihoonson, @gianm
  - CompactionTask. FYI @jihoonson
  - MetadataResource: this is the place where I've changed the former behavior: the /datasources/{dataSourceName}/segments endpoint now returns all used segments (including overshadowed) on the specified intervals, rather than only visible ones. This is a potentially Incompatible change, so it warrants Design Review and a mention in Release Notes.
- getUsedSegmentsForInterval() and getUsedSegmentsForIntervals() now also return a Collection instead of a List to emphasize that the order of the returned segments is unspecified (and also to avoid some unnecessary copying between collections in the implementation).
- Renamed UsedSegmentsLister to UsedSegmentsRetriever. The immediate reason for this rename is that this command doesn't return a List anymore (now it returns a Collection). The new name also makes the cost of this operation (an RPC call) more apparent. Also, added a Segments visibility parameter to this command to reflect the changes in IndexerMetadataStorageCoordinator's API. Note: UsedSegmentsLister is not Druid's public or extension API.
- Renamed SegmentListUsedAction to RetrieveUsedSegmentsAction and SegmentListUnusedAction to RetrieveUnusedSegmentsAction. Also, added a Segments visibility parameter to the first action.
  The JSON serialization names of the commands are not changed, for backward compatibility: they are still "segmentListUsed" and "segmentListUnused", respectively. Even after updating Druid to Jackson 2.9, these names may not be changed as with property names (see "Rename poorly named properties and config options when updating to Jackson 2.9", #7152), because these names are specified via a different beast: JsonSubTypes.Type. The addition of the Segments visibility parameter (including to the JSON serialization form) should be backward- and forward-compatible in Jackson. When absent, the parameter defaults to ONLY_VISIBLE (the former behavior).
- MetadataSegmentManager.retrieveAllDataSourceNames() now returns a Set instead of a Collection.
Fixes in tests
A lot of changes, primarily in KafkaIndexTaskTest and KinesisIndexTaskTest, which relied on a specific order of results returned from IndexerMetadataStorageCoordinator.getUsedSegmentsForInterval(), which was unspecified and changed in this PR.
Refactorings and new utility methods
- JacksonUtils.readValue() to reduce boilerplate with catching IOException and wrapping it into a RuntimeException.
- VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() which before appeared in multiple places in the codebase ad-hoc. This change also fixes locking duplicating segments in AbstractBatchIndexTask, which could have been a bug, or benign unnecessary behavior. FYI @jihoonson. (This is the reason why this PR is labeled Bug.)
Refactoring of tests
- SeekableStreamIndexTaskTestBase to reduce duplication between KafkaIndexTaskTest and KinesisIndexTaskTest. Note that this contradicts the advice to favor composition and "Fixture" classes over inheritance in tests, which I brought up a few days ago on the mailing list. This is because, with Druid's current policy of not using static imports, usage of static methods would be far too much of a mouthful. There may be a way to refactor these tests in object terms to create composable object fixtures. That would be better than inheritance, but I didn't explore this because the work would be too far outside the scope of this PR. Anyway, I think at least reducing duplication via a common base class is a step in the right direction, both compared to leaving the duplicated code as is, and as an intermediate step towards an object composition refactoring, if somebody endeavors to do it in the future.
- VersionedIntervalTimelineSpecificDataTest from VersionedIntervalTimelineTest. Half of the methods in the former VersionedIntervalTimelineTest required a different setUp() than the other half. I've split these halves between separate classes to avoid confusion. However, I've also added a common base class, VersionedIntervalTimelineTestBase, for them to avoid duplication. In this case, just allowing static imports from one class to another would make this unnecessary. I'll write about this on the mailing list.
Optimization
- List in PartitionHolder.iterator(), spliterator(), and stream() methods. See changes in PartitionHolder and OvershadowableManager classes.
- List in the implementation of NewestSegmentFirstIterator and unnecessary sort in findSegmentsToCompact().
This PR has: