Skip to content

Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments#8564

Merged
jihoonson merged 26 commits intoapache:masterfrom
metamx:refactor-getUsedSegmentsForInterval
Nov 6, 2019
Merged

Conversation

@leventov
Copy link
Copy Markdown
Member

@leventov leventov commented Sep 20, 2019

A PR that helps with failing tests in #7306.

Description

Semantic changes, API changes, corrections

  • The specification (and the names) of IndexerMetadataStorageCoordinator.getUsedSegmentsForInterval() and getUsedSegmentsForIntervals() methods used to say nothing about whether it returns only visible used segments in the specified interval(s) or all used segments. The implementation used to return only visible segments. I added Segments visibility parameter to these methods, where Segments is a enum with values ONLY_VISIBLE and INCLUDING_OVERSHADOWED, to force users of these methods to always make an explicit choice. I'm not sure all existing users of this API actually want only visible segments. These include:
    • MaterializedViewSupervisor. FYI @sekingme, @zhangxinyu1
    • HadoopIngestionSpec. FYI @nishantmonu51
    • ActionBasedUsedSegmentChecker. FYI @jihoonson, @gianm
    • SegmentAllocateAction. FYI @jihoonson, @gianm
    • CompactionTask. FYI @jihoonson
    • MetadataResource: this is the place where I've changed the former behavior: /datasources/{dataSourceName}/segments endpoint now returns all used segments (including overshadowed) on the specified intervals, rather than only visible ones. This is a potentially Incompatible change, so it warrans Design Review and a mention in Release Notes.
  • getUsedSegmentsForInterval() and getUsedSegmentsForIntervals() now also return a Collection instead of a List to emphasize that the order of the returned segments is unspecified (and also to avoid some unnecesssary copying between collections in the implementation).
  • Renamed UsedSegmentsLister to UsedSegmentsRetriever. The immediate reason for this rename is that this command doesn't return a List anymore (now it returns a Collection). The new name also makes more apparent the cost of this operation (an RPC call). Also, added Segments visibility parameter to this command to reflect the changes in IndexerMetadataStorageCoordinator's API. Note: UsedSegmentsLister is not Druid's public or extension API.
  • Renamed SegmentListUsedAction to RetrieveUsedSegmentsAction and SegmentListUnusedAction to RetrieveUnusedSegmentsAction. Also, added Segments visibility parameter to the first action. The JSON serialization names of the commands are not changed, for backward compatibility: it's still "segmentListUsed" and "segmentListUnused", respectively. Even after updating Druid to Jackson 2.9, these names may not be changed as with property names (see Rename poorly named properties and config options when updating to Jackson 2.9 #7152), because these names are specified via a different beast: JsonSubTypes.Type. The addition of Segments visibility parameter (including to the JSON serialization form) should be backward- and forward-compatible in Jackson. When absent, the parameter defaults to ONLY_VISIBLE (the former behavior).
  • MetadataSegmentManager.retrieveAllDataSourceNames() now returns a Set instead of a Collection.

Fixes in tests

A lot of changes primarily in KafkaIndexTaskTest and KinesisIndexTaskTest which relied on a specific order of results returned from IndexerMetadataStorageCoordinator.getUsedSegmentsForInterval() which was unspecified and changed in this PR.

Refactorings and new utility methods

  • Added JacksonUtils.readValue() to reduce boilerplate with catching IOException and wrapping it into a RuntimeException.
  • Added VersionedIntervalTimeline.findNonOvershadowedObjectsInInterval() which before appeared in multiple places in the codebase ad-hoc. This change also fixes locking duplicating segments in AbstractBatchIndexTask, which could have been a bug, or benign unnecessary behavior. FYI @jihoonson. (This is the reason why this PR is labeled Bug.)

Refactoring of tests

  • Extracted class abstract class SeekableStreamIndexTaskTestBase to reduce duplication between KafkaIndexTaskTest and KinesisIndexTaskTest. Note that this contradicts the advice to favor composition and "Fixture" classes over inheritance in tests which I brought up a few days ago in the mailing list. This is because with the current Druid's policy to not use static imports, usage of static methods would be way too mouthful. There may be a way to refactor these tests in object terms to create composable object fixtures. That would be better than inheritance, but I didn't explore this because this work would be too far outside of the scope of the PR. Anyway, I think at least reducing duplication via a common base class is a step in the right direction, both compared to not doing anything and leaving the duplicated code as is, and as an intermediate step towards object composition refactoring, if somebody endeavors to do it in the future.
  • Extracted VersionedIntervalTimelineSpecificDataTest from VersionedIntervalTimelineTest. Half of the methods in former VersionedIntervalTimelineTest required a different setUp() than the other half. I've split these halves between separate classes to avoid confusion. However, I've also added a common base class VersionedIntervalTimelineTestBase for them to avoid duplication. In this case, just allowing static imports from one class to another would make this unnecessary. I'll write about this in the mailing list.

Optimization

  • Don't create an intermediate List in PartitionHolder.iterator(), spliterator(), and stream() methods. See changes in PartitionHolder and OvershadowableManager classes.
  • Don't copy data into an intermediate List in the implementation of NewestSegmentFirstIterator and unnecessary sort in findSegmentsToCompact().

This PR has:

  • been self-reviewed.

…e() don't fetch abutting intervals; simplify getUsedSegmentsForIntervals()
Copy link
Copy Markdown
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@leventov leventov added the WIP label Sep 20, 2019
Copy link
Copy Markdown
Contributor

@egor-ryashin egor-ryashin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

})
.distinct()
.collect(Collectors.toList());
return new ArrayList<>(timeline.iterateAllObjects());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can result in a different set of holders. lookup() can adjust the first and last holders so that they are aligned with the given interval.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter because the result of this method is a collection of objects, not holders.

…method; Propagate the decision about whether only visible segmetns or visible and overshadowed segments should be returned from IndexerMetadataStorageCoordinator's methods to the user logic; Rename SegmentListUsedAction to RetrieveUsedSegmentsAction, SegmetnListUnusedAction to RetrieveUnusedSegmentsAction, and UsedSegmentLister to UsedSegmentsRetriever
@leventov leventov added the Bug label Sep 27, 2019
@gianm
Copy link
Copy Markdown
Contributor

gianm commented Sep 30, 2019

The checkstyle issue looks legitimate; I'm not sure about the strict compilation one.

@leventov leventov changed the title IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle() don't fetch abutting intervals; Simplify getUsedSegmentsForIntervals() Fix ambiguity about IndexerSQLMetadataStorageCoordinator.getUsedSegmentsForInterval() returning only non-overshadowed or all used segments Oct 11, 2019
@leventov
Copy link
Copy Markdown
Member Author

A much larger issue (ambiguity of getUsedSegmentsForInterval()) surfaced when I investigated the test failures in the original "small refactoring" PR. This is now a completely different, much larger PR. See the updated description.

@egor-ryashin @nishantmonu51 @gianm @jihoonson.

@leventov leventov removed the WIP label Oct 11, 2019
@leventov
Copy link
Copy Markdown
Member Author

Could somebody please review this PR?


/**
* This enum is used a parameter for several methods in {@link VersionedIntervalTimeline}, specifying whether only
* complete partitions should be considered, or incomplete partitions as well.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this explain what "complete partition" means? Also it can link PartitionHolder.isComplete().

Copy link
Copy Markdown
Member Author

@leventov leventov Oct 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since VersionedIntervalTimeline treats completion status mechanically rather than semantically (that is, the implementation of VersionedIntervalTimeline would be the same regardless of what isComplete() semantically means), I just added reference to the method. I think Javadoc should be added to isComplete() method to explain illustratively what does it mean, but I don't feel confident about it. I've also opened issue #8788 which is related.

.list();
}

private Query<Map<String, Object>> createUsedSegmentsSqlQueryForIntervals(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add a javadoc for this method?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added


/**
* This enum is used as a parameter for several methods in {@link IndexerMetadataStorageCoordinator}, specifying whether
* only visible segments, or visible as well as overshadowed segments should be included in results.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this enum is used only in IndexerMetadataStorageCoordinator, it will be more accurate if it says all segments in the result are published segments.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be overspecification for the enum. Instead, I've added "published" adjective to descriptions of all related methods in IndexerMetadataStorageCoordinator.


@Test
public void testSegmentListUsedAction()
public void testRetrieveUsedSegmentsAction()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change the name of this class as well.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed

import java.util.Set;
import java.util.stream.Collectors;

public class SeekableStreamIndexTaskTestBase extends EasyMockSupport
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check all tests in Kafka/KinesisIndexTaskTest, but just assume they are basically same with before except that common parts are extracted as this base class. Is this correct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

* @param interval The interval for which all applicable and used datasources are requested. Start is inclusive,
* end is exclusive
* @param visibility Whether only visible or visible as well as overshadowed segments should be returned. The
* visibility is considered within the specified interval: that is, a segment which is globally
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean by "globally visible"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded to "visible outside of the specified interval(s)"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still unclear to me what "visible outside of an interval" means. Does this imply that a segment can be overshadowed only when we want to find the most recent one among segments in the same interval?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand your question but added a more detailed description of visibility to the Javadoc for Segments and linked from the methods. See this commmit.

for (int i = 0; i < intervals.size(); i++) {
sb.append(
StringUtils.format("(start <= ? AND %1$send%1$s >= ?)", connector.getQuoteString())
StringUtils.format("(start < ? AND %1$send%1$s > ?)", connector.getQuoteString())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

* the intervals in the series.
*
* If not specified otherwise, visibility (or overshadowness) should be assumed on the interval (-inf, +inf). This
* visibility may also be called "global" or "general" visibility.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new doc looks nice. Just one question is, do we need the concept of "global visibility"? Seems like it's enough to say If not specified otherwise, visibility (or overshadowness) should be assumed on the interval (-inf, +inf)..

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removed the last sentence.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for quick update.

@jihoonson jihoonson merged commit 5c0fc0a into apache:master Nov 6, 2019
@leventov leventov added this to the 0.17.0 milestone Nov 7, 2019
@leventov leventov deleted the refactor-getUsedSegmentsForInterval branch November 7, 2019 08:41
@jon-wei jon-wei mentioned this pull request Dec 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants