Add test framework to simulate segment loading and balancing #13074
kfaraz merged 12 commits into apache:master
Conversation
This is great! One general suggestion: provide an overview, in a
It seems that the tests don't cover the dynamic aspects: load, the threads which decide when to fire off the control tasks in the coordinator, latencies, etc. It is fine to omit these; they are another level of complexity we could add once the basics work. Still, it would be good to state which bits of the code this framework targets.
```java
alertEvent.getSeverity(),
alertEvent.getService(),
alertEvent.getFeed(),
alertEvent.getDescription()
```
The pattern expects four strings and a number; only four strings are provided.
The last one, `%n`, prints a newline and does not consume an argument.
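For context, the shape of the pattern under discussion (hypothetical; the actual format string isn't shown in this excerpt):

```java
// Hypothetical format string: four %s specifiers for the four string
// arguments above, plus %n, which prints a newline and consumes no argument.
String pattern = "Alert [%s] service[%s] feed[%s]: %s%n";
```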
```diff
- final String toLoadQueueSegPath =
+ final String toLoadQueueSegPath = curator == null ? null :
```
As far as I can tell, none of the arguments here depend on ZK. So, we can define the path even if we don't actually have a ZK.
```java
 * leading to flakiness in the tests. The simulation sets this field to true by
 * default.
 */
public abstract class CoordinatorSimulationBaseTest
```
Consider making this a "fixture" rather than a base test. With a fixture, you write a test like so:
```java
class MyTest
{
  CoordinatorSimulationFixture fixture = new CoordinatorSimulationFixture(...);

  @Before
  public void setup()
  {
    fixture.setup();
    // My own setup
  }

  @Test
  public void myTest()
  {
    fixture.startSimulation(...);
    fixture.doOtherStuff(...);
  }
}
```

The point is that the test class is simple. The test writer can pull in other dependencies without having a multiple-inheritance problem. Etc.
I did start off with the fixture pattern but later decided to have a base test, as it helps avoid every statement beginning with `fixture.`, which begins to feel redundant when every line has it.
The tests that have already been added are unlikely to need to implement other interfaces.
New tests that have conflicting inheritances could always treat the same base test as a fixture, I guess.
You could also have the fixture and then also have a BaseTest implemented using the fixture. Then you are effectively using a composable fixture (and people can fall back to that if need be), but still don't have to repeat `fixture.` in all of the places.
Agreed. Although, I think the `CoordinatorSimulation` object itself is already (somewhat) fulfilling the role of the fixture.
It's easy for a developer to write simulation-based tests without ever having to extend the BaseTest. The BaseTest just provides a bunch of convenience methods which do nothing but delegate to the simulation itself.
In the current tests, other than action invocations on the simulation object, each test does only the following:
1. declare inputs to the simulation, which would be different for each test case
2. extract and map metrics from the simulation, which can be commoned out into the simulation itself
3. verifications

I will make the updates for point 2, thus making the BaseTest an even thinner layer on top of the simulation. But, as there is hardly any common setup required for the tests, I am not sure we need a fixture just yet.
Edit: There is probably some room to put metric value extraction and verification into a fixture. I will see if we can include it in this PR.
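For reference, a minimal sketch of that layering, assuming a hypothetical `CoordinatorSimulationFixture` (the class and method names here are illustrative, not the actual API added in this PR):

```java
import org.junit.Before;

// Sketch only: a composable fixture, plus a thin base test built on top of it
// so that test bodies need not repeat "fixture." everywhere.
class CoordinatorSimulationFixture
{
  void setup() { /* wire up simulation dependencies */ }
  void startSimulation() { /* start the simulated cluster */ }
  void runCoordinatorCycle() { /* invoke one coordinator run */ }
}

public abstract class CoordinatorSimulationBaseTest
{
  // Still exposed, so tests with conflicting inheritance can compose it directly.
  protected final CoordinatorSimulationFixture fixture = new CoordinatorSimulationFixture();

  @Before
  public void setUp()
  {
    fixture.setup();
  }

  // Thin delegates to the fixture.
  protected void startSimulation()
  {
    fixture.startSimulation();
  }

  protected void runCoordinatorCycle()
  {
    fixture.runCoordinatorCycle();
  }
}
```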
```java
jacksonConfigManager.watch(
    EasyMock.eq(CoordinatorCompactionConfig.CONFIG_KEY),
    EasyMock.eq(CoordinatorCompactionConfig.class),
    EasyMock.anyObject()
```
Suggestion: rather than mocking a config, add a constructor or builder to the config. A class that cannot be constructed (as with our config classes) is very tedious to use in tests. Mocking isn't the answer since, if there are methods that compute values, those methods also must be mocked.
Another solution is to use Guice. Use the new startup config builder and friends to pass in the set of properties, then ask the injector to create config instances. Going that route is a bit overkill, but it avoids the need to add constructors: we just use the Json config process we already have.
> Suggestion: rather than mocking a config, add a constructor or builder to the config. A class that cannot be constructed (as with our config classes) is very tedious to use in tests. Mocking isn't the answer since, if there are methods that compute values, those methods also must be mocked.
I agree, most of our configs have private fields, no setters and no builders, and it becomes a pain to use them comfortably in tests.
But here, the configs are not being mocked. Only the config manager is mocked, and we are setting expectations on that mock.
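To illustrate the builder suggestion, a sketch with hypothetical field names (real config classes such as `CoordinatorDynamicConfig` have many more fields):

```java
// Hypothetical config class with a builder, so that tests can construct real
// instances instead of mocking them. Fields shown here are illustrative only.
public class ExampleConfig
{
  private final int maxSegmentsToMove;
  private final boolean useBatchedSegmentSampler;

  private ExampleConfig(int maxSegmentsToMove, boolean useBatchedSegmentSampler)
  {
    this.maxSegmentsToMove = maxSegmentsToMove;
    this.useBatchedSegmentSampler = useBatchedSegmentSampler;
  }

  public static Builder builder()
  {
    return new Builder();
  }

  public static class Builder
  {
    // Sensible test defaults; override only what a test cares about.
    private int maxSegmentsToMove = 5;
    private boolean useBatchedSegmentSampler = true;

    public Builder maxSegmentsToMove(int value)
    {
      this.maxSegmentsToMove = value;
      return this;
    }

    public Builder useBatchedSegmentSampler(boolean value)
    {
      this.useBatchedSegmentSampler = value;
      return this;
    }

    public ExampleConfig build()
    {
      return new ExampleConfig(maxSegmentsToMove, useBatchedSegmentSampler);
    }
  }
}
```

A test would then write `ExampleConfig.builder().maxSegmentsToMove(10).build()` rather than mocking every getter.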
```java
  testBalancingWithInventorySynced(false);
}

private void testBalancingWithInventorySynced(boolean autoSyncInventory)
```
Thanks a lot for the review, @paul-rogers! Your understanding of the changes is correct. We also verify the state of the coordinator itself and the emitted metrics, as the
```java
public String toString()
{
  return "DutiesRunnable{" +
         "dutiesRunnableAlias='" + dutiesRunnableAlias + '\'' +
         '}';
}
```
I know this comment isn't about your code, but your addition of the toString here made me wonder why DruidCoordinator's toString reads as "DutiesRunnable". The class is probably large enough (and already depended upon, see @VisibleForTesting annotation peppering this code) that maybe it's just time to promote it to its own class.
Oh, yeah, the VisibleForTesting is ubiquitous 😅
It would be good to pull out DutiesRunnable. Right now, it seems to be directly using pretty much all the fields that DruidCoordinator contains. That's probably why it is still hanging around here and why it is not a static inner class either.
The preferable way to do this would be for DruidCoordinator to expose a bunch of methods that update the state of segmentManager and other fields that DutiesRunnable needs to access. And the DutiesRunnable constructor just gets the DruidCoordinator instance. DruidCoordinator already exposes other such utility methods such as moveSegment() or markSegmentAsUnused() which are used by the actual duties themselves.
Let me know if this approach makes sense. We can get it done in a follow-up PR.
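A rough sketch of that shape (hypothetical; the actual refactor is planned for a follow-up PR):

```java
// Sketch: DutiesRunnable pulled out as a standalone class that reaches
// coordinator state only through methods DruidCoordinator already exposes,
// e.g. moveSegment() or markSegmentAsUnused().
public class DutiesRunnable implements Runnable
{
  private final DruidCoordinator coordinator;
  private final String dutiesRunnableAlias;

  public DutiesRunnable(DruidCoordinator coordinator, String dutiesRunnableAlias)
  {
    this.coordinator = coordinator;
    this.dutiesRunnableAlias = dutiesRunnableAlias;
  }

  @Override
  public void run()
  {
    // Run the duties, updating coordinator state (segmentManager etc.)
    // only via the coordinator's public methods.
  }

  @Override
  public String toString()
  {
    return "DutiesRunnable{dutiesRunnableAlias='" + dutiesRunnableAlias + "'}";
  }
}
```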
```java
)
{
  log.info("Balancing segments in tier [%s]", tier);
```
Is there any more information that can be added to this? Having just the fact that the balancing occurred is useful, but if we can add sizes or anything else that might be nice to have when trying to understand what happened, that can make it even more useful.
I intend to clean up the logs and add some more useful metrics around balancing/loading as a follow up to these changes.
```java
{
  events.add(event);
  if (event instanceof AlertEvent) {
    final AlertEvent alertEvent = (AlertEvent) event;
```
This seems like it is attempting to use logs to validate that alert events were fired? What's wrong with having the AlertEvents in the list? Or, maybe, have 2 lists, one for metrics and one for alerts?
That makes sense. I did have a dedicated list for alerts during my initial testing but later removed it as I wasn't using it in my tests anymore. Will fix it up.
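Something along these lines inside the test emitter (a sketch; the surrounding class and exact Druid event types are assumed):

```java
// Sketch: keep alerts in a dedicated list instead of filtering the combined
// event list (or parsing logs) in tests.
private final List<Event> metricEvents = new ArrayList<>();
private final List<AlertEvent> alertEvents = new ArrayList<>();

public void emit(Event event)
{
  if (event instanceof AlertEvent) {
    alertEvents.add((AlertEvent) event);
  } else {
    metricEvents.add(event);
  }
}
```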
```java
 * Tests that verify balancing behaviour should set
 * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
 * Otherwise, the segment sampling is random and can produce repeated values
 * leading to flakiness in the tests. The simulation sets this field to true by
 * default.
```
Part of me wonders if this comment doesn't (also?) belong on `createDynamicConfig`?
```
- It should not be used to verify the absolute values of execution latencies, e.g. the time taken to compute the
  balancing cost of a segment. But the relative values can still be a good indicator while doing comparisons between,
  say two balancing strategies.
```
What's wrong with trying to use it to benchmark the execution latencies of different balancing strategies?
Or: what's the difference between "verify the absolute values of execution latencies" and "be a good indicator while doing comparisons between, say, two balancing strategies"?
Yeah, I think the language got a little ambiguous.
I meant that we should not try to assert things like "once a segment is queued, it gets processed within 5 seconds" but we can assert things like "across 5 coordinator runs, cachingCost strategy does faster assignment than cost strategy".
I hope that clarifies things a bit. Let me know if the comment should be rephrased.
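So a test might assert something like the following (the helper and its return type are hypothetical):

```java
// Sketch: compare the two strategies over the same number of coordinator
// runs, rather than asserting any absolute latency.
int assignedByCost = runSimulation("cost", 5).getAssignedSegmentCount();
int assignedByCachingCost = runSimulation("cachingCost", 5).getAssignedSegmentCount();
Assert.assertTrue(assignedByCachingCost >= assignedByCost);
```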
```java
  HttpResponseHandler<Intermediate, Final> handler
)
{
  throw new UnsupportedOperationException();
```
Perhaps overly defensive, but I think that this can be an UOE with a message about all expected usages going through the 3-argument call. That way, if this actually does end up getting called at some point in time, the developer will have an idea for what assumption broke.
I wonder if I shouldn't just allow this one as well. In the 3-arg call, I am not doing anything with the 3rd argument, `durationTimeout`, anyway.
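If the defensive route is taken, the message could read along these lines (wording hypothetical):

```java
// Sketch of the more descriptive failure suggested above.
throw new UnsupportedOperationException(
    "Unexpected call: the simulation expects all requests to go through "
    + "the 3-argument go(request, handler, durationTimeout) overload"
);
```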
cheddar left a comment:
Most of my comments were not functional, just design suggestions and discussion of the location of comments. This is also pretty much entirely in the test code, so very low risk in terms of runtime issues. So, gonna go ahead and approve. Please consider the comments and stuff before merge though.
Thanks for the review, @cheddar! Some of the changes you have suggested are planned for future PRs.
```java
server.removeDataSegment(segment.getId());
segmentCallbacks.forEach(
    (segmentCallback, executor) -> executor.execute(
        () -> segmentCallback.segmentAdded(server.getMetadata(), segment)
```
`segmentAdded` -> `segmentDropped`?
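Presumably the fix looks like this (assuming the callback interface defines a drop-side method, as the comment above suggests):

```java
// Presumed fix: the drop path should fire the drop-side callback,
// not segmentAdded.
segmentCallbacks.forEach(
    (segmentCallback, executor) -> executor.execute(
        () -> segmentCallback.segmentDropped(server.getMetadata(), segment)
    )
);
```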
Fixes #12822
Description
The framework proposed here makes it easy to write tests that verify the behaviour and interactions
of the following entities under various conditions:
- `DruidCoordinator`
- `HttpLoadQueuePeon`, `LoadQueueTaskMaster`
- coordinator duties: `BalanceSegments`, `RunRules`, `UnloadUnusedSegments`, etc.
- retention rules: `LoadRule`, `DropRule`

The framework provides the ability to:
Design
The changes here are slightly different from what was originally proposed,
mostly in terms of how each coordinator cycle is invoked.
Tests that rely on the actual timing of events or the delay before a scheduled task make the setup precarious and the tests flaky. It also makes it difficult to reproduce certain situations. All the executors required for coordinator operations have thus been given only two possible modes here:
- direct: a submitted task is executed immediately on the submitting thread
- blocked: submitted tasks are kept in queue and executed only when explicitly invoked

Execution of scheduled tasks can be done in this mode by simply invoking the tasks at the required time.

Tests could already perform actions in the middle of a coordinator run. But the results of these actions were unreliable due to races with tasks submitted by the run, say loading a segment. With the above modes of execution, it is now possible to perform actions reliably in our tests and recreate any race condition without getting stuck in one ourselves!
For example, a sequence of steps could be to: run the coordinator, load one segment from the queue, sync the inventory, load the remaining segments from the queue, and verify the final state, as sketched below.
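Such a sequence might look like this as a test (the helper names illustrate the steps above and are not necessarily the exact framework API):

```java
// Sketch of a simulation test exercising the example sequence above.
// All helper names here are illustrative.
@Test
public void testLoadWithDelayedInventorySync()
{
  startSimulation(sim);                   // start the simulated cluster
  runCoordinatorCycle();                  // queues segment loads on historicals
  loadQueuedSegments(1);                  // process one segment from the load queue
  syncInventoryView();                    // manually sync the coordinator's inventory
  loadQueuedSegments(Integer.MAX_VALUE);  // process the remaining segments
  verifyClusterState();                   // verify the final state of the cluster
}
```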
Hardly anything is mocked at all. There are test implementations that provide the desired behaviour for the executors and the metadata store.
Tests can choose from two modes of inventory update: auto and manual. In auto update mode, any change made to the cluster is immediately reflected in the coordinator's view.
Changes
Add the following main classes:
- `CoordinatorSimulation` and related interfaces to dictate behaviour of the simulation
- `CoordinatorSimulationBuilder` to build a simulation
- `CoordinatorSimulationBaseTest`, `SegmentBalancingTest`, `SegmentLoadingTest`
- `BlockingExecutorService` to keep submitted tasks in queue and execute them only when explicitly invoked
Provide mocked dependencies to the `DruidCoordinator` for:
- `JacksonConfigManager`
- `LookupCoordinatorManager`

Provide test dependencies to the `DruidCoordinator` for:
- `SegmentsMetadataManager`: keeps a list of used segments in memory
- `HttpClient`: sends segment load requests to respective historicals
- `MetadataRuleManager`: keeps rules in memory and allows for update during simulation
- `ServerInventoryView`: allows synchronization with the current cluster state in the sim
- `ScheduledExecutorFactory`: to create direct or blocked executors and keep a handle to them

Some of these test dependencies can later be consolidated with existing utility classes.
Negative tests
Some of the issues identified using this framework have been put together as negative tests in `SegmentLoadingNegativeTest`. Once the underlying issues are fixed, as detailed in #12881, these tests can be rectified and moved to the regular `SegmentLoadingTest` class.