Add test framework to simulate segment loading and balancing by kfaraz · Pull Request #13074 · apache/druid

kfaraz · 2022-09-12T07:53:37Z

Description

The framework proposed here makes it easy to write tests that verify the behaviour and interactions
of the following entities under various conditions:

DruidCoordinator
HttpLoadQueuePeon, LoadQueueTaskMaster
coordinator duties: BalanceSegments, RunRules, UnloadUnusedSegments, etc.
datasource retention rules: LoadRule, DropRule

The framework provides the ability to:

recreate varied cluster setups
run simulations that can cycle through several coordinator runs quickly
verify results of the simulation using emitted metrics, cluster state and inventory view
modify the state of the simulation during a run: cluster, segments, retention rules, etc.
use multiple execution modes: immediate/lazy segment loading, immediate/lazy inventory sync
write new tests with no additional mocking

Design

The changes here are slightly different from what was originally proposed,
mostly in terms of how each coordinator cycle is invoked.

Execution: A tight dependency on time durations, such as the period of a repeating task
or the delay before a scheduled task, makes the setup precarious and the tests flaky. It also makes
it difficult to reproduce certain situations. All the executors required for coordinator operations
have thus been given only two possible modes here:
- immediate: direct execution on the calling thread
- blocked: tasks kept in a queue until explicitly invoked. Time-based verifications can still be
  done in this mode by simply invoking the tasks at the required time
Actions: The proposal discussed performing cluster actions at the end of each coordinator
run. But the results of these actions were unreliable due to race with tasks submitted by the run,
say loading a segment. With the above modes of execution, it is now possible to perform actions
reliably in our tests and recreate any race condition without getting stuck in one ourselves!
For example, a sequence of steps could be to: run coordinator, load one segment from queue,
sync inventory, load remaining segments from queue and verify the final state.
Dependencies: There is minimal mocking in the framework and new tests need not mock
anything at all. There are some test implementations that provide desired behaviour for
executors and metadata store.
Inventory: The coordinator maintains an inventory view of the cluster state. Simulations can
choose from two modes of inventory update - auto and manual. In auto update mode, any change
made to the cluster is immediately updated in the coordinator view.
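
The two execution modes described above can be sketched as follows. This is only a minimal illustration, not the PR's `BlockingExecutorService`: the class names `DirectExecutor`, `QueueingExecutor` and `ExecutionModesDemo`, and the methods `runPending` and `pendingCount`, are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Immediate mode (hypothetical name): runs each task on the calling thread at submission. */
class DirectExecutor implements Executor
{
  @Override
  public void execute(Runnable task)
  {
    task.run();
  }
}

/** Blocked mode (hypothetical name): keeps tasks queued until the test asks for them to run. */
class QueueingExecutor implements Executor
{
  private final Queue<Runnable> pending = new ArrayDeque<>();

  @Override
  public void execute(Runnable task)
  {
    pending.add(task);
  }

  /** Runs up to maxTasks queued tasks on the calling thread and returns how many ran. */
  int runPending(int maxTasks)
  {
    int executed = 0;
    while (executed < maxTasks && !pending.isEmpty()) {
      pending.poll().run();
      executed++;
    }
    return executed;
  }

  int pendingCount()
  {
    return pending.size();
  }
}

class ExecutionModesDemo
{
  public static void main(String[] args)
  {
    List<String> loaded = new ArrayList<>();

    Executor immediate = new DirectExecutor();
    immediate.execute(() -> loaded.add("segA"));  // done by the time execute() returns

    QueueingExecutor blocked = new QueueingExecutor();
    blocked.execute(() -> loaded.add("segB"));
    blocked.execute(() -> loaded.add("segC"));    // both tasks wait in the queue

    blocked.runPending(1);                        // the test decides when segB loads
    System.out.println(loaded + ", pending=" + blocked.pendingCount());
  }
}
```

Because nothing runs on a background thread or a timer in either mode, a test built this way controls the exact interleaving of tasks.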

Changes

Add the following main classes:

CoordinatorSimulation and related interfaces to dictate behaviour of simulation
CoordinatorSimulationBuilder to build a simulation.
CoordinatorSimulationBaseTest, SegmentBalancingTest, SegmentLoadingTest
BlockingExecutorService to keep submitted tasks in queue and execute them
only when explicitly invoked.

Provide mocked dependencies to the DruidCoordinator for:

JacksonConfigManager
LookupCoordinatorManager

Provide test dependencies to the DruidCoordinator for:

SegmentsMetadataManager: keeps a list of used segments in memory
HttpClient: sends segment load requests to respective historicals
MetadataRuleManager: keeps rules in memory and allows for update during simulation
ServerInventoryView: allows synchronization with the current cluster state in the sim
ScheduledExecutorFactory: to create direct or blocked executors and keep a handle to them

Some of these test dependencies can later be consolidated with existing utility classes.

Negative tests

Some of the issues identified using this framework have been put together as negative
tests in SegmentLoadingNegativeTest.
Once the underlying issues are fixed as detailed in #12881, these tests can be rectified
and moved to the regular SegmentLoadingTest class.

This PR has:

been self-reviewed.
- using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
added documentation for new or modified features or behaviors.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

paul-rogers · 2022-09-16T00:40:55Z

This is great! One general suggestion: provide an overview, in a README.md file, of the general design of the framework. By this I mean, what is being tested, what is simulated, and how the tests work. What I think I learned from looking at the code is:

We simulate the cluster by responding to commands sent from the coordinator.
We wrap the algorithm part of the coordinator in a "fixture" that we invoke from tests.
We verify results by looking at the state of the simulated nodes.

It seems that the tests don't cover the dynamic aspects: load, the threads which decide when to fire off the control tasks in the coordinator, latencies, etc. It is fine to omit these; they are another level of complexity we could add once the basics work. Still, it would be good to state which bits of the code this framework targets.

paul-rogers · 2022-09-16T00:13:18Z

+          alertEvent.getSeverity(),
+          alertEvent.getService(),
+          alertEvent.getFeed(),
+          alertEvent.getDescription()


Pattern expects four strings and a number; only four strings provided.

The last one %n prints a newline.
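
The point about `%n` can be checked with plain `String.format`. The sketch below assumes standard `java.util.Formatter` semantics. The pattern shown is made up for illustration and is not the one from the diff, and `AlertFormatDemo`, `describe` and `isMissingArgument` are hypothetical names.

```java
import java.util.MissingFormatArgumentException;

class AlertFormatDemo
{
  /** Formats four string fields; the trailing %n only emits a line separator. */
  static String describe(String severity, String service, String feed, String description)
  {
    return String.format("[%s] [%s] [%s] [%s]%n", severity, service, feed, description);
  }

  /** Returns true if the pattern needs more arguments than were supplied. */
  static boolean isMissingArgument(String pattern, Object... args)
  {
    try {
      String.format(pattern, args);
      return false;
    }
    catch (MissingFormatArgumentException e) {
      return true;
    }
  }

  public static void main(String[] args)
  {
    System.out.print(describe("anomaly", "coordinator", "alerts", "segment stuck in queue"));
    // %n consumes no argument, so four arguments are enough here.
    System.out.println(isMissingArgument("%s %s %s %s%n", "a", "b", "c", "d"));
    // %d does consume an argument, so this pattern is one argument short.
    System.out.println(isMissingArgument("%s %s %s %s %d", "a", "b", "c", "d"));
  }
}
```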

paul-rogers · 2022-09-16T00:14:42Z

      }

-      final String toLoadQueueSegPath =
+      final String toLoadQueueSegPath = curator == null ? null :


As far as I can tell, none of the arguments here depend on ZK. So, we can define the path even if we don't actually have a ZK.

paul-rogers · 2022-09-16T00:24:57Z

+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.
+ */
+public abstract class CoordinatorSimulationBaseTest


Consider making this a "fixture" rather than a base test. With a fixture, you write a test like so:

    class MyTest
    {
      CoordinatorSimulationFixture fixture = new CoordinatorSimulationFixture(...);

      @Before
      public void setup()
      {
        fixture.setup();
        // My own setup
      }

      @Test
      public void myTest()
      {
        fixture.startSimulation(...);
        fixture.doOtherStuff(...);
      }
    }

The point is that the test class is simple. The test writer can pull in other dependencies without having a multiple-inheritance problem. Etc.

I did start off with the fixture pattern but later decided to have a base test as it helps avoid every statement beginning with fixture., which begins to feel redundant if every line has it.

The tests that have already been added shouldn't be likely to implement other interfaces.
New tests that have conflicting inheritances could always treat the same base test as a fixture, I guess.

You could also have the fixture and then also have a BaseTest implemented using the fixture. Then you are effectively actually using a composable fixture for things (and people can fall back to that if need be), but still don't have to repeat fixture. in all of the places.

Agreed. Although, I think the CoordinatorSimulation object itself is already (somewhat) fulfilling the role of the fixture.

It's easy for a developer to write simulation-based tests without ever having to extend the BaseTest. The BaseTest just provides a bunch of convenience methods which do nothing but delegate to the simulation itself.

In the current tests, other than action invocations on the simulation object,
each test does only the following:

1. declare inputs to the simulation, which would be different for each test case

2. extract and map metrics from the simulation, which can be commoned out into the simulation itself

3. verifications

I will make the updates for 2 thus making the BaseTest an even thinner layer on top of the simulation. But, as there is hardly any common setup required for the tests, I am not sure if we need a fixture just yet.

Edit: There is probably some room to put metric value extraction and verification into a fixture. I will see if we can include it in this PR.
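
The arrangement discussed in this thread, a composable fixture plus a thin base test that only delegates to it, might look roughly like the sketch below. All names (`SimulationFixture`, `SimulationBaseTest`, `InheritingStyleTest`, `ComposingStyleTest`) are hypothetical and the simulation logic is reduced to counters. It is not the code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fixture: owns the simulation state and can be used from any test class. */
class SimulationFixture
{
  private final List<String> loadedSegments = new ArrayList<>();
  private int coordinatorRuns = 0;

  void runCoordinatorCycle()
  {
    coordinatorRuns++;
  }

  void loadSegment(String segmentId)
  {
    loadedSegments.add(segmentId);
  }

  int getCoordinatorRuns()
  {
    return coordinatorRuns;
  }

  List<String> getLoadedSegments()
  {
    return loadedSegments;
  }
}

/** Thin base class: only delegates, so subclasses need not prefix each call with "fixture.". */
abstract class SimulationBaseTest
{
  protected final SimulationFixture fixture = new SimulationFixture();

  protected void runCoordinatorCycle()
  {
    fixture.runCoordinatorCycle();
  }

  protected void loadSegment(String segmentId)
  {
    fixture.loadSegment(segmentId);
  }
}

/** Style 1: extend the base class and call the convenience methods directly. */
class InheritingStyleTest extends SimulationBaseTest
{
  int run()
  {
    runCoordinatorCycle();
    loadSegment("seg1");
    return fixture.getLoadedSegments().size();
  }
}

/** Style 2: a test that already has another superclass composes the fixture instead. */
class ComposingStyleTest
{
  private final SimulationFixture fixture = new SimulationFixture();

  int run()
  {
    fixture.runCoordinatorCycle();
    fixture.runCoordinatorCycle();
    return fixture.getCoordinatorRuns();
  }
}
```

Both styles exercise the same fixture, so a test with a conflicting superclass can fall back to composition without losing any functionality.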

paul-rogers · 2022-09-16T00:30:42Z

+          jacksonConfigManager.watch(
+              EasyMock.eq(CoordinatorCompactionConfig.CONFIG_KEY),
+              EasyMock.eq(CoordinatorCompactionConfig.class),
+              EasyMock.anyObject()


Suggestion: rather than mocking a config, add a constructor or builder to the config. A class that cannot be constructed (as with our config classes) is very tedious to use in tests. Mocking isn't the answer since, if there are methods that compute values, those methods also must be mocked.

Another solution is to use Guice. Use the new startup config builder and friends to pass in the set of properties, then ask the injector to create config instances. Going that route is a bit overkill, but it avoids the need to add constructors: we just use the Json config process we already have.

> Suggestion: rather than mocking a config, add a constructor or builder to the config. A class that cannot be constructed (as with our config classes) is very tedious to use in tests. Mocking isn't the answer since, if there are methods that compute values, those methods also must be mocked.

I agree, most of our configs have private fields, no setters and no builders, and it becomes a pain to use them comfortably in tests.

But here, the configs are not being mocked. Only the config manager is mocked and we are setting expectations on that mock.

paul-rogers · 2022-09-16T00:33:27Z

+    testBalancingWithInventorySynced(false);
+  }
+
+  private void testBalancingWithInventorySynced(boolean autoSyncInventory)


Pretty cool!

kfaraz · 2022-09-16T03:10:51Z

Thanks a lot for the review, @paul-rogers !
I will be sure to include a README to help other developers write more of these tests.

Your understanding of the changes is correct.

> We verify results by looking at the state of the simulated nodes.

We also verify the state of the coordinator itself and the emitted metrics, as the DruidCoordinator is the primary entity under test (I will clarify these in the README).

> It seems that the tests don't cover the dynamic aspects: load, the threads which decided when to fire off the control tasks in the coordinator, latencies, etc.

Yes, we do not verify latency of an operation.
The behaviour to actually load a segment would always be mocked (as it happens on a historical).
Here, we would only want to control when the load happens and whether it succeeds or fails.
The simulation maintains a handle to all the executors used inside the coordinator. It can thus choose
to invoke pending tasks of a certain executor at a certain step to recreate race conditions. For example,
a sequence of steps could be to: run coordinator, load one segment from queue, sync inventory, load
remaining segments from queue and verify the final state.
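
The sequence of steps mentioned above can be walked through on a toy model. The sketch below is a heavily simplified, hypothetical stand-in (`MiniSimulation` and all of its methods are invented for illustration) that only tracks a load queue, the segments actually loaded on a server, and the coordinator's inventory view.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical toy model of a simulation with manual inventory sync. */
class MiniSimulation
{
  private final List<String> usedSegments;
  private final Deque<String> loadQueue = new ArrayDeque<>();
  private final Set<String> loadedOnServer = new LinkedHashSet<>();
  private final Set<String> inventoryView = new LinkedHashSet<>();

  MiniSimulation(List<String> usedSegments)
  {
    this.usedSegments = usedSegments;
  }

  /** One coordinator run: queue each used segment that is neither in the view nor already queued. */
  void runCoordinatorCycle()
  {
    for (String segment : usedSegments) {
      if (!inventoryView.contains(segment) && !loadQueue.contains(segment)) {
        loadQueue.add(segment);
      }
    }
  }

  /** Completes up to maxSegments queued loads on the simulated server. */
  int loadQueuedSegments(int maxSegments)
  {
    int loaded = 0;
    while (loaded < maxSegments && !loadQueue.isEmpty()) {
      loadedOnServer.add(loadQueue.poll());
      loaded++;
    }
    return loaded;
  }

  /** Manual inventory sync: the coordinator's view catches up with the server. */
  void syncInventory()
  {
    inventoryView.clear();
    inventoryView.addAll(loadedOnServer);
  }

  int queuedCount()
  {
    return loadQueue.size();
  }

  int loadedCount()
  {
    return loadedOnServer.size();
  }

  int visibleCount()
  {
    return inventoryView.size();
  }
}

class MiniSimulationDemo
{
  public static void main(String[] args)
  {
    MiniSimulation sim = new MiniSimulation(Arrays.asList("s1", "s2", "s3"));
    sim.runCoordinatorCycle();                  // queues s1, s2 and s3
    sim.loadQueuedSegments(1);                  // only s1 finishes loading
    sim.syncInventory();                        // the view now contains s1
    sim.loadQueuedSegments(Integer.MAX_VALUE);  // s2 and s3 finish after the sync
    // All three segments are loaded, but the view still shows one until the next sync.
    System.out.println(sim.loadedCount() + " loaded, " + sim.visibleCount() + " visible");
  }
}
```

In this toy model, running another coordinator cycle before the next sync would queue s2 and s3 again, since the view does not yet show them. The explicit ordering of steps makes that kind of stale-view situation reproducible.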

imply-cheddar · 2022-09-20T03:54:44Z

+    public String toString()
+    {
+      return "DutiesRunnable{" +
+             "dutiesRunnableAlias='" + dutiesRunnableAlias + '\'' +
+             '}';
+    }


I know this comment isn't about your code, but your addition of the toString here made me wonder why DruidCoordinator's toString reads as "DutiesRunnable". The class is probably large enough (and already depended upon, see @VisibleForTesting annotation peppering this code) that maybe it's just time to promote it to its own class.

Oh, yeah, the VisibleForTesting is ubiquitous 😅

It would be good to pull out DutiesRunnable. Right now, it seems to be directly using pretty much all the fields that DruidCoordinator contains. That's probably why it is still hanging around here and why it is not a static inner class either.

The preferable way to do this would be for DruidCoordinator to expose a bunch of methods that update the state of segmentManager and other fields that DutiesRunnable needs to access. And the DutiesRunnable constructor just gets the DruidCoordinator instance. DruidCoordinator already exposes other such utility methods such as moveSegment() or markSegmentAsUnused() which are used by the actual duties themselves.

Let me know if this approach makes sense. We can get it done in a follow-up PR.

imply-cheddar · 2022-09-20T03:55:57Z

  )
  {

+    log.info("Balancing segments in tier [%s]", tier);


Is there any more information that can be added to this? Just knowing that the balancing occurred is useful, but if we can add sizes or anything else that would help in understanding what happened, that would make it even more useful.

I intend to clean up the logs and add some more useful metrics around balancing/loading as a follow up to these changes.

imply-cheddar · 2022-09-20T03:58:47Z

  {
-    events.add(event);
+    if (event instanceof AlertEvent) {
+      final AlertEvent alertEvent = (AlertEvent) event;


This seems like it is attempting to use logs to validate that alert events were fired? What's wrong with having the AlertEvents in the list? Or, maybe, have 2 lists, one for metrics and one for alerts?

That makes sense. I did have a dedicated list for alerts during my initial testing but later removed it as I wasn't using it in my tests anymore. Will fix it up.
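
A test emitter that keeps metrics and alerts in two separate lists, as suggested above, could be sketched like this. The types `StubEvent`, `StubMetric`, `StubAlert` and `RecordingEmitter` are hypothetical stand-ins and do not mirror the real emitter or event classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the common event type. */
interface StubEvent
{
}

class StubMetric implements StubEvent
{
  final String name;
  final long value;

  StubMetric(String name, long value)
  {
    this.name = name;
    this.value = value;
  }
}

class StubAlert implements StubEvent
{
  final String description;

  StubAlert(String description)
  {
    this.description = description;
  }
}

/** Test emitter that records metrics and alerts separately instead of logging alerts. */
class RecordingEmitter
{
  private final List<StubMetric> metrics = new ArrayList<>();
  private final List<StubAlert> alerts = new ArrayList<>();

  void emit(StubEvent event)
  {
    if (event instanceof StubMetric) {
      metrics.add((StubMetric) event);
    } else if (event instanceof StubAlert) {
      alerts.add((StubAlert) event);
    }
    // Other event types are ignored in this sketch.
  }

  List<StubMetric> getMetrics()
  {
    return metrics;
  }

  List<StubAlert> getAlerts()
  {
    return alerts;
  }
}
```

A test can then assert on the alerts directly, for example by checking the size of `getAlerts()`, rather than inspecting log output.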

imply-cheddar · 2022-09-20T04:03:05Z

+ * Tests that verify balancing behaviour should set
+ * {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
+ * Otherwise, the segment sampling is random and can produce repeated values
+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.


Part of me wonders if this comment doesn't (also?) belong on createDynamicConfig?

imply-cheddar · 2022-09-20T04:07:56Z

+ * leading to flakiness in the tests. The simulation sets this field to true by
+ * default.
+ */
+public abstract class CoordinatorSimulationBaseTest


You could also have the fixture and then also have a BaseTest implemented using the fixture. Then you are effectively actually using a composable fixture for things (and people can fall back to that if need be), but still don't have to repeat fixture. in all of the places.

imply-cheddar · 2022-09-20T04:11:04Z

+- It should not be used to verify the absolute values of execution latencies, e.g. the time taken to compute the
+  balancing cost of a segment. But the relative values can still be a good indicator while doing comparisons between,
+  say two balancing strategies.


What's wrong with trying to use it to benchmark the execution latencies of different balancing strategies?

Or, what's the difference between "verify the absolute values of execution latencies" and "be a good indicator while doing comparisons between, say, two balancing strategies"?

Yeah, I think the language got a little ambiguous.

I meant that we should not try to assert things like "once a segment is queued, it gets processed within 5 seconds" but we can assert things like "across 5 coordinator runs, cachingCost strategy does faster assignment than cost strategy".

I hope that clarifies things a bit. Let me know if the comment should be rephrased.

imply-cheddar · 2022-09-20T04:16:08Z

+      HttpResponseHandler<Intermediate, Final> handler
+  )
+  {
+    throw new UnsupportedOperationException();


Perhaps overly defensive, but I think that this can be a UOE with a message saying that all expected usages go through the 3-argument call. That way, if this does end up getting called at some point, the developer will have an idea of which assumption broke.

I wonder if I shouldn't just allow this one as well. In the 3-arg call, I am not doing anything with the 3rd argument, durationTimeout anyway.
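
The suggestion of a descriptive `UnsupportedOperationException` could look like the sketch below. `StubbableClient` and `TestClient` are hypothetical, and their `String`-based method signatures are simplified placeholders rather than the signatures of the real HTTP client.

```java
/** Hypothetical two-method client interface standing in for the real HTTP client. */
interface StubbableClient
{
  String go(String request, String handler);

  String go(String request, String handler, long timeoutMillis);
}

class TestClient implements StubbableClient
{
  @Override
  public String go(String request, String handler)
  {
    // Fail loudly and name the assumption that was broken.
    throw new UnsupportedOperationException(
        "Expected all calls to use the 3-argument go(request, handler, timeout)"
    );
  }

  @Override
  public String go(String request, String handler, long timeoutMillis)
  {
    // The timeout is ignored in this sketch; the response is produced immediately.
    return "handled:" + request;
  }
}
```

The alternative raised in the reply, having the 2-argument method delegate to the 3-argument one, would be a one-line change in this sketch, since the timeout is ignored anyway.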

cheddar

Most of my comments were not functional, just design suggestions and discussion of the location of comments. This is also pretty much entirely in the test code, so very low risk in terms of runtime issues. So, gonna go ahead and approve. Please consider the comments and stuff before merge though.

kfaraz · 2022-09-20T04:53:16Z

Thanks for the review, @cheddar ! Some of the changes you have suggested are planned for future PRs.
I will include the smaller ones here.

AmatyaAvadhanula · 2022-10-02T09:26:53Z

+      server.removeDataSegment(segment.getId());
+      segmentCallbacks.forEach(
+          (segmentCallback, executor) -> executor.execute(
+              () -> segmentCallback.segmentAdded(server.getMetadata(), segment)


segmentAdded -> segmentDropped?

Add coordinator test framework

fa248fa

kfaraz added the Area - Segment Balancing/Coordination label Sep 12, 2022

kfaraz changed the title ~~Add coordinator test framework~~ Add test framework to simulate segment loading and balancing Sep 12, 2022

kfaraz added 6 commits September 12, 2022 13:41

Remove outdated changes

61c71ec

Add more tests

63bf325

Add option to auto-sync inventory

b9db352

Minor cleanup

b4c83dd

Fix inspections

4dc210f

Merge branch 'master' of https://github.com/apache/druid into coordin…

76024b4

…ator_test_fwrk

kfaraz added the Design Review label Sep 15, 2022

abhishekagarwal87 added the Area - Testing label Sep 15, 2022

paul-rogers reviewed Sep 16, 2022

Add README for simulations, add SegmentLoadingNegativeTest

32c9f14

This was referenced Sep 18, 2022

Fix over-replication caused by balancing when inventory is not updated yet kfaraz/druid#14

Closed

Fix over-replication caused by balancing when inventory is not updated yet #13114

Merged

kfaraz added 2 commits September 18, 2022 17:18

Merge branch 'master' of github.com:apache/druid into coordinator_tes…

c8c344e

…t_fwrk

Add license to simulate/README

3a3d2c4

imply-cheddar reviewed Sep 20, 2022

cheddar approved these changes Sep 20, 2022

kfaraz added 2 commits September 20, 2022 12:40

Collect ServiceMetricEvents in StubServiceEmitter

2479e4e

Fix checkstyle

2bf2985

kfaraz merged commit 0039409 into apache:master Sep 21, 2022

kfaraz deleted the coordinator_test_fwrk branch September 21, 2022 08:06

AmatyaAvadhanula reviewed Oct 2, 2022

kfaraz added this to the 25.0 milestone Nov 22, 2022

kfaraz mentioned this pull request Jan 19, 2024

A simulator for segment balancing by the coordinator #9087

Closed
