Add test framework to simulate segment loading and balancing #13074

Merged
kfaraz merged 12 commits into apache:master from kfaraz:coordinator_test_fwrk
Sep 21, 2022

@kfaraz
Contributor

@kfaraz kfaraz commented Sep 12, 2022

Fixes #12822

Description

The framework proposed here makes it easy to write tests that verify the behaviour and interactions
of the following entities under various conditions:

  • DruidCoordinator
  • HttpLoadQueuePeon, LoadQueueTaskMaster
  • coordinator duties: BalanceSegments, RunRules, UnloadUnusedSegments, etc.
  • datasource retention rules: LoadRule, DropRule

The framework provides the ability to:

  • recreate varied cluster setups
  • run simulations that can cycle through several coordinator runs quickly
  • verify results of the simulation using emitted metrics, cluster state and inventory view
  • modify the state of the simulation during a run: cluster, segments, retention rules, etc.
  • use multiple execution modes: immediate/lazy segment loading, immediate/lazy inventory sync
  • write new tests with no additional mocking

Design

The changes here are slightly different from what was originally proposed,
mostly in terms of how each coordinator cycle is invoked.

  1. Execution: A tight dependency on time durations such as the period of a repeating task
    or the delay before a scheduled task makes the setup precarious and the tests flaky. It also makes
    it difficult to reproduce certain situations. All the executors required for coordinator operations
    have thus been given only two possible modes here:
    • immediate: direct execution on the calling thread
    • blocked: tasks kept in a queue until explicitly invoked. Time-based verifications can still be
      done in this mode by simply invoking the tasks at the required time
  2. Actions: The proposal discussed performing cluster actions at the end of each coordinator
    run. But the results of these actions were unreliable due to races with tasks submitted by the run,
    say loading a segment. With the above modes of execution, it is now possible to perform actions
    reliably in our tests and recreate any race condition without getting stuck in one ourselves!
    For example, a sequence of steps could be to: run coordinator, load one segment from queue,
    sync inventory, load remaining segments from queue and verify the final state.
  3. Dependencies: There is minimal mocking in the framework and new tests need not mock
    anything at all. There are some test implementations that provide desired behaviour for
    executors and metadata store.
  4. Inventory: The coordinator maintains an inventory view of the cluster state. Simulations can
    choose from two modes of inventory update - auto and manual. In auto update mode, any change
    made to the cluster is immediately updated in the coordinator view.
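
The "blocked" execution mode and the step sequence from point 2 can be sketched as follows. This is only an illustration, not the `BlockingExecutorService` added in this PR: `QueueingExecutor`, `runNext`, `runAll` and `BlockedModeDemo` are hypothetical names, and the "segments" are plain strings.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Executor;

/** Hypothetical "blocked" executor: queues tasks until the test explicitly runs them. */
class QueueingExecutor implements Executor
{
  private final Queue<Runnable> pending = new ArrayDeque<>();

  @Override
  public void execute(Runnable task)
  {
    pending.add(task);
  }

  /** Runs the oldest pending task on the calling thread; returns false if none is queued. */
  boolean runNext()
  {
    final Runnable task = pending.poll();
    if (task == null) {
      return false;
    }
    task.run();
    return true;
  }

  /** Runs all pending tasks and returns how many were run. */
  int runAll()
  {
    int count = 0;
    while (runNext()) {
      count++;
    }
    return count;
  }
}

class BlockedModeDemo
{
  /** Replays: queue three loads, run one, observe, then run the rest. */
  static List<String> replay()
  {
    final QueueingExecutor loadQueue = new QueueingExecutor();
    final List<String> loaded = new ArrayList<>();
    final List<String> observed = new ArrayList<>();

    // "Coordinator run": submits load tasks, but nothing executes yet.
    for (String segment : new String[]{"seg1", "seg2", "seg3"}) {
      loadQueue.execute(() -> loaded.add(segment));
    }
    observed.add("after run: " + loaded.size());

    // Load exactly one segment from the queue, then inspect the state.
    loadQueue.runNext();
    observed.add("after first load: " + loaded.size());

    // Load the remaining segments and inspect the final state.
    loadQueue.runAll();
    observed.add("after remaining loads: " + loaded.size());
    return observed;
  }

  public static void main(String[] args)
  {
    replay().forEach(System.out::println);
  }
}
```

The "immediate" mode needs no special class in this sketch: an `Executor` written as `Runnable::run` executes each task directly on the calling thread.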

Changes

Add the following main classes:

  • CoordinatorSimulation and related interfaces to dictate behaviour of simulation
  • CoordinatorSimulationBuilder to build a simulation.
  • CoordinatorSimulationBaseTest, SegmentBalancingTest, SegmentLoadingTest
  • BlockingExecutorService to keep submitted tasks in queue and execute them
    only when explicitly invoked.

Provide mocked dependencies to the DruidCoordinator for:

  • JacksonConfigManager
  • LookupCoordinatorManager

Provide test dependencies to the DruidCoordinator for:

  • SegmentsMetadataManager: keeps a list of used segments in memory
  • HttpClient: sends segment load requests to respective historicals
  • MetadataRuleManager: keeps rules in memory and allows for update during simulation
  • ServerInventoryView: allows synchronization with the current cluster state in the sim
  • ScheduledExecutorFactory: to create direct or blocked executors and keep a handle to them

Some of these test dependencies can later be consolidated with existing utility classes.
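
The difference between the auto and manual inventory update modes can be illustrated with a small sketch. It does not use the test `ServerInventoryView` from this PR; `InventorySketch` and its methods are hypothetical, and segments are represented by plain string ids.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a cluster plus the coordinator's view of it. */
class InventorySketch
{
  private final boolean autoSync;
  private final Set<String> clusterSegments = new HashSet<>();
  private final Set<String> coordinatorView = new HashSet<>();

  InventorySketch(boolean autoSync)
  {
    this.autoSync = autoSync;
  }

  /** A historical finishes loading a segment. */
  void loadSegment(String segmentId)
  {
    clusterSegments.add(segmentId);
    if (autoSync) {
      // Auto mode: every cluster change is reflected in the view immediately.
      synchronizeInventory();
    }
  }

  /** Manual mode: the test decides when the view catches up with the cluster. */
  void synchronizeInventory()
  {
    coordinatorView.clear();
    coordinatorView.addAll(clusterSegments);
  }

  boolean coordinatorSees(String segmentId)
  {
    return coordinatorView.contains(segmentId);
  }
}
```

In manual mode, a test can check what the coordinator does while its view is stale, then call `synchronizeInventory()` and check again.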

Negative tests

Some of the issues identified using this framework have been put together as negative
tests in SegmentLoadingNegativeTest.
Once the underlying issues are fixed as detailed in #12881, these tests can be rectified
and moved to the regular SegmentLoadingTest class.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@kfaraz kfaraz changed the title from "Add coordinator test framework" to "Add test framework to simulate segment loading and balancing" Sep 12, 2022
@paul-rogers
Contributor

This is great! One general suggestion: provide an overview, in a README.md file, of the general design of the framework. By this I mean, what is being tested, what is simulated, and how the tests work. What I think I learned from looking at the code is:

  • We simulate the cluster by responding to commands sent from the coordinator.
  • We wrap the algorithm part of the coordinator in a "fixture" that we invoke from tests.
  • We verify results by looking at the state of the simulated nodes.

It seems that the tests don't cover the dynamic aspects: load, the threads which decide when to fire off the control tasks in the coordinator, latencies, etc. It is fine to omit these; they are another level of complexity we could add once the basics work. Still, it would be good to state which bits of the code this framework targets.

alertEvent.getSeverity(),
alertEvent.getService(),
alertEvent.getFeed(),
alertEvent.getDescription()
Contributor

Pattern expects four strings and a number; only four strings provided.

Contributor Author

The last one, %n, prints a newline.
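
For reference, a short sketch of how `%n` behaves in a Java format string: it consumes no argument and expands to the platform line separator. The pattern below is made up for illustration and is not the one used in the PR.

```java
class FormatNewlineDemo
{
  static String describe(String severity, String service, String feed, String description)
  {
    // Four %s specifiers consume the four arguments; %n takes no argument
    // and expands to the platform line separator.
    return String.format("[%s] [%s] [%s]: %s%n", severity, service, feed, description);
  }

  public static void main(String[] args)
  {
    System.out.print(describe("anomaly", "coordinator", "alerts", "sample description"));
  }
}
```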

}

- final String toLoadQueueSegPath =
+ final String toLoadQueueSegPath = curator == null ? null :
Contributor

As far as I can tell, none of the arguments here depend on ZK. So, we can define the path even if we don't actually have a ZK.

Contributor Author

Fixed.

* leading to flakiness in the tests. The simulation sets this field to true by
* default.
*/
public abstract class CoordinatorSimulationBaseTest
Contributor

Consider making this a "fixture" rather than a base test. With a fixture, you write a test like so:

class MyTest
{
  CoordinatorSimulationFixture fixture = new CoordinatorSimulationFixture(...);

  @Before
  public void setup()
  {
    fixture.setup();
    // My own setup
  }

  @Test
  public void myTest()
  {
    fixture.startSimulation(...);
    fixture.doOtherStuff(...);
  }
}

The point is that the test class is simple. The test writer can pull in other dependencies without having a multiple-inheritance problem. Etc.

Contributor Author

@kfaraz kfaraz Sep 16, 2022

I did start off with the fixture pattern but later decided to have a base test as it helps avoid every statement beginning with fixture., which begins to feel redundant if every line has it.

The tests that have already been added are unlikely to need to implement other interfaces.
New tests that have conflicting inheritances could always treat the same base test as a fixture, I guess.

Contributor

You could also have the fixture and then have a BaseTest implemented using the fixture. Then you are effectively using a composable fixture for things (and people can fall back to that if need be), but still don't have to repeat fixture. everywhere.

Contributor Author

@kfaraz kfaraz Sep 20, 2022

Agreed. Although, I think the CoordinatorSimulation object itself is already (somewhat) fulfilling the role of the fixture.

It's easy for a developer to write simulation-based tests without ever having to extend the BaseTest. The BaseTest just provides a bunch of convenience methods which do nothing but delegate to the simulation itself.

In the current tests, other than action invocations on the simulation object,
each test does only the following:

  1. declare inputs to the simulation, which would be different for each test case
  2. extract and map metrics from the simulation, which can be factored out into the simulation itself
  3. verifications

I will make the updates for point 2, thus making the BaseTest an even thinner layer on top of the simulation. But, as there is hardly any common setup required for the tests, I am not sure if we need a fixture just yet.

Edit: There is probably some room to put metric value extraction and verification into a fixture. I will see if we can include it in this PR.
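
The combination discussed in this thread (a composable fixture plus a thin base class that only delegates to it) could look roughly like the sketch below. All class and method names are hypothetical and JUnit annotations are left out to keep it self-contained; it is not the structure that was merged.

```java
/** Hypothetical composable fixture that owns the simulation state. */
class SimulationFixture
{
  private boolean started;
  private int runs;

  void startSimulation()
  {
    started = true;
  }

  void runCoordinatorCycle()
  {
    if (!started) {
      throw new IllegalStateException("Simulation has not been started");
    }
    runs++;
  }

  int getRunCount()
  {
    return runs;
  }
}

/** Thin base class: every helper simply delegates to the fixture. */
abstract class SimulationBaseTest
{
  protected final SimulationFixture fixture = new SimulationFixture();

  protected void startSimulation()
  {
    fixture.startSimulation();
  }

  protected void runCoordinatorCycle()
  {
    fixture.runCoordinatorCycle();
  }

  protected int getRunCount()
  {
    return fixture.getRunCount();
  }
}

/** Extends the base class, so no "fixture." prefix is needed. */
class InheritingTest extends SimulationBaseTest
{
  int runTwice()
  {
    startSimulation();
    runCoordinatorCycle();
    runCoordinatorCycle();
    return getRunCount();
  }
}

/** A test that cannot extend the base class composes the fixture directly. */
class ComposingTest
{
  private final SimulationFixture fixture = new SimulationFixture();

  int runOnce()
  {
    fixture.startSimulation();
    fixture.runCoordinatorCycle();
    return fixture.getRunCount();
  }
}
```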

jacksonConfigManager.watch(
EasyMock.eq(CoordinatorCompactionConfig.CONFIG_KEY),
EasyMock.eq(CoordinatorCompactionConfig.class),
EasyMock.anyObject()
Contributor

Suggestion: rather than mocking a config, add a constructor or builder to the config. A class that cannot be constructed (as with our config classes) is very tedious to use in tests. Mocking isn't the answer since, if there are methods that compute values, those methods also must be mocked.

Another solution is to use Guice. Use the new startup config builder and friends to pass in the set of properties, then ask the injector to create config instances. Going that route is a bit overkill, but it avoids the need to add constructors: we just use the Json config process we already have.

Contributor Author

@kfaraz kfaraz Sep 16, 2022

> Suggestion: rather than mocking a config, add a constructor or builder to the config. A class that cannot be constructed (as with our config classes) is very tedious to use in tests. Mocking isn't the answer since, if there are methods that compute values, those methods also must be mocked.

I agree, most of our configs have private fields, no setters and no builders, and it becomes a pain to use them comfortably in tests.

But here, the configs are not being mocked. Only the config manager is mocked and we are setting expectations on that mock.
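
As an illustration of the builder suggestion above, here is a sketch of a config class that tests can construct directly, so that computed values come from real code rather than from stubbed mock methods. `SketchConfig`, its fields and its defaults are all hypothetical and do not correspond to any existing Druid config.

```java
/** Hypothetical config with a builder, so tests can construct it without mocks. */
final class SketchConfig
{
  private final double slotRatio;
  private final int maxSlots;

  private SketchConfig(Builder builder)
  {
    this.slotRatio = builder.slotRatio;
    this.maxSlots = builder.maxSlots;
  }

  /** Computed value: with a mock, this method would also have to be stubbed. */
  int availableSlots(int totalCapacity)
  {
    return Math.min(maxSlots, (int) (totalCapacity * slotRatio));
  }

  static Builder builder()
  {
    return new Builder();
  }

  static final class Builder
  {
    // Defaults are arbitrary values chosen for the sketch.
    private double slotRatio = 0.1;
    private int maxSlots = Integer.MAX_VALUE;

    Builder slotRatio(double slotRatio)
    {
      this.slotRatio = slotRatio;
      return this;
    }

    Builder maxSlots(int maxSlots)
    {
      this.maxSlots = maxSlots;
      return this;
    }

    SketchConfig build()
    {
      return new SketchConfig(this);
    }
  }
}
```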

testBalancingWithInventorySynced(false);
}

private void testBalancingWithInventorySynced(boolean autoSyncInventory)
Contributor

Pretty cool!

@kfaraz
Contributor Author

kfaraz commented Sep 16, 2022

Thanks a lot for the review, @paul-rogers !
I will be sure to include a README to help other developers write more of these tests.

Your understanding of the changes is correct.

> We verify results by looking at the state of the simulated nodes.

We also verify the state of the coordinator itself and the emitted metrics, as the DruidCoordinator is the primary entity under test (I will clarify these in the README).

> It seems that the tests don't cover the dynamic aspects: load, the threads which decide when to fire off the control tasks in the coordinator, latencies, etc.

  • Yes, we do not verify latency of an operation.
  • The behaviour to actually load a segment would always be mocked (as it happens on a historical).
    Here, we would only want to control when the load happens and whether it succeeds or fails.
  • The simulation maintains a handle to all the executors used inside the coordinator. It can thus choose
    to invoke pending tasks of a certain executor at a certain step to recreate race conditions. For example,
    a sequence of steps could be to: run coordinator, load one segment from queue, sync inventory, load
    remaining segments from queue and verify the final state.

Comment on lines +978 to +983
public String toString()
{
  return "DutiesRunnable{" +
         "dutiesRunnableAlias='" + dutiesRunnableAlias + '\'' +
         '}';
}
Contributor

I know this comment isn't about your code, but your addition of the toString here made me wonder why DruidCoordinator's toString reads as "DutiesRunnable". The class is probably large enough (and already depended upon, see @VisibleForTesting annotation peppering this code) that maybe it's just time to promote it to its own class.

Contributor Author

@kfaraz kfaraz Sep 20, 2022

Oh, yeah, the VisibleForTesting is ubiquitous 😅

It would be good to pull out DutiesRunnable. Right now, it seems to be directly using pretty much all the fields that DruidCoordinator contains. That's probably why it is still hanging around here and why it is not a static inner class either.

The preferable way to do this would be for DruidCoordinator to expose a bunch of methods that update the state of segmentManager and other fields that DutiesRunnable needs to access. And the DutiesRunnable constructor just gets the DruidCoordinator instance. DruidCoordinator already exposes other such utility methods such as moveSegment() or markSegmentAsUnused() which are used by the actual duties themselves.

Let me know if this approach makes sense. We can get it done in a follow-up PR.
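
A rough sketch of the proposed shape: the runnable becomes a top-level class that receives the coordinator instance and goes through its exposed methods instead of reading its fields. `SketchCoordinator` and `SketchDutiesRunnable` are hypothetical, and the body of `run()` is a placeholder rather than a real duty.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical coordinator exposing narrow state-changing methods. */
class SketchCoordinator
{
  private final List<String> unusedSegments = new ArrayList<>();

  void markSegmentAsUnused(String segmentId)
  {
    unusedSegments.add(segmentId);
  }

  List<String> getUnusedSegments()
  {
    return unusedSegments;
  }
}

/** Top-level runnable that receives the coordinator instead of reading its fields. */
class SketchDutiesRunnable implements Runnable
{
  private final SketchCoordinator coordinator;
  private final String dutiesRunnableAlias;

  SketchDutiesRunnable(SketchCoordinator coordinator, String dutiesRunnableAlias)
  {
    this.coordinator = coordinator;
    this.dutiesRunnableAlias = dutiesRunnableAlias;
  }

  @Override
  public void run()
  {
    // Placeholder duty: a real duty would decide which segments to mark.
    coordinator.markSegmentAsUnused("segment-from-" + dutiesRunnableAlias);
  }

  @Override
  public String toString()
  {
    return "DutiesRunnable{dutiesRunnableAlias='" + dutiesRunnableAlias + "'}";
  }
}
```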

)
{

log.info("Balancing segments in tier [%s]", tier);
Contributor

Is there any more information that can be added to this? Having just the fact that the balancing occurred is useful, but if we can add sizes or anything else that might be nice to have when trying to understand what happened, that can make it even more useful.

Contributor Author

I intend to clean up the logs and add some more useful metrics around balancing/loading as a follow up to these changes.

{
events.add(event);
if (event instanceof AlertEvent) {
final AlertEvent alertEvent = (AlertEvent) event;
Contributor

This seems like it is attempting to use logs to validate that alert events were fired? What's wrong with having the AlertEvents in the list? Or, maybe, have 2 lists, one for metrics and one for alerts?

Contributor Author

@kfaraz kfaraz Sep 20, 2022

That makes sense. I did have a dedicated list for alerts during my initial testing but later removed it as I wasn't using it in my tests anymore. Will fix it up.
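
The "two lists" idea could be sketched as below, with alerts and metrics recorded separately so that tests can assert on either one without parsing logs. The event types and the `RecordingEmitter` class are hypothetical simplifications, not the emitter classes touched by this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical event types used only by this sketch. */
interface SketchEvent
{
}

class SketchMetricEvent implements SketchEvent
{
  final String metric;
  final long value;

  SketchMetricEvent(String metric, long value)
  {
    this.metric = metric;
    this.value = value;
  }
}

class SketchAlertEvent implements SketchEvent
{
  final String description;

  SketchAlertEvent(String description)
  {
    this.description = description;
  }
}

/** Test emitter that records metrics and alerts in separate lists. */
class RecordingEmitter
{
  private final List<SketchMetricEvent> metricEvents = new ArrayList<>();
  private final List<SketchAlertEvent> alertEvents = new ArrayList<>();

  void emit(SketchEvent event)
  {
    if (event instanceof SketchAlertEvent) {
      alertEvents.add((SketchAlertEvent) event);
    } else if (event instanceof SketchMetricEvent) {
      metricEvents.add((SketchMetricEvent) event);
    }
  }

  List<SketchMetricEvent> getMetricEvents()
  {
    return metricEvents;
  }

  List<SketchAlertEvent> getAlertEvents()
  {
    return alertEvents;
  }
}
```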

Comment on lines +48 to +52
* Tests that verify balancing behaviour should set
* {@link CoordinatorDynamicConfig#useBatchedSegmentSampler()} to true.
* Otherwise, the segment sampling is random and can produce repeated values
* leading to flakiness in the tests. The simulation sets this field to true by
* default.
Contributor

Part of me wonders if this comment doesn't (also?) belong on createDynamicConfig?

Comment on lines +74 to +76
- It should not be used to verify the absolute values of execution latencies, e.g. the time taken to compute the
balancing cost of a segment. But the relative values can still be a good indicator while doing comparisons between,
say two balancing strategies.
Contributor

What's wrong with trying to use it to benchmark the execution latencies of different balancing strategies?

Or: what's the difference between "verify the absolute values of execution latencies" and "be a good indicator while doing comparisons between, say, two balancing strategies"?

Contributor Author

Yeah, I think the language got a little ambiguous.

I meant that we should not try to assert things like "once a segment is queued, it gets processed within 5 seconds" but we can assert things like "across 5 coordinator runs, cachingCost strategy does faster assignment than cost strategy".

I hope that clarifies things a bit. Let me know if the comment should be rephrased.

HttpResponseHandler<Intermediate, Final> handler
)
{
throw new UnsupportedOperationException();
Contributor

Perhaps overly defensive, but I think that this can be an UOE with a message about all expected usages going through the 3-argument call. That way, if this actually does end up getting called at some point in time, the developer will have an idea for what assumption broke.

Contributor Author

I wonder if I shouldn't just allow this one as well. In the 3-arg call, I am not doing anything with the 3rd argument, durationTimeout, anyway.
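
The defensive variant suggested above might look like the following sketch: the unsupported overload throws with a message naming the assumption that was broken. `SketchHttpClient` is a hypothetical, simplified interface using strings in place of the real request and handler types.

```java
/** Hypothetical two-overload client interface, simplified for the sketch. */
interface SketchHttpClient
{
  String go(String request, String handler);

  String go(String request, String handler, long timeoutMillis);
}

class SketchTestHttpClient implements SketchHttpClient
{
  @Override
  public String go(String request, String handler)
  {
    // The message tells a future developer which assumption no longer holds.
    throw new UnsupportedOperationException(
        "Expected all requests to use the 3-argument go(); the 2-argument overload is not simulated"
    );
  }

  @Override
  public String go(String request, String handler, long timeoutMillis)
  {
    // The timeout is ignored in this sketch.
    return "handled " + request;
  }
}
```

The alternative raised in the reply, letting the 2-argument overload delegate to the 3-argument one, would replace the `throw` with a single delegating call.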

Contributor

@cheddar cheddar left a comment

Most of my comments were not functional, just design suggestions and discussion of the location of comments. This is also pretty much entirely in the test code, so very low risk in terms of runtime issues. So, gonna go ahead and approve. Please consider the comments and stuff before merge though.

@kfaraz
Contributor Author

kfaraz commented Sep 20, 2022

Thanks for the review, @cheddar ! Some of the changes you have suggested are planned for future PRs.
I will include the smaller ones here.

@kfaraz kfaraz merged commit 0039409 into apache:master Sep 21, 2022
@kfaraz kfaraz deleted the coordinator_test_fwrk branch September 21, 2022 08:06
server.removeDataSegment(segment.getId());
segmentCallbacks.forEach(
(segmentCallback, executor) -> executor.execute(
() -> segmentCallback.segmentAdded(server.getMetadata(), segment)
Contributor

segmentAdded -> segmentDropped?
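
The issue flagged here is that a removal path notifies listeners through the "added" callback. A minimal sketch of the intended pairing, with add and remove each firing its own callback, is shown below. The interface, its method names and `SketchInventoryView` are hypothetical and use strings in place of server and segment objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executor;

/** Hypothetical callback interface; method names are illustrative. */
interface SketchSegmentCallback
{
  void segmentAdded(String server, String segment);

  void segmentRemoved(String server, String segment);
}

class SketchInventoryView
{
  private final Map<SketchSegmentCallback, Executor> segmentCallbacks = new LinkedHashMap<>();
  private final List<String> segments = new ArrayList<>();

  void registerCallback(Executor executor, SketchSegmentCallback callback)
  {
    segmentCallbacks.put(callback, executor);
  }

  void addSegment(String server, String segment)
  {
    segments.add(segment);
    segmentCallbacks.forEach(
        (callback, executor) -> executor.execute(() -> callback.segmentAdded(server, segment))
    );
  }

  void removeSegment(String server, String segment)
  {
    segments.remove(segment);
    // Removal notifies through the removal callback, not segmentAdded.
    segmentCallbacks.forEach(
        (callback, executor) -> executor.execute(() -> callback.segmentRemoved(server, segment))
    );
  }
}
```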

Development

Successfully merging this pull request may close these issues.

Proposal: Test framework to simulate segment balancing

6 participants