Skip to content

supervisor/autoscaler: Fix clearing of collected lags on skipped scale actions#17356

Merged
suneet-s merged 6 commits intoapache:masterfrom
ac9817:fix-scaler-bug
Oct 17, 2024
Merged

supervisor/autoscaler: Fix clearing of collected lags on skipped scale actions#17356
suneet-s merged 6 commits intoapache:masterfrom
ac9817:fix-scaler-bug

Conversation

@ac9817
Copy link
Copy Markdown
Contributor

@ac9817 ac9817 commented Oct 15, 2024

Description

There are two things that are done after a scaling action is successfully initiated,

  • Clear the collected lag metrics until now
  • Record the time so that we don't scale for next minTriggerScaleActionFrequencyMillis

Currently, these two actions are decoupled and there could be cases when supervisor could be publishing or when minTriggerScaleActionFrequencyMillis has not yet been elapsed in which case the scaling action could be skipped. While skipping that action, we are clearing the collected lag metrics but not actually updating the task count. So the next decision for scaling would be based on looking at shorter windows than configured which could cause more scaling actions than required.

This PR updates to clear the collected lag only when supervisor is truly scaled.


Key changed/added classes in this PR
  • SeekableStreamSupervisor
  • LagBasedAutoScaler

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change looks good, but have some naming suggestions to make things a little clearer to the reader.

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI

@github-actions github-actions Bot added the GHA label Oct 16, 2024
public void testSeekableStreamSupervisorSpecWithScaleInThresholdGreaterThanPartitions() throws InterruptedException
{
EasyMock.expect(spec.getSupervisorStateManagerConfig()).andReturn(supervisorConfig).anyTimes();
EasyMock.expect(spec.getDataSchema()).andReturn(getDataSchema()).anyTimes();

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation

Invoking [SeekableStreamSupervisorSpec.getDataSchema](1) should be avoided because it has been deprecated.
Thread.sleep(1 * 1000);
int taskCountAfterScale = supervisor.getIoConfig().getTaskCount();
Assert.assertEquals(2, taskCountAfterScale);
autoscaler.stop();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was causing issues randomly because the autoscaler was kept running and was emitting events which was same listener for the whole suite of tests in this class.

@abhishekrb19 abhishekrb19 removed the GHA label Oct 17, 2024
@suneet-s
Copy link
Copy Markdown
Contributor

Overruling CI - The code changes have no impact on the SuperSorterTest AFAICT.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe27cd15629, pid=3524, tid=3541
#
# JRE version: OpenJDK Runtime Environment Zulu21.38+21-CA (21.0.5+11) (build 21.0.5+11-LTS)
# Java VM: OpenJDK 64-Bit Server VM Zulu21.38+21-CA (21.0.5+11-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xf15629]  BoolNode::Ideal(PhaseGVN*, bool)+0x19
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/runner/work/druid/druid/processing/core.3524)
#
# JFR recording file will be written. Location: /home/runner/work/druid/druid/processing/hs_err_pid3524.jfr
#
# An error report file with more information is saved as:
# /home/runner/work/druid/druid/processing/hs_err_pid3524.log
#
# Compiler replay data is saved as:
# /home/runner/work/druid/druid/processing/replay_pid3524.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azul.com/support/
#
Aborted (core dumped)
[INFO] 
[INFO] Results:
[INFO] 
Error:  Failures: 
Error:    SuperSorterTest$ParameterizedCasesTest.test_clusterByQualityLongDescRowNumberAsc_fourPartitions:714->verifySuperSorter:455 expected:<1.0> but was:<0.6666666666666666>
[INFO] 
Error:  Tests run: 265141, Failures: 1, Errors: 0, Skipped: [6715](https://github.com/apache/druid/actions/runs/11371919207/job/31688617666?pr=17356#step:11:6716)

@suneet-s suneet-s merged commit e834e49 into apache:master Oct 17, 2024
@adarshsanjeev adarshsanjeev added this to the 32.0.0 milestone Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants