Skip to content

Multiple fixes for the MSQ stats merging piece#13463

Merged
cryptoe merged 11 commits intoapache:masterfrom
adarshsanjeev:worker-api-validation
Dec 15, 2022
Merged

Multiple fixes for the MSQ stats merging piece#13463
cryptoe merged 11 commits intoapache:masterfrom
adarshsanjeev:worker-api-validation

Conversation

@adarshsanjeev
Copy link
Copy Markdown
Contributor

@adarshsanjeev adarshsanjeev commented Nov 30, 2022

Multiple fixes for the MSQ stats merging piece which was revamped as part of #13205 .

  1. Added validation checks to worker chat handler APIs so that the controller exits instead of forever calling the workers.
  2. Added logging
  3. Fixes race inside WorkerStageKernel
  4. Fixes a bug where the wrong workers were being contacted for the stats
  5. Improves exception handling
  6. Reverts the default mode in clusterByStatisticsMode to PARALLEL from AUTO

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

.build();
}
catch (ISE e) {
log.error(e, "Invalid request for key statistics");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return the error message to the client ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do that, I wasn't sure if this needed to be returned to the caller or just logged. I would also assume that we should call ISE.sanitize() before we do so, correct?

Copy link
Copy Markdown
Contributor

@cryptoe cryptoe Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. That makes sense

@cryptoe cryptoe added this to the 25.0 milestone Dec 5, 2022
@cryptoe cryptoe added the Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 label Dec 5, 2022
@kfaraz
Copy link
Copy Markdown
Contributor

kfaraz commented Dec 10, 2022

@cryptoe , it seems that the coverage is low.

Fixing race
Changing default mode to Parallel
Adding logging.
Fixing exceptions not propagated properly.
@cryptoe cryptoe changed the title Add validation checks to worker chat handler APIs Multiple fixes for the MSQ stats merging piece which Dec 13, 2022
@cryptoe cryptoe added the Bug label Dec 13, 2022
Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor queries and suggestions, otherwise LGTM.

.indexIO(indexIO)
.indexMergerV9(indexMerger)
.taskReportFileWriter(
new TaskReportFileWriter()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: You could use the existing NoopTaskReportFileWriter here instead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able to use that because it was in the indexing module.

Copy link
Copy Markdown
Contributor

@cryptoe cryptoe Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did not make sense to add a dependency for just NoopTaskReportFIleWriter.

OffHeapMemorySegmentWriteOutMediumFactory.instance()
);

mocks = MockitoAnnotations.openMocks(this);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: Since there is a single mocked instance, I suppose it might be simpler to just use Mockito.mock() for the HttpServletRequest req rather than using MockitoAnnotations.

If you do want to use @Mock however, the preferred method would be annotate the class with @RunWith(MockitoJUnitRunner)

Comment on lines 574 to 583
if (stageKernelMap.get(stageId) == null) {
throw new ISE("Requested statistics snapshot for non-existent stageId %s.", stageId);
}
if (stageKernelMap.get(stageId).getResultKeyStatisticsSnapshot() == null) {
throw new ISE(
"Requested statistics snapshot is not generated yet for stageId[%s]",
stageId
);
}
return stageKernelMap.get(stageId).getResultKeyStatisticsSnapshot();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: might be easier to read as an if-else-if chain.

if (queryKernel.getStagePhase(stageId).equals(ControllerStagePhase.MERGING_STATISTICS)) {
List<String> workerTaskIds = workerTaskLauncher.getTaskList();
// we only need tasks which are active for this stage.
List<String> workerTaskIds = workerTaskLauncher.getTaskList()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't getTaskList() already return the list of active tasks only?
Or can it include active tasks from other stages too?
If yes, can we be sure of the order of the items in the returned list such that subList always works as expected?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can be sure that getTaskList() is active tasks only.
It's kind of a coincidence that the moment you commented, I changed this logic to a less brittle one.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, @cryptoe , the new method does seem better. I would also suggest just doing the filtering of workerTaskIds in the controller itself rather than passing an extra argument to the WorkerSketchFetcher just to filter the task ids later. Then the logic in the WorkerSketchFetcher wouldn't have to change and it wouldn't have to be aware of the worker indexes (which seems like an impl detail of the controller).
Hope this makes sense.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this logic would change in #13353 hence added a filtering step.
As we cannot use intstream we still have to change the workerSketcherFetcher

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@adarshsanjeev
Copy link
Copy Markdown
Contributor Author

Changes LGTM, thanks for taking up this PR!

@cryptoe cryptoe merged commit 2b605aa into apache:master Dec 15, 2022
@cryptoe
Copy link
Copy Markdown
Contributor

cryptoe commented Dec 15, 2022

Failures look unrelated.

cryptoe added a commit to cryptoe/druid that referenced this pull request Dec 15, 2022
* Add validation checks to worker chat handler apis

* Merge things and polishing the error messages.

* Minor error message change

* Fixing race and adding some tests

* Fixing controller fetching stats from wrong workers.
Fixing race
Changing default mode to Parallel
Adding logging.
Fixing exceptions not propagated properly.

* Changing to kernel worker count

* Added a better logic to figure out assigned worker for a stage.

* Nits

* Moving to existing kernel methods

* Adding more coverage

Co-authored-by: cryptoe <karankumar1100@gmail.com>
(cherry picked from commit 2b605aa)
cryptoe added a commit that referenced this pull request Dec 15, 2022
* Add validation checks to worker chat handler apis

* Merge things and polishing the error messages.

* Minor error message change

* Fixing race and adding some tests

* Fixing controller fetching stats from wrong workers.
Fixing race
Changing default mode to Parallel
Adding logging.
Fixing exceptions not propagated properly.

* Changing to kernel worker count

* Added a better logic to figure out assigned worker for a stage.

* Nits

* Moving to existing kernel methods

* Adding more coverage

Co-authored-by: cryptoe <karankumar1100@gmail.com>
(cherry picked from commit 2b605aa)

Co-authored-by: Adarsh Sanjeev <adarshsanjeev@gmail.com>
@cryptoe cryptoe changed the title Multiple fixes for the MSQ stats merging piece which Multiple fixes for the MSQ stats merging piece Dec 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants