Skip to content

Fixing an issue in sequential merge #14574

Merged
cryptoe merged 3 commits intoapache:masterfrom
cryptoe:sequential_bug_fix_1
Jul 12, 2023
Merged

Fixing an issue in sequential merge #14574
cryptoe merged 3 commits intoapache:masterfrom
cryptoe:sequential_bug_fix_1

Conversation

@cryptoe
Copy link
Copy Markdown
Contributor

@cryptoe cryptoe commented Jul 12, 2023

Fixing an issue in sequential merge where workers without any partial key statistics would get stuck because controller did not change the worker state.

The symptom's are the controller task getting stuck after some of the worker boundaries are sent.

2023-07-12T04:20:46,033 INFO [task-runner-0-priority-0] org.apache.druid.msq.exec.ControllerImpl - Query [xxx] sending out partition boundaries for stage 0 for workers {7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 34, 35, 36}

Key changed/added classes in this PR
  • WorkerSketchFetcher

This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

… key statistics would get stuck because controller did not change the worker state.
@cryptoe cryptoe added Bug Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Jul 12, 2023
@cryptoe cryptoe added this to the 27.0 milestone Jul 12, 2023
Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the controller determining that no boundary received means an empty partition, I think it would be more appropriate if this logic is built into the worker. If the output row is none, it would report ClusterByStatisticsSnapshot.empty(), then it should report the empty partition.
Seems cleaner to me because, in that way, we are verifying that the output rows are 0 on the worker, WDYT?

@LakshSingla
Copy link
Copy Markdown
Contributor

Can we add a test in MSQSelectTests that verify that this is working as expected?

@cryptoe
Copy link
Copy Markdown
Contributor Author

cryptoe commented Jul 12, 2023

@LakshSingla Thanks for the review.

Instead of the controller determining that no boundary received means an empty partition, I think it would be more appropriate if this logic is built into the worker. If the output row is none, it would report ClusterByStatisticsSnapshot.empty(), then it should report the empty partition.
Seems cleaner to me because, in that way, we are verifying that the output rows are 0 on the worker, WDYT

I thought about this. The trickiness lies in how we calculate CompleteKeyStatisticsInformation if the worker does not have any row, it might return an empty snapshot but still in sequential merge, we need a timebucket <-> worker to transition the worker to the next stage.
We would end up maintaining a custom "worker Map" in complete key stats information for such snapshots which is no better to what I am doing . Hence I decided against that approach.

Can we add a test in MSQSelectTests that verify that this is working as expected?

Since we donot gather stats in select q's I cannot do this via MSQSelectTests.
For MSQInsertTest, the test case framework currently supports only one worker, I cannot stimulate a test case where one worker has data and the other does not without running into insert cannot be empty.

@LakshSingla
Copy link
Copy Markdown
Contributor

For MSQInsertTest, the test case framework currently supports only one worker, I cannot stimulate a test case where one worker has data and the other does not without running into insert cannot be empty.

I think @adarshsanjeev did make some progress in this department where we could simulate multiple workers easily. However, since this might be a blocker, we can punt it for later (though we should revisit this for sure).

Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/apache/druid/pull/14574/files#diff-81d9fa4072caddb39cb3bbccb4d0f8814c262dce52557b05c5e93f10f6c16142R248-L261

I think we can remove this special handling since we are addressing this now more generally, so we won't need it.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jul 12, 2023

Thanks for the fix! Could there be a test for this?

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jul 12, 2023

Nevermind, I just noticed the discussion above about testing. This patch LGTM but let's try to get multi-worker unit testing going as soon as we can.

@cryptoe
Copy link
Copy Markdown
Contributor Author

cryptoe commented Jul 12, 2023

@gianm @LakshSingla I have added an IT for the same.

if (sqlTaskStatus.getState().isFailure()) {
Assert.fail(StringUtils.format(
"Unable to start the task successfully.\nPossible exception: %s",
sqlTaskStatus.getError()

Check notice

Code scanning / CodeQL

Use of default toString()

Default toString(): ErrorResponse inherits toString() from Object, and so is not suitable for printing.
@cryptoe
Copy link
Copy Markdown
Contributor Author

cryptoe commented Jul 12, 2023

Going ahead and merging this since the IT has passed :

[revised-its / it (8, MultiStageQuery, middleManager) / MultiStageQuery integration test (Compile=jdk8, Run=jdk8, Indexer=middleManager)](https://github.com/apache/druid/actions/runs/5533031309/jobs/10098107930#logs)

@cryptoe cryptoe merged commit 89aee6c into apache:master Jul 12, 2023
cryptoe added a commit that referenced this pull request Jul 12, 2023
* Fixing an issue in sequential merge where workers without any partial key statistics would get stuck because controller did not change the worker state.

* Removing empty check

* Adding IT for MSQ sequential bug fix.

(cherry picked from commit 89aee6c)
@LakshSingla
Copy link
Copy Markdown
Contributor

Thanks for the PR! 🚀

sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
* Fixing an issue in sequential merge where workers without any partial key statistics would get stuck because controller did not change the worker state.

* Removing empty check

* Adding IT for MSQ sequential bug fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants