Skip to content

Fix data race in getting results from MSQ select tasks.#16107

Merged
cryptoe merged 3 commits intoapache:masterfrom
cryptoe:sql_select_fix
Mar 13, 2024
Merged

Fix data race in getting results from MSQ select tasks.#16107
cryptoe merged 3 commits intoapache:masterfrom
cryptoe:sql_select_fix

Conversation

@cryptoe
Copy link
Copy Markdown
Contributor

@cryptoe cryptoe commented Mar 12, 2024

With #15880 going in, it exposed a potential data race where the MSQ controller would post finish to the workers but still fetch the results from them.
The change pushes the result fetching before we issue a worker finish.
As a result of this change, now we would materialize all the rows of the select result in the controller memory. This is okay when selectDestination=durableStorage but a little problematic when selectDestination=taskReport since the former has a limit of 3000 rows but the later can go wild.

The other way would be to hold of finishing the tasks till we write out the task report.

@github-actions github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Mar 12, 2024
resultsYielder.close();
}
catch (IOException e) {
throw new RuntimeException("Unable to fetch results of various worker tasks successfully", e);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should be DruidException

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably refactor this part of a another PR.

@gianm gianm added this to the 29.0.1 milestone Mar 12, 2024
Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I tagged this with 29.0.1 since it's a results truncation issue, which is something we should fix in the patch release. I would also support backporting #15880 to 29.0.1.

@cryptoe cryptoe merged commit 84c5098 into apache:master Mar 13, 2024
cryptoe added a commit to cryptoe/druid that referenced this pull request Mar 18, 2024
* Fix data race in getting results from MSQ select tasks.

* Add better logging

* Handling number overflow.
LakshSingla pushed a commit that referenced this pull request Mar 19, 2024
)

* Fix data race in getting results from MSQ select tasks.

* Add better logging

* Handling number overflow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants