Add support for selective loading of broadcast datasources in the task layer #17027

Merged
abhishekrb19 merged 12 commits into apache:master from abhishekrb19:selective-loading-broadcast-ds
Sep 12, 2024

Conversation

Contributor

@abhishekrb19 abhishekrb19 commented Sep 10, 2024

Motivation

Currently, servers and tasks download all broadcast datasources. Downloading bootstrap segments is generally fast because it happens in parallel during startup, but it does consume storage space. This matters most for tasks such as kill tasks and MSQ worker tasks, which either need no broadcast datasources at all or need only a specific subset. This optimization saves task storage space and slightly reduces task startup time.

Description

The design for loading broadcast datasources follows a similar approach to what was implemented for loading lookups in #16328.

Broadcast datasources can be specified in SQL queries through JOIN and FROM clauses, or obtained from other sources such as lookups. To this effect, we have introduced a BroadcastDatasourceLoadingSpec. Finding the set of broadcast datasources during SQL planning will be done in a follow-up that applies only to MSQ tasks, so that they load only the required broadcast datasources. This PR primarily focuses on the skeletal changes around BroadcastDatasourceLoadingSpec and on integrating it from the Task interface via CliPeon to SegmentBootstrapper.

Changes

  • Added the BroadcastDatasourceLoadingSpec class and the Task.getBroadcastDatasourceLoadingSpec() method to control how tasks load broadcast datasources.
    • The supported broadcast datasource loading modes are ALL, NONE and ONLY_REQUIRED.
    • By default, all tasks will load all broadcast datasources.
    • The KillTask and MSQControllerTask override getBroadcastDatasourceLoadingSpec() to NONE.
  • The CliPeon initializes the broadcast module by parsing the command line option --loadBroadcastDatasourceMode.
  • The CliPeon option --loadBroadcastSegments has been marked as deprecated in favor of --loadBroadcastDatasourceMode. The deprecated option can be removed in a future release.
  • The PodTemplateTaskAdapter and K8sTaskAdapter support wiring for the option LOAD_BROADCAST_DATASOURCE_MODE, similar to existing LOAD_BROADCAST_SEGMENTS.
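
As a rough illustration of the shape these changes take, here is a minimal sketch of a loading spec with the three modes, a task interface that defaults to ALL, and a kill-style task that overrides it to NONE. All class and method names below (SketchLoadingSpec, SketchTask, SketchKillTask, SketchIndexTask, loadOnly) are hypothetical stand-ins chosen for illustration; they are not the actual Druid classes, whose fields and factory methods may differ.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a broadcast datasource loading spec; not the Druid class.
final class SketchLoadingSpec {
  enum Mode { ALL, NONE, ONLY_REQUIRED }

  static final SketchLoadingSpec ALL =
      new SketchLoadingSpec(Mode.ALL, Collections.<String>emptySet());
  static final SketchLoadingSpec NONE =
      new SketchLoadingSpec(Mode.NONE, Collections.<String>emptySet());

  private final Mode mode;
  private final Set<String> datasourcesToLoad;

  private SketchLoadingSpec(Mode mode, Set<String> datasourcesToLoad) {
    this.mode = mode;
    this.datasourcesToLoad = datasourcesToLoad;
  }

  // ONLY_REQUIRED is meaningful only with an explicit, non-empty set of datasources.
  static SketchLoadingSpec loadOnly(Set<String> datasources) {
    if (datasources == null || datasources.isEmpty()) {
      throw new IllegalArgumentException("ONLY_REQUIRED needs at least one datasource");
    }
    return new SketchLoadingSpec(
        Mode.ONLY_REQUIRED,
        Collections.unmodifiableSet(new HashSet<>(datasources))
    );
  }

  Mode getMode() {
    return mode;
  }

  Set<String> getDatasourcesToLoad() {
    return datasourcesToLoad;
  }
}

// Hypothetical task interface: every task loads everything unless it overrides the default.
interface SketchTask {
  default SketchLoadingSpec getBroadcastDatasourceLoadingSpec() {
    return SketchLoadingSpec.ALL;
  }
}

// A task that never reads broadcast data opts out entirely.
final class SketchKillTask implements SketchTask {
  @Override
  public SketchLoadingSpec getBroadcastDatasourceLoadingSpec() {
    return SketchLoadingSpec.NONE;
  }
}

// A task with no override inherits the ALL default.
final class SketchIndexTask implements SketchTask {
}
```

Putting the default on the interface means existing tasks keep their current behavior without any change, and only tasks that are known not to need broadcast data have to opt out.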

In summary:

  • Data servers load all broadcast datasources.
  • By default, all tasks load all broadcast datasources unless an override exists.
  • MSQ Controller and Kill tasks do not load any broadcast datasources.
  • In a follow-up patch, MSQ ingest and query tasks will selectively load only the segments of the broadcast datasources determined during query planning.
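
The behavior summarized above can be sketched as a small selection function: given a loading mode, the set of required datasources, and the broadcast datasources available to a process, it returns the ones that would be loaded. This is only an illustration of the three modes under assumed names (BroadcastSelectionSketch, datasourcesToLoad, and the example datasource names are hypothetical); it does not reflect how SegmentBootstrapper is actually implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the ALL / NONE / ONLY_REQUIRED semantics.
final class BroadcastSelectionSketch {
  enum Mode { ALL, NONE, ONLY_REQUIRED }

  // Returns the subset of available broadcast datasources that the given mode would load.
  static List<String> datasourcesToLoad(Mode mode, Set<String> required, List<String> available) {
    switch (mode) {
      case ALL:
        // Data servers and tasks without an override: load everything.
        return new ArrayList<>(available);
      case NONE:
        // Kill and MSQ controller tasks: load nothing.
        return Collections.emptyList();
      case ONLY_REQUIRED: {
        // Planned for MSQ worker tasks: keep only what the query needs.
        List<String> selected = new ArrayList<>();
        for (String datasource : available) {
          if (required.contains(datasource)) {
            selected.add(datasource);
          }
        }
        return selected;
      }
      default:
        throw new IllegalStateException("Unknown mode: " + mode);
    }
  }
}
```

For example, with available datasources dim_country, dim_device, and dim_user and a required set containing only dim_device, ONLY_REQUIRED selects just dim_device, while NONE selects nothing regardless of the required set.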

Release note:

Tasks control the loading of broadcast datasources via BroadcastDatasourceLoadingSpec getBroadcastDatasourceLoadingSpec(). By default, tasks download all broadcast datasources unless there is an override, as with kill and MSQ controller tasks.

The CliPeon command line option --loadBroadcastSegments is deprecated in favor of --loadBroadcastDatasourceMode.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added the labels Area - Batch Ingestion, Kubernetes, Area - Ingestion, and Area - MSQ (For multi stage queries - https://github.com/apache/druid/issues/12262) Sep 10, 2024
Comment thread indexing-service/src/main/java/org/apache/druid/indexing/common/task/Task.java Outdated
@abhishekrb19
Contributor Author

The failing UTs are unrelated: a test in another module failed with OOM (under investigation with heap dumps from #17029), and a CDS test is flaky.

@abhishekrb19 abhishekrb19 merged commit 5ef94c9 into apache:master Sep 12, 2024
@abhishekrb19 abhishekrb19 deleted the selective-loading-broadcast-ds branch September 12, 2024 17:30
cecemei pushed a commit to cecemei/druid that referenced this pull request Sep 12, 2024
…k layer (apache#17027)

Tasks control the loading of broadcast datasources via BroadcastDatasourceLoadingSpec getBroadcastDatasourceLoadingSpec(). By default, tasks download all broadcast datasources unless there is an override, as with kill and MSQ controller tasks.

The CliPeon command line option --loadBroadcastSegments is deprecated in favor of --loadBroadcastDatasourceMode.

Broadcast datasources can be specified in SQL queries through JOIN and FROM clauses, or obtained from other sources such as lookups. To this effect, we have introduced a BroadcastDatasourceLoadingSpec. Finding the set of broadcast datasources during SQL planning will be done in a follow-up that applies only to MSQ tasks, so that they load only the required broadcast datasources. This PR primarily focuses on the skeletal changes around BroadcastDatasourceLoadingSpec and on integrating it from the Task interface via CliPeon to SegmentBootstrapper.

Currently, only kill tasks and MSQ controller tasks skip loading broadcast datasources.
pranavbhole pushed a commit to pranavbhole/druid that referenced this pull request Sep 17, 2024
…k layer (apache#17027)
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 1, 2024
kfaraz pushed a commit to kfaraz/druid that referenced this pull request Oct 1, 2024
…k layer (apache#17027)
kfaraz added a commit that referenced this pull request Oct 1, 2024
…k layer (#17027) (#17206)

Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>