Load only the required lookups for MSQ tasks #16358

Merged
cryptoe merged 14 commits into apache:master from Akshat-Jain:load-selective-lookups-for-msq-tasks
May 9, 2024

Contributor

@Akshat-Jain Akshat-Jain commented Apr 30, 2024

Description

With this PR, MSQ tasks (MSQControllerTask and MSQWorkerTask) load only the required lookups during querying and ingestion, based on the value of the CTX_LOOKUPS_TO_LOAD key in the query context.

Test plan

Apart from adding unit tests, the following manual testing was done for the different lookup loading modes.

Lookup loading mode = ONLY_REQUIRED

I verified that only the required lookups were being loaded in the following operations:

  1. MSQ query using a non-reversible lookup in the LOOKUP() function.
select * from "druid"."test-ds-1" where LOOKUP(name, 'lookupname') is not null
  2. MSQ query where a lookup table is used as a datasource.
SELECT * FROM "lookup"."lookupname"
  3. MSQ ingestion which references lookups.
REPLACE INTO "inline_data" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"time\": \"2015-09-12T00:46:58.771Z\", \"name\": \"Adarsh\", \"rollNumber\": 1, \"country\": \"India\", \"age\": 20, \"grade\": \"A\"}\n{\"time\": \"2015-09-12T00:46:58.771Z\", \"name\": \"Ajith\", \"rollNumber\": 2, \"country\": \"India\", \"age\": 22, \"grade\": \"B\"}\n{\"time\": \"2015-09-13T00:46:58.771Z\", \"name\": \"Akshat\", \"rollNumber\": 3, \"country\": \"India\", \"age\": 24, \"grade\": \"A\"}\n{\"time\": \"2015-09-14T00:46:58.771Z\", \"name\": \"Amit\", \"rollNumber\": 4, \"country\": \"India\", \"age\": 26, \"grade\": \"C\"}\n{\"time\": \"2015-09-14T00:46:58.771Z\", \"name\": \"Ankit Kumar\", \"rollNumber\": 5, \"country\": \"India\", \"age\": 26, \"grade\": \"D\"}\n{\"time\": \"2015-09-15T00:46:58.771Z\", \"name\": \"Ankit Singh\", \"rollNumber\": 6, \"country\": \"India\", \"age\": 26, \"grade\": \"E\"}"}',
      '{"type":"json"}'
    )
  ) EXTEND ("time" VARCHAR, "name" VARCHAR, "rollNumber" BIGINT, "country" VARCHAR, "age" BIGINT, "grade" VARCHAR)
)
SELECT
  TIME_PARSE("time") AS "__time",
  "name",
  "rollNumber",
  "country",
  "age",
  "grade",
  (select LOOKUP('name', 'lookupname')) AS "new_column"
FROM "ext"
PARTITIONED BY DAY
  4. Join query using lookups.
SELECT * FROM "test-ds-1"
JOIN
"lookup"."lookupname"
on "test-ds-1"."name" = "lookupname"."k"

Lookup loading mode = NONE

I verified that no lookups were being loaded in the following operations:

  1. MSQ query using a reversible lookup in the LOOKUP() function.
select * from "druid"."test-ds-1" where LOOKUP(name, 'lookupname') = '1'
  2. MSQ query that doesn't reference lookups.
select * from "druid"."test-ds-1"
  3. MSQ ingestion which doesn't reference lookups.
REPLACE INTO "inline_data_2" OVERWRITE ALL
WITH "ext" AS (
  SELECT *
  FROM TABLE(
    EXTERN(
      '{"type":"inline","data":"{\"time\": \"2015-09-12T00:46:58.771Z\", \"name\": \"Adarsh\", \"rollNumber\": 1, \"country\": \"India\", \"age\": 20, \"grade\": \"A\"}\n{\"time\": \"2015-09-12T00:46:58.771Z\", \"name\": \"Ajith\", \"rollNumber\": 2, \"country\": \"India\", \"age\": 22, \"grade\": \"B\"}\n{\"time\": \"2015-09-13T00:46:58.771Z\", \"name\": \"Akshat\", \"rollNumber\": 3, \"country\": \"India\", \"age\": 24, \"grade\": \"A\"}\n{\"time\": \"2015-09-14T00:46:58.771Z\", \"name\": \"Amit\", \"rollNumber\": 4, \"country\": \"India\", \"age\": 26, \"grade\": \"C\"}\n{\"time\": \"2015-09-14T00:46:58.771Z\", \"name\": \"Ankit Kumar\", \"rollNumber\": 5, \"country\": \"India\", \"age\": 26, \"grade\": \"D\"}\n{\"time\": \"2015-09-15T00:46:58.771Z\", \"name\": \"Ankit Singh\", \"rollNumber\": 6, \"country\": \"India\", \"age\": 26, \"grade\": \"E\"}"}',
      '{"type":"json"}'
    )
  ) EXTEND ("time" VARCHAR, "name" VARCHAR, "rollNumber" BIGINT, "country" VARCHAR, "age" BIGINT, "grade" VARCHAR)
)
SELECT
  TIME_PARSE("time") AS "__time",
  "name",
  "rollNumber",
  "country",
  "age",
  "grade"
FROM "ext"
PARTITIONED BY DAY

Lookup loading mode = ALL

These are operations that aren't touched by this PR. I verified that all lookups were being loaded in the following operations:

  1. Non-MSQ batch ingestion.
  2. Tasks like compaction.
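
Editorial sketch: the three loading modes exercised in the test plan can be pictured as a filter over the lookups configured on the cluster. The class below is a hypothetical, simplified stand-in written for illustration; `LookupModeSketch` and `lookupsToLoad` are made-up names, and this is not Druid's actual implementation.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for illustration only; not Druid's real classes.
final class LookupModeSketch {
  enum Mode { ALL, NONE, ONLY_REQUIRED }

  /** Returns the subset of configured lookups that a task would load under the given mode. */
  static List<String> lookupsToLoad(Mode mode, Set<String> required, List<String> configured) {
    switch (mode) {
      case ALL:
        // Non-MSQ batch ingestion, compaction, etc.: load everything.
        return List.copyOf(configured);
      case NONE:
        // Query or ingestion that references no lookups: load nothing.
        return List.of();
      case ONLY_REQUIRED:
        // Load only the lookups the query actually references.
        return configured.stream().filter(required::contains).collect(Collectors.toList());
      default:
        throw new IllegalStateException("Unknown mode: " + mode);
    }
  }

  public static void main(String[] args) {
    List<String> configured = List.of("lookupname", "country_codes", "grades");
    System.out.println(lookupsToLoad(Mode.ONLY_REQUIRED, Set.of("lookupname"), configured));
    System.out.println(lookupsToLoad(Mode.NONE, Set.of(), configured));
    System.out.println(lookupsToLoad(Mode.ALL, Set.of(), configured));
  }
}
```

The lookup names other than `lookupname` are invented for the example.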

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added the labels Area - Batch Ingestion, Area - Querying, and Area - MSQ (for multi-stage queries: https://github.com/apache/druid/issues/12262) on Apr 30, 2024
@Akshat-Jain Akshat-Jain marked this pull request as draft April 30, 2024 13:59
@Akshat-Jain Akshat-Jain force-pushed the load-selective-lookups-for-msq-tasks branch from eb435a5 to 32c1fa4 Compare May 2, 2024 08:23
@Akshat-Jain Akshat-Jain changed the title Load selective lookups for MSQ tasks Load only the required lookups for MSQ tasks May 2, 2024
@Akshat-Jain Akshat-Jain marked this pull request as ready for review May 2, 2024 10:04
Contributor

@kfaraz kfaraz left a comment

Thanks for the changes, @Akshat-Jain ! Left some feedback.

@Akshat-Jain Akshat-Jain force-pushed the load-selective-lookups-for-msq-tasks branch from ff669a7 to 0149079 Compare May 3, 2024 09:13
@Akshat-Jain
Contributor Author

Based on offline discussion, I have modified the approach to populate a new field Set<String> lookupsToLoad in PlannerContext, instead of populating the queryContext. The new field is then used to pass info directly to MSQControllerTask when initializing it in MSQTaskQueryMaker.

The primary rationale behind this change is that it allows us to limit the scope of our changes to MSQ, since queryContext is used widely across the codebase.

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerContext.java Outdated
@Akshat-Jain Akshat-Jain requested review from cryptoe and kfaraz May 6, 2024 07:18
@Akshat-Jain Akshat-Jain force-pushed the load-selective-lookups-for-msq-tasks branch from a2727b1 to 191eb42 Compare May 7, 2024 06:16
  return LookupLoadingSpec.NONE;
} else if (lookupLoadingMode == LookupLoadingSpec.Mode.ONLY_REQUIRED) {
  List<String> lookupsToLoad = (List<String>) getContext().get(PlannerContext.CTX_LOOKUPS_TO_LOAD);
  if (lookupsToLoad == null) {
Contributor

We should check for empty too.

Contributor Author

Technically ONLY_REQUIRED can use an empty list, so it's fine?
We could technically merge NONE and ONLY_REQUIRED from a functional point of view, but I like having them separate from a user consumption point of view.
Thoughts?

Contributor

Yes, exactly. Since we have decided to keep them distinct, ONLY_REQUIRED must always have a non-empty list.

The other alternative would be to get rid of the enum value NONE altogether and then the lists can be empty. But, as you said, I like having NONE too, as it clarifies the intent much better.

Contributor Author

Have made the change.
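
Editorial sketch: the outcome of this thread (ONLY_REQUIRED must come with a non-empty list, while NONE stays a separate mode) could look roughly like the following. Everything here is a hypothetical stand-in: the class and method names are invented, the string values of the two context keys are assumed, and treating a missing mode as ALL is an assumption of this sketch rather than something the thread states.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the code under review.
final class LookupSpecFromContextSketch {
  enum Mode { ALL, NONE, ONLY_REQUIRED }

  static final String CTX_LOOKUP_LOADING_MODE = "lookupLoadingMode"; // assumed key value
  static final String CTX_LOOKUPS_TO_LOAD = "lookupsToLoad";         // assumed key value

  static final class Spec {
    final Mode mode;
    final Set<String> lookupsToLoad;

    Spec(Mode mode, Set<String> lookupsToLoad) {
      this.mode = mode;
      this.lookupsToLoad = lookupsToLoad;
    }
  }

  static Spec fromContext(Map<String, Object> context) {
    Object rawMode = context.get(CTX_LOOKUP_LOADING_MODE);
    if (rawMode == null) {
      // Assumption of this sketch: no mode in the context means "load everything".
      return new Spec(Mode.ALL, Collections.emptySet());
    }
    Mode mode = Mode.valueOf(rawMode.toString());
    if (mode != Mode.ONLY_REQUIRED) {
      return new Spec(mode, Collections.emptySet());
    }
    Object raw = context.get(CTX_LOOKUPS_TO_LOAD);
    // Reject both a missing and an empty list, as requested in the review.
    if (!(raw instanceof Collection) || ((Collection<?>) raw).isEmpty()) {
      throw new IllegalArgumentException(
          "Mode ONLY_REQUIRED needs a non-empty '" + CTX_LOOKUPS_TO_LOAD + "' list");
    }
    Set<String> lookups = new LinkedHashSet<>();
    for (Object name : (Collection<?>) raw) {
      lookups.add(String.valueOf(name));
    }
    return new Spec(mode, Collections.unmodifiableSet(lookups));
  }

  public static void main(String[] args) {
    Spec spec = fromContext(Map.of(
        CTX_LOOKUP_LOADING_MODE, "ONLY_REQUIRED",
        CTX_LOOKUPS_TO_LOAD, List.of("lookupname")));
    System.out.println(spec.mode + " " + spec.lookupsToLoad);
  }
}
```

Failing fast on an empty list keeps the two modes distinct: a caller that wants no lookups has to say NONE explicitly.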

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerContext.java Outdated
Comment thread sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerContext.java Outdated
}

/**
* Returns the lookup loading spec for a given task.
Contributor

Suggested change
- * Returns the lookup loading spec for a given task.
+ * Lookup loading spec for MSQ tasks.

Contributor Author

It's not limited to MSQ tasks though? Right now we consume it only in MSQ, but nothing prevents other tasks from using it in the future.

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerContext.java Outdated
@Akshat-Jain
Contributor Author

@kfaraz @cryptoe Have made some changes in the latest commit bb4a093. Summarizing them here:

  1. Use mutable Set as the class variable in LookupLoadingSpec. LookupLoadingSpec#getLookupsToLoad still returns an immutable copy of the set of lookups.
  2. Add LookupLoadingSpec#addLookupToLoad method to abstract away this logic from PlannerContext, and to avoid creating a new instance of LookupLoadingSpec every time we have to add a new lookup for loading in PlannerContext. To achieve this, I had to make the fields non-final in LookupLoadingSpec.
  3. Add LookupLoadingSpec#createSpecFromMode method to create a new/different instance of LookupLoadingSpec for every instance of PlannerContext. Previously it was defaulting to LookupLoadingSpec.NONE, which is incorrect as it points to the same instance of LookupLoadingSpec across different instances of PlannerContext.

Hope this works? Appreciate your thoughts on these changes, thanks!

Contributor

@adarshsanjeev adarshsanjeev left a comment

The Calcite-side changes look good to me.

@kfaraz
Contributor

kfaraz commented May 7, 2024

> Previously it was defaulting to LookupLoadingSpec.NONE, which is incorrect as it points to the same instance of LookupLoadingSpec across different instances of PlannerContext.

Why is this a problem? There is no harm in pointing to the same instance if the instance is immutable.
It was written this way on purpose. We don't need the method LookupLoadingSpec.createSpecFromMode().

Comment thread server/src/main/java/org/apache/druid/server/lookup/cache/LookupLoadingSpec.java Outdated
@Akshat-Jain
Contributor Author

> Why is this a problem? There is no harm in pointing to the same instance if the instance is immutable.

@kfaraz We can't add any lookups to load to LookupLoadingSpec.NONE as that ends up updating the constant itself. What's the suggestion to deal with such issues?

@Akshat-Jain
Contributor Author

@kfaraz Have made the change to use Set<String> in PlannerContext to store the lookups to load. I'm assuming that your previous comment was based on this suggestion, please let me know if I'm misunderstanding something. 😅
Thanks!

@kfaraz
Contributor

kfaraz commented May 7, 2024

> @kfaraz Have made the change to use Set<String> in PlannerContext to store the lookups to load. I'm assuming that your previous comment was based on this suggestion, please let me know if I'm misunderstanding something. 😅

Yes, this is what I had meant 🙂 .
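
Editorial sketch: the design this exchange converges on keeps the spec immutable, so a shared NONE constant is safe, and does the accumulation in a plain mutable `Set<String>` owned by each planner context. The classes below are hypothetical stand-ins (`PlannerState`, `Spec`, `loadOnly`, and `toSpec` are invented names), and mapping an empty set to NONE is an assumption of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; not Druid's real classes.
final class ImmutableSpecSketch {
  enum Mode { ALL, NONE, ONLY_REQUIRED }

  static final class Spec {
    // Safe to share across planner contexts because a Spec can never change.
    static final Spec NONE = new Spec(Mode.NONE, Set.of());

    private final Mode mode;
    private final Set<String> lookupsToLoad;

    private Spec(Mode mode, Set<String> lookupsToLoad) {
      this.mode = mode;
      this.lookupsToLoad = Set.copyOf(lookupsToLoad); // unmodifiable defensive copy
    }

    static Spec loadOnly(Set<String> lookupsToLoad) {
      if (lookupsToLoad == null || lookupsToLoad.isEmpty()) {
        throw new IllegalArgumentException("ONLY_REQUIRED needs at least one lookup");
      }
      return new Spec(Mode.ONLY_REQUIRED, lookupsToLoad);
    }

    Mode getMode() {
      return mode;
    }

    Set<String> getLookupsToLoad() {
      return lookupsToLoad;
    }
  }

  /** Stand-in for the planner context: it owns the mutable set, the spec stays immutable. */
  static final class PlannerState {
    private final Set<String> lookupsToLoad = new HashSet<>();

    void addLookupToLoad(String lookupName) {
      lookupsToLoad.add(lookupName);
    }

    // Assumption of this sketch: no collected lookups maps to the shared NONE constant.
    Spec toSpec() {
      return lookupsToLoad.isEmpty() ? Spec.NONE : Spec.loadOnly(lookupsToLoad);
    }
  }

  public static void main(String[] args) {
    PlannerState first = new PlannerState();
    PlannerState second = new PlannerState();
    first.addLookupToLoad("lookupname");
    System.out.println(first.toSpec().getMode() + " " + first.toSpec().getLookupsToLoad());
    System.out.println(second.toSpec().getMode());
  }
}
```

Adding a lookup in one planner context cannot leak into another, and the constant cannot be modified by accident, which removes the need for a per-context factory method.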

Contributor

@cryptoe cryptoe left a comment

Minor changes.
Overall LGTM
Thanks @Akshat-Jain for the patch.

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerContext.java Outdated
Contributor

@kfaraz kfaraz left a comment

Changes look good to me.

My only concern is why we can't use Query.equals() in the test assertions anymore. All other comments are non-blockers and may be addressed later.

Comment on lines +1047 to +1059
  List<String> lookupsToLoad = (List<String>) context.get(PlannerContext.CTX_LOOKUPS_TO_LOAD);
  if (expectedLookupLoadingSpec != null) {
    Assert.assertEquals(expectedLookupLoadingSpec.getMode().toString(), lookupLoadingMode);
    if (expectedLookupLoadingSpec.getMode().equals(LookupLoadingSpec.Mode.ONLY_REQUIRED)) {
      Assert.assertEquals(new ArrayList<>(expectedLookupLoadingSpec.getLookupsToLoad()), lookupsToLoad);
    } else {
      Assert.assertNull(lookupsToLoad);
    }
  } else {
    Assert.assertEquals(LookupLoadingSpec.Mode.NONE.toString(), lookupLoadingMode);
    Assert.assertNull(lookupsToLoad);
  }
}
Contributor

Rather than this if-else chain, we can just build a LookupLoadingSpec object from the values in the context and then do an equals check with the expected lookup loading spec. You would also need to override equals and hashCode in LookupLoadingSpec.

Contributor Author

I can try taking care of this in the next PR with compaction changes, hope that works.

Contributor

Absolutely, thanks!
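
Editorial sketch: the follow-up suggested in this thread (build a spec from the context values, then compare with a single equals check) might look roughly like this once `equals` and `hashCode` are overridden. The class is a hypothetical stand-in, `fromContextValues` is an invented helper, and a plain check replaces the JUnit assertion so the example has no external dependencies.

```java
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for illustration only; not Druid's real LookupLoadingSpec.
final class SpecEqualitySketch {
  enum Mode { ALL, NONE, ONLY_REQUIRED }

  static final class Spec {
    private final Mode mode;
    private final Set<String> lookupsToLoad;

    Spec(Mode mode, Set<String> lookupsToLoad) {
      this.mode = mode;
      this.lookupsToLoad = Set.copyOf(lookupsToLoad);
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof Spec)) {
        return false;
      }
      Spec that = (Spec) o;
      return mode == that.mode && lookupsToLoad.equals(that.lookupsToLoad);
    }

    @Override
    public int hashCode() {
      return Objects.hash(mode, lookupsToLoad);
    }

    @Override
    public String toString() {
      return "Spec{mode=" + mode + ", lookupsToLoad=" + lookupsToLoad + "}";
    }
  }

  /** Invented helper: rebuilds a spec from the two values read out of the task context. */
  static Spec fromContextValues(String lookupLoadingMode, List<String> lookupsToLoad) {
    Set<String> lookups = lookupsToLoad == null ? Set.of() : Set.copyOf(lookupsToLoad);
    return new Spec(Mode.valueOf(lookupLoadingMode), lookups);
  }

  public static void main(String[] args) {
    Spec expected = new Spec(Mode.ONLY_REQUIRED, Set.of("lookupname"));
    Spec actual = fromContextValues("ONLY_REQUIRED", List.of("lookupname"));
    // One equality check replaces the nested if-else chain of assertions.
    if (!expected.equals(actual)) {
      throw new AssertionError("expected " + expected + " but got " + actual);
    }
    System.out.println("specs match: " + actual);
  }
}
```

Comparing sets rather than lists also makes the check independent of the order in which lookups were collected.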

@cryptoe cryptoe merged commit 775d654 into apache:master May 9, 2024
gianm pushed a commit to gianm/druid that referenced this pull request May 10, 2024
With this PR changes, MSQ tasks (MSQControllerTask and MSQWorkerTask) only load the required lookups during querying and ingestion, based on the value of CTX_LOOKUPS_TO_LOAD key in the query context.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024

Labels

Area - Batch Ingestion, Area - MSQ (for multi-stage queries: https://github.com/apache/druid/issues/12262), Area - Querying

6 participants