Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby
Copy link
Collaborator

@sezruby sezruby commented Oct 21, 2020

What is the context for this pull request?

This PR is one possible implementation of the first approach in #222.

What changes were proposed in this pull request?

In order to optimize rule application process, this PR introduces IndexLogEntryTag which can store auxiliary
data while applying rules and defines a tag for Hybrid Scan:

  // HYBRIDSCAN_REQUIRED indicates if Hybrid Scan is required for this index or not.
  // This is set in getCandidateIndexes and utilized in transformPlanToUseIndex.
  val HYBRIDSCAN_REQUIRED: IndexLogEntryTag[Boolean] =
    IndexLogEntryTag[Boolean]("hybridScanRequired")

Tags are stored in each IndexLogEntry instances which can be accessed by indexManager.getIndexes.
Therefore the lifetime of each tag is the same of each index instance, which means if the indexManager returns cached index log entries, tags are available and reusable for other queries while the cached log entries are available.

ref) tag implementation from Spark 3.0 - https://github.com/apache/spark/blob/eb9966b70055a67dd02451c78ec205d913a38a42/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L93

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

@sezruby sezruby self-assigned this Oct 21, 2020
@sezruby sezruby added the enhancement New feature or request label Oct 21, 2020
@sezruby sezruby added this to the November 2020 milestone Oct 21, 2020
@imback82 imback82 modified the milestones: November 2020, October 2020 Oct 22, 2020
@sezruby
Copy link
Collaborator Author

sezruby commented Oct 24, 2020

@imback82 @pirz @apoorvedave1 Please review this change when you have the time. Thanks!

@imback82
Copy link
Contributor

Sorry for the delay. I will try to get to this this weekend.

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sezruby Could you rebase #164 against this PR? I want to check what the tag usage will be like before closing this one. Thanks!

Comment on lines 59 to 64
// Tags are used while applying rules.
// INDEX_HYBRIDSCAN_REQUIRED_TAG indicates if Hybrid Scan is required for this index or not.
// This is set in getCandidateIndexes and utilized in transformPlanToUseIndex.
val INDEX_HYBRIDSCAN_REQUIRED_TAG: IndexLogEntryTag[Boolean] =
IndexLogEntryTag[Boolean]("hybridScanRequired")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is getting out of control with random constants. Can you create IndexLogEntryTags and put tag related constants there? I should be able to use as IndexLogEntryTags.HYBRID_SCAN_REQUIRED, for example.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 I updated #164 based on this PR. Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 Could you check the change and merge this change? Thanks!

@rapoth rapoth modified the milestones: October 2020, November 2020 Oct 29, 2020
Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for few nits, thanks @sezruby!

object IndexLogEntryTags {
// INDEX_HYBRIDSCAN_REQUIRED_TAG indicates if Hybrid Scan is required for this index or not.
// This is set in getCandidateIndexes and utilized in transformPlanToUseIndex.
val INDEX_HYBRIDSCAN_REQUIRED_TAG: IndexLogEntryTag[Boolean] =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just call this HYBRIDSCAN_REQUIRED? IndexLogEntryTags.HYBRIDSCAN_REQUIRED sounds fine?

@imback82
Copy link
Contributor

@sezruby Can you update the description with the lifetime of tags with the current code? (e.g., per rule)

@sezruby
Copy link
Collaborator Author

sezruby commented Nov 13, 2020

@imback82 Updated pr description. Thanks!

@imback82
Copy link
Contributor

Tags are stored in each IndexLogEntry instances created by indexManager.getIndexes.
Therefore the lifetime of each tag is the same of each index instance which means tags are only usable each during rule application of each candidate plan (Filter, Join).

The index manager could be CachingIndexCollectionManager. In this case, tags would be shared across rules, etc., right? Is that the intention? (Sorry, I didn't fully understand this part: "tags are only usable each during rule application of each candidate plan (Filter, Join).")

@sezruby
Copy link
Collaborator Author

sezruby commented Nov 13, 2020

"each rule application" means if there are 3 filter nodes, there will be 3 tries of rule application. Tags are not shared between them because indexManager.getIndexes returns "cloned" index log entries.
So even it's not shared within a rule application in case Join left & right.

Sorry I don't know why I thought it's "cloned"; this's the reason I didn't use "df" as a part of key of the map at first. 😖
But anyway, if cached index entries are used, tags can be shared across rules and also queries! I think this is GREAT though it's not intended 😁

Then, should we add "max number of caching entires" or is it enough to rely on INDEX_CACHE_EXPIRY_DURATION_SECONDS (default 300s)? Or even we could reuse the index log entry (with tags) if there's no update on the index.
If needed, it would be better to work with another PR as this is a dependent pr for #164.

@imback82
Copy link
Contributor

Sorry I don't know why I thought it's "cloned"; this's the reason I didn't use "df" as a part of key of the map at first. 😖
But anyway, if cached index entries are used, tags can be shared across rules and also queries! I think this is GREAT though it's not intended 😁

Great. As long as the cached tags do not affect other rules, it's good for now.

Then, should we add "max number of caching entires" or is it enough to rely on INDEX_CACHE_EXPIRY_DURATION_SECONDS (default 300s)? Or even we could reuse the index log entry (with tags) if there's no update on the index.
If needed, it would be better to work with another PR as this is a dependent pr for #164.

Yea, let's think about this separately.

Merging to master!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants