Introduce IndexLogEntryTag for auxiliary data while applying rules #223

sezruby · 2020-10-21T04:54:03Z

What is the context for this pull request?

Tracking Issue: Introduce auxiliary data structure for applying rules #222
Parent Issue: n/a
Dependencies: n/a

This PR is one possible implementation of the first approach in #222.

What changes were proposed in this pull request?

In order to optimize rule application process, this PR introduces IndexLogEntryTag which can store auxiliary
data while applying rules and defines a tag for Hybrid Scan:

  // HYBRIDSCAN_REQUIRED indicates if Hybrid Scan is required for this index or not.
  // This is set in getCandidateIndexes and utilized in transformPlanToUseIndex.
  val HYBRIDSCAN_REQUIRED: IndexLogEntryTag[Boolean] =
    IndexLogEntryTag[Boolean]("hybridScanRequired")

Tags are stored in each IndexLogEntry instances which can be accessed by indexManager.getIndexes.
Therefore the lifetime of each tag is the same of each index instance, which means if the indexManager returns cached index log entries, tags are available and reusable for other queries while the cached log entries are available.

ref) tag implementation from Spark 3.0 - https://github.com/apache/spark/blob/eb9966b70055a67dd02451c78ec205d913a38a42/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L93

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

sezruby · 2020-10-24T01:03:42Z

@imback82 @pirz @apoorvedave1 Please review this change when you have the time. Thanks!

imback82 · 2020-10-24T01:09:10Z

Sorry for the delay. I will try to get to this this weekend.

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

src/test/scala/com/microsoft/hyperspace/index/rules/RuleUtilsTest.scala

imback82

@sezruby Could you rebase #164 against this PR? I want to check what the tag usage will be like before closing this one. Thanks!

imback82 · 2020-10-24T16:29:58Z

src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala

+  // Tags are used while applying rules.
+  // INDEX_HYBRIDSCAN_REQUIRED_TAG indicates if Hybrid Scan is required for this index or not.
+  // This is set in getCandidateIndexes and utilized in transformPlanToUseIndex.
+  val INDEX_HYBRIDSCAN_REQUIRED_TAG: IndexLogEntryTag[Boolean] =
+  IndexLogEntryTag[Boolean]("hybridScanRequired")
+


This file is getting out of control with random constants. Can you create IndexLogEntryTags and put tag related constants there? I should be able to use as IndexLogEntryTags.HYBRID_SCAN_REQUIRED, for example.

@imback82 I updated #164 based on this PR. Thanks!

@imback82 Could you check the change and merge this change? Thanks!

imback82

LGTM except for few nits, thanks @sezruby!

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

imback82 · 2020-11-13T05:15:06Z

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntryTags.scala

+object IndexLogEntryTags {
+  // INDEX_HYBRIDSCAN_REQUIRED_TAG indicates if Hybrid Scan is required for this index or not.
+  // This is set in getCandidateIndexes and utilized in transformPlanToUseIndex.
+  val INDEX_HYBRIDSCAN_REQUIRED_TAG: IndexLogEntryTag[Boolean] =


nit: just call this HYBRIDSCAN_REQUIRED? IndexLogEntryTags.HYBRIDSCAN_REQUIRED sounds fine?

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

imback82 · 2020-11-13T06:44:30Z

@sezruby Can you update the description with the lifetime of tags with the current code? (e.g., per rule)

sezruby · 2020-11-13T07:07:31Z

@imback82 Updated pr description. Thanks!

imback82 · 2020-11-13T07:26:42Z

Tags are stored in each IndexLogEntry instances created by indexManager.getIndexes.
Therefore the lifetime of each tag is the same of each index instance which means tags are only usable each during rule application of each candidate plan (Filter, Join).

The index manager could be CachingIndexCollectionManager. In this case, tags would be shared across rules, etc., right? Is that the intention? (Sorry, I didn't fully understand this part: "tags are only usable each during rule application of each candidate plan (Filter, Join).")

sezruby · 2020-11-13T08:23:28Z

"each rule application" means if there are 3 filter nodes, there will be 3 tries of rule application. Tags are not shared between them because indexManager.getIndexes returns "cloned" index log entries.
So even it's not shared within a rule application in case Join left & right.

Sorry I don't know why I thought it's "cloned"; this's the reason I didn't use "df" as a part of key of the map at first. 😖
But anyway, if cached index entries are used, tags can be shared across rules and also queries! I think this is GREAT though it's not intended 😁

Then, should we add "max number of caching entires" or is it enough to rely on INDEX_CACHE_EXPIRY_DURATION_SECONDS (default 300s)? Or even we could reuse the index log entry (with tags) if there's no update on the index.
If needed, it would be better to work with another PR as this is a dependent pr for #164.

imback82 · 2020-11-13T18:09:27Z

Sorry I don't know why I thought it's "cloned"; this's the reason I didn't use "df" as a part of key of the map at first. 😖
But anyway, if cached index entries are used, tags can be shared across rules and also queries! I think this is GREAT though it's not intended 😁

Great. As long as the cached tags do not affect other rules, it's good for now.

Then, should we add "max number of caching entires" or is it enough to rely on INDEX_CACHE_EXPIRY_DURATION_SECONDS (default 300s)? Or even we could reuse the index log entry (with tags) if there's no update on the index.
If needed, it would be better to work with another PR as this is a dependent pr for #164.

Yea, let's think about this separately.

Merging to master!

Introduce IndexLogEntryTag for auxiliary data while applying rules

0c66577

sezruby mentioned this pull request Oct 21, 2020

Fix rank algorithm for Hybrid Scan #164

Merged

sezruby requested review from apoorvedave1, imback82, pirz and rapoth October 21, 2020 04:56

sezruby self-assigned this Oct 21, 2020

sezruby added the enhancement New feature or request label Oct 21, 2020

sezruby added this to the November 2020 milestone Oct 21, 2020

imback82 modified the milestones: November 2020, October 2020 Oct 22, 2020

sezruby added 2 commits October 24, 2020 09:38

Merge remote-tracking branch 'upstream/master' into indexlogtag

dfff9dc

minor fix

4a45320