This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby (Collaborator) commented on Dec 11, 2020

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

What changes were proposed in this pull request?

Introduce new plan tags to avoid duplicating work while applying rules.

  • The `SIGNATURE_MATCHED` tag caches the result of `signatureValid` for each relation/index pair and reuses it when present. Since no config affects this result, it can be reused without any check.
  • The `IS_HYBRIDSCAN_CANDIDATE` tag caches the result of `isHybridScanCandidate` for each relation/index pair. Changing Hybrid Scan related configs can change this result, so before using the tag, its value is reset if the related configs have changed.
  • Related configs:
    • "spark.hyperspace.index.hybridscan.maxDeletedRatio"
    • "spark.hyperspace.index.hybridscan.maxAppendedRatio"
  • `HYBRIDSCAN_RELATED_CONFIGS` keeps the related config values (see the sketch below).
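
As a rough illustration of the tag-caching pattern (a sketch only: the wrapper name `cachedSignatureValid` is made up, while the `getTagValue`/`setTagValue` accessors match the snippets quoted in the review below):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Sketch: compute `signatureValid` once per (plan, index) and cache it in
    // a plan tag; later rule applications reuse the cached value.
    def cachedSignatureValid(index: IndexLogEntry, plan: LogicalPlan): Boolean = {
      index.getTagValue(plan, SIGNATURE_MATCHED) match {
        case Some(cached) => cached // no config affects this result, so reuse is safe
        case None =>
          val result = signatureValid(index, plan) // the expensive check, done once
          index.setTagValue(plan, SIGNATURE_MATCHED, result)
          result
      }
    }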
Test results

Test data

  • 100k chunk lineitem table
  • 1 non-candidate index (but with an applicable column schema, so that getCandidateIndexes performance can be measured)
  • unit: ms

Test scripts

val linetable = spark.read.parquet(tableName)

val filter = linetable.filter(linetable("l_orderkey") isin (1234, 12341234, 123456)).select("l_orderkey")
measure(filter.queryExecution.optimizedPlan)

val filter = linetable.filter(linetable("l_orderkey") isin (1234, 12341234, 123456)).select("l_orderkey")
measure(filter.queryExecution.optimizedPlan)

val filter = linetable.filter(linetable("l_orderkey") isin (1234, 12341234, 123456)).select("l_orderkey")
measure(filter.queryExecution.optimizedPlan)
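
The `measure` helper is not shown above; a plausible sketch (an assumption, not the author's actual helper) is a by-name timer, so that forcing the lazy `optimizedPlan`, where Hyperspace's rules run, happens inside the timed region:

    // Hypothetical `measure`: times evaluation of its by-name argument in ms.
    def measure[T](body: => T): T = {
      val start = System.currentTimeMillis()
      val result = body // forces `optimizedPlan`, triggering the optimizer rules
      println(s"duration: ${System.currentTimeMillis() - start}")
      result
    }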

Result w/o this change:

// Non Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1533
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1359
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1347

// Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 3505
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1314
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1392

Result with this change:

// Non Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1690
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 11
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 10

// Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1190
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 313 // cost from constructing the list of FileInfo for the given plan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 306

Does this PR introduce any user-facing change?

No, it's an internal optimization.

How was this patch tested?

Existing tests cover these changes.

@sezruby sezruby self-assigned this Dec 11, 2020
@sezruby sezruby added the `enhancement` label Dec 11, 2020
@imback82 (Contributor) commented:

> Result (only this PR - including InMemoryFileIndex cost)

So, what's the summary? I wasn't sure which number to look at to see the gain. Also, is the duration in milliseconds?

@sezruby (Collaborator, Author) commented on Jan 22, 2021

@imback82 I updated the test results with the two different binaries.
The previous result was incorrect because it included the cost of cachedIndexCollectionManager.

@imback82 (Contributor) commented:

Cool, the result makes more sense now. I will get to the review this weekend. @apoorvedave1 / @pirz can you review this PR in the meantime? Thanks!

@imback82 (Contributor) left a review:

General approach seems fine to me.


indexes.foreach { index =>
  val taggedConfigs = index.getTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE)
  if (taggedConfigs.isEmpty || !taggedConfigs.get.equals(curConfigs)) {

nit: prefer using `map` or `foreach` over the `taggedConfigs.isEmpty || taggedConfigs.get` pattern:

      index.getTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE).foreach { taggedConfigs =>
        if (!taggedConfigs.equals(curConfigs)) {
          // Need to reset cached tags as these config changes can change the result.
          index.unsetTagValue(plan, IS_HYBRIDSCAN_CANDIDATE)
          index.setTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE, curConfigs)
        }
      }
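
For completeness, a variant of this suggestion that also covers the first run, when no configs have been captured yet (a sketch; `Option.contains` folds the original `isEmpty || !equals` condition into one check):

    indexes.foreach { index =>
      val captured = index.getTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE)
      if (!captured.contains(curConfigs)) {
        // First run or config change: the cached candidacy may be stale.
        index.unsetTagValue(plan, IS_HYBRIDSCAN_CANDIDATE)
        index.setTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE, curConfigs)
      }
    }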

@imback82 (Contributor) commented:

  • "spark.hyperspace.index.hybridscan.delete.enabled"
  • "spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles"

Can you update these, which are not in the code base anymore?

    isCandidate
  }

def prepareHybridScanCandidateSelection(

Not related to this PR, but we need to think about refactoring RuleUtils.scala. This is already two levels deep (getCandidateIndexes -> isHybridScanCandidate -> prepareHybridScanCandidateSelection).

@imback82 (Contributor) commented:

  • "spark.hyperspace.index.hybridscan.delete.enabled"
  • "spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles"

Can you update these, which are not in the code base anymore?

Btw, this comment was regarding the PR description.

@imback82 (Contributor) left a review:

LGTM (a few minor comments), thanks @sezruby!

@imback82 (Contributor) commented:

@sezruby Can you update the PR description? It is now out of sync with code. Thanks!

@imback82 imback82 merged commit 7f36568 into microsoft:master Jan 26, 2021
@imback82 imback82 added this to the January 2021 milestone Jan 29, 2021
@sezruby sezruby deleted the addnewtags branch April 16, 2021 02:30