This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby (Collaborator) commented on Dec 11, 2020

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

What changes were proposed in this pull request?

Introduce new plan tags to avoid duplicating work while applying rules.

  • The `SIGNATURE_MATCHED` tag caches the result of `signatureValid` for each relation/index pair and reuses it when present. Since no config affects this result, it can be reused without any check.
  • The `IS_HYBRIDSCAN_CANDIDATE` tag caches the result of `isHybridScanCandidate` for each relation/index pair. Changing Hybrid Scan related configs can change this result, so before using the tag, its value is reset if the related configs have changed.
  • Related configs:
    • "spark.hyperspace.index.hybridscan.maxDeletedRatio"
    • "spark.hyperspace.index.hybridscan.maxAppendedRatio"
  • `HYBRIDSCAN_RELATED_CONFIGS` keeps the related config values (see the sketch below).
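
As a rough illustration of the tag-caching pattern (a sketch only: the wrapper name `cachedSignatureValid` is made up, while the `getTagValue`/`setTagValue` accessors match the snippets quoted in the review below):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

    // Sketch: compute `signatureValid` once per (plan, index) and cache it in
    // a plan tag; later rule applications reuse the cached value.
    def cachedSignatureValid(index: IndexLogEntry, plan: LogicalPlan): Boolean = {
      index.getTagValue(plan, SIGNATURE_MATCHED) match {
        case Some(cached) => cached // no config affects this result, so reuse is safe
        case None =>
          val result = signatureValid(index, plan) // the expensive check, done once
          index.setTagValue(plan, SIGNATURE_MATCHED, result)
          result
      }
    }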
Test results

Test data

  • 100k chunk lineitem table
  • 1 non-candidate index (but with an applicable column schema, so that getCandidateIndexes performance can be measured)
  • unit: ms

Test scripts

val linetable = spark.read.parquet(tableName)

val filter = linetable.filter(linetable("l_orderkey") isin (1234, 12341234, 123456)).select("l_orderkey")
measure(filter.queryExecution.optimizedPlan)

val filter = linetable.filter(linetable("l_orderkey") isin (1234, 12341234, 123456)).select("l_orderkey")
measure(filter.queryExecution.optimizedPlan)

val filter = linetable.filter(linetable("l_orderkey") isin (1234, 12341234, 123456)).select("l_orderkey")
measure(filter.queryExecution.optimizedPlan)
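
The `measure` helper is not shown above; a plausible sketch (an assumption, not the author's actual helper) is a by-name timer, so that forcing the lazy `optimizedPlan`, where Hyperspace's rules run, happens inside the timed region:

    // Hypothetical `measure`: times evaluation of its by-name argument in ms.
    def measure[T](body: => T): T = {
      val start = System.currentTimeMillis()
      val result = body // forces `optimizedPlan`, triggering the optimizer rules
      println(s"duration: ${System.currentTimeMillis() - start}")
      result
    }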

Result w/o this change:

// Non Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1533
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1359
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1347

// Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 3505
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1314
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1392

Result with this change:

// Non Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1690
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 11
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 10

// Hybrid Scan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 1190
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 313 // cost from constructing the list of FileInfo for the given plan
filter: org.apache.spark.sql.DataFrame = [l_orderkey: bigint]
duration: 306

Does this PR introduce any user-facing change?

No, it's an internal optimization.

How was this patch tested?

Existing tests cover these changes.

@sezruby sezruby self-assigned this Dec 11, 2020
@sezruby sezruby added the `enhancement` label Dec 11, 2020
@imback82 (Contributor) commented:

> Result (only this PR - including InMemoryFileIndex cost)

So, what's the summary? I wasn't sure which number to look at to see the gain. Also, is the duration in milliseconds?

@sezruby (Collaborator, Author) commented on Jan 22, 2021

@imback82 I updated the test results with the two different binaries.
The previous result was incorrect because it included the cost of cachedIndexCollectionManager.

@imback82 (Contributor) commented:

Cool, the result makes more sense now. I will get to the review this weekend. @apoorvedave1 / @pirz can you review this PR in the meantime? Thanks!

@imback82 (Contributor) left a review:

General approach seems fine to me.


indexes.foreach { index =>
  val taggedConfigs = index.getTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE)
  if (taggedConfigs.isEmpty || !taggedConfigs.get.equals(curConfigs)) {

nit: prefer using `map` or `foreach` over the `taggedConfigs.isEmpty || taggedConfigs.get` pattern:

      index.getTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE).foreach { taggedConfigs =>
        if (!taggedConfigs.equals(curConfigs)) {
          // Need to reset cached tags as these config changes can change the result.
          index.unsetTagValue(plan, IS_HYBRIDSCAN_CANDIDATE)
          index.setTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE, curConfigs)
        }
      }
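
For completeness, a variant of this suggestion that also covers the first run, when no configs have been captured yet (a sketch; `Option.contains` folds the original `isEmpty || !equals` condition into one check):

    indexes.foreach { index =>
      val captured = index.getTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE)
      if (!captured.contains(curConfigs)) {
        // First run or config change: the cached candidacy may be stale.
        index.unsetTagValue(plan, IS_HYBRIDSCAN_CANDIDATE)
        index.setTagValue(plan, HYBRIDSCAN_CONFIG_CAPTURE, curConfigs)
      }
    }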

@imback82 (Contributor) commented:

  • "spark.hyperspace.index.hybridscan.delete.enabled"
  • "spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles"

Can you update these, which are not in the code base anymore?

    isCandidate
  }

def prepareHybridScanCandidateSelection(

Not related to this PR, but we need to think about refactoring RuleUtils.scala. This is already two levels deep (getCandidateIndexes -> isHybridScanCandidate -> prepareHybridScanCandidateSelection).

@imback82 (Contributor) commented:

  • "spark.hyperspace.index.hybridscan.delete.enabled"
  • "spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles"

Can you update these, which are not in the code base anymore?

Btw, this comment was regarding the PR description.

@imback82 (Contributor) left a review:

LGTM (a few minor comments), thanks @sezruby!

@imback82 (Contributor) commented:

@sezruby Can you update the PR description? It is now out of sync with code. Thanks!

@imback82 imback82 merged commit 7f36568 into microsoft:master Jan 26, 2021
@imback82 imback82 added this to the January 2021 milestone Jan 29, 2021
@sezruby sezruby deleted the addnewtags branch April 16, 2021 02:30