This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@apoorvedave1
Contributor

@apoorvedave1 apoorvedave1 commented Dec 15, 2020

What is the context for this pull request?

Note to reviewers: Please follow the tracking issue for the general algorithm used.

What changes were proposed in this pull request?

This PR adds support for incremental index refresh when scan patterns are provided:

hyperspace.refreshIndex("indexName", scanPattern="<scan-pattern>")

Therefore, the user would use this API in one of the following ways:

Data change in a particular partition:
refreshIndex("idx1", "incremental", scanPattern=Some("data/year=2020/month=10/day=29/hour=03/*"))
Data change in a particular hierarchy:
refreshIndex("idx1", "incremental", scanPattern=Some("data/year=2020/month=07/*"))

Supported scenarios:

// Case 1: Hive Partitioned Data
val df = spark.read.format(...).load("/path/to/data/")
hs.createIndex(..., IndexConfig("index", ...))
hs.refreshIndex("index", "incremental", Some("data/*"))  // everything
hs.refreshIndex("index", "incremental", Some("data/y=2020/*")) // everything from y=2020
hs.refreshIndex("index", "incremental", Some("data/y=*/m=10")) // everything where y = anything, m = 10
hs.refreshIndex("index", "incremental", Some("data/y=20*/m=10/")) // everything where y starts with 20, m = 10

// Case 2: Globbing Pattern Data
val df = spark.read.format(...).option("spark.hyperspace.source.globbingPattern", "/path/to/data/*/*/*").load("/path/to/data/*/*/*") // Globbing pattern option as per #269 
hs.createIndex(..., IndexConfig("index", ...))
hs.refreshIndex("index", "incremental", Some("data/*"))  // everything
hs.refreshIndex("index", "incremental", Some("data/a/*")) // everything from data/a
hs.refreshIndex("index", "incremental", Some("data/a/b/*")) // everything from data/a/b
hs.refreshIndex("index", "incremental", Some("data/*/b/")) // every folder at depth 2 which ends with b

Does this PR introduce any user-facing change?

Yes, this PR introduces a new refresh API that can take a scan pattern.

Constraints:
The scan pattern MUST start with the data folder that was picked for indexing. In the above examples, the scan pattern starts with "data", which is the base path used for indexing.
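
For illustration (paths are hypothetical), given an index created over /path/to/data:

hs.refreshIndex("index", "incremental", Some("data/y=2020/*"))  // OK: pattern starts from the "data" folder
hs.refreshIndex("index", "incremental", Some("y=2020/*"))       // invalid: does not start from the "data" folder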

How was this patch tested?

Unit tests added.

@apoorvedave1 apoorvedave1 self-assigned this Dec 15, 2020
@apoorvedave1 apoorvedave1 added the enhancement New feature or request label Dec 15, 2020
@apoorvedave1 apoorvedave1 changed the title Incremental Refresh support with scan patterns [WIP] Incremental Refresh support with scan patterns Dec 16, 2020
@apoorvedave1
Contributor Author

Note to reviewers: Tests for delete scenarios have not been added yet. Please take a look at the general idea and share comments. Nit comments are okay but not necessary at this time. Thank you.

@apoorvedave1 apoorvedave1 marked this pull request as draft December 16, 2020 01:10

/** df representing the complete set of data files which will be indexed once this refresh action
 * finishes. */
override protected lazy val df = {
Contributor Author

Note: This df represents all the data that will be indexed after this Action finishes.
E.g., if the previously indexed files are f1, f2, f3,
the newly added files are f4, f5, f6,
but the scan pattern matches only f4,

then this df will represent f1, f2, f3, f4.

df is used in the creation of metadata to represent the current snapshot of indexed data files (as before).

Contributor Author

This is also required for correct signature calculation of the indexed data files.

Comment on lines 52 to 55
def isMatch(path: String, scanPattern: String): Boolean = {
  val scanSplits = scanPattern.split(Path.SEPARATOR)
  scanSplits.nonEmpty && path.split(Path.SEPARATOR).contains(scanSplits.head)
}
Contributor Author

Returns true if a path matches the scan pattern.
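
For illustration, a couple of hypothetical calls (not from the PR's tests). Note that this check only requires the first segment of the scan pattern ("data" here) to appear somewhere in the path:

isMatch("c:/root/data/y=2020/part-0.parquet", "data/y=20*")   // true: "data" is a segment of the path
isMatch("c:/root/other/y=2020/part-0.parquet", "data/y=20*")  // false: no "data" segment in the path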

Comment on lines 57 to 76
def resolve(path: String, scanPattern: String): String = {
  val scanSplits: Array[String] = scanPattern.split(Path.SEPARATOR)
  val pathSplits: Array[String] = path.split(Path.SEPARATOR)
  val splitPoint: Int = pathSplits.lastIndexOf(scanSplits.head)
  var (prefix, suffix) = pathSplits.splitAt(splitPoint)

  for (j <- 0 until math.max(scanSplits.length, suffix.length)) {
    val resolvedPart = (Try(suffix(j)), Try(scanSplits(j))) match {
      case (Success(path), Success(scan)) if FilenameUtils.wildcardMatch(path, scan) => path
      case (Success(path), Success(scan)) if FilenameUtils.wildcardMatch(scan, path) => scan
      case (Success(_), Success(_)) => throw new Exception("Incompatible scan pattern")
      case (Success(path), Failure(_)) => path
      case (Failure(_), Success(scan)) => scan
      case _ => throw new Exception("Unexpected Exception")
    }

    prefix :+= resolvedPart
  }
  prefix.mkString(Path.SEPARATOR)
}
Contributor Author

Resolves a path with a scan pattern. E.g.

path = "c:/data/"
scanPattern = "data/y=20*"
resolvedPath = "c:/data/y=20*"

Another example:
path = "c:/root/data/*/*/*"
scanPattern = "data/a*"
resolvedPath = "c:/root/data/a*/*/*"   // for every level, we pick the more restrictive part. e.g. from * and a* we choose a* because it's more restrictive.

Comment on lines 130 to 157
override protected lazy val currentFiles: Set[FileInfo] = {
  val relation = previousIndexLogEntry.relations.head
  val dataSchema = DataType.fromJson(relation.dataSchemaJson).asInstanceOf[StructType]
  val changedDf = spark.read
    .schema(dataSchema)
    .format(relation.fileFormat)
    .options(relation.options)
    .load(resolvedPaths: _*)
  changedDf.queryExecution.optimizedPlan
    .collect {
      case LogicalRelation(
          HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
          _,
          _,
          _) =>
        location
          .allFiles()
          .map { f =>
            // For each file, if it already has a file id, add that id to its corresponding
            // FileInfo. Note that if content of an existing file is changed, it is treated
            // as a new file (i.e. its current file id is no longer valid).
            val id = fileIdTracker.addFile(f)
            FileInfo(f, id, asFullPath = true)
          }
    }
    .flatten
    .toSet
}
Contributor Author

currentFiles represents ALL files which match the scan pattern. Some of these files may already be indexed.

Comment on lines +118 to +128
override protected lazy val deletedFiles: Seq[FileInfo] = {
  // Helper function to check if a file belongs to one of the resolved paths.
  def fromResolvedPaths(file: FileInfo): Boolean = {
    resolvedPaths.exists(p => FilenameUtils.wildcardMatch(file.name, p))
  }

  val originalFiles = previousIndexLogEntry.relations.head.data.properties.content.fileInfos
    .filter(fromResolvedPaths)

  (originalFiles -- currentFiles).toSeq
}
Contributor Author

Deleted files which match the scan pattern. We do not handle deleted files outside of the scan pattern, to avoid a full file listing.
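
Putting the two notes together, a hypothetical illustration of how the sets relate (plain strings stand in for FileInfo entries, and f1, f2, f3 are all assumed to match the scan pattern):

val previouslyIndexed = Set("f1", "f2", "f3")      // recorded in the previous index log entry
val currentFiles = Set("f1", "f2", "f4")           // all files currently matching the scan pattern
val appended = currentFiles -- previouslyIndexed   // Set("f4"): picked up by this refresh
val deleted = previouslyIndexed -- currentFiles    // Set("f3"): treated as deleted within the pattern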

Comment on lines 76 to 89
// Store basePath of hive-partitioned data sources, if applicable.
val basePath = options.get("basePath") match {
  case None => PathUtils.extractBasePath(location.partitionSpec())
  case p => p
}

// The "path" key in options can incur multiple data reads unexpectedly, so remove it.
// Since "options" is a case-insensitive map, it will change any previous entries of
// "basePath" to lowercase "basepath", making it unusable.
// Remove lowercase "basepath" and add the properly cased "basePath".
val opts = basePath match {
  case Some(path) => Map("basePath" -> path) ++ options - "path" - "basepath"
  case _ => options - "path" - "basepath"
}
Contributor Author

This is from another dependency PR, #281.

Comment on lines +31 to +51
/**
 * Extract the base data source path from a given partition spec.
 * @param partitionSpec PartitionSpec.
 * @return Optional base path if the partition spec is non-empty. Else, None.
 */
def extractBasePath(partitionSpec: PartitionSpec): Option[String] = {
  if (partitionSpec == PartitionSpec.emptySpec) {
    None
  } else {
    // For example, we could have the following in PartitionSpec:
    // - partition columns = "col1", "col2"
    // - partitions: "/path/col1=1/col2=1", "/path/col1=1/col2=2", etc.
    // Going up the same number of directory levels as the number of partition columns
    // will compute the base path. Note that PartitionSpec.partitions will always contain
    // all the partitions in the path, so "partitions.head" is taken as an initial value.
    val basePath = partitionSpec.partitionColumns
      .foldLeft(partitionSpec.partitions.head.path)((path, _) => path.getParent)
    Some(basePath.toString)
  }
}

Contributor Author

from dependency PR #281

Collaborator

Could you clean up extractBasePath impl?

@apoorvedave1 apoorvedave1 changed the title [WIP] Incremental Refresh support with scan patterns Incremental Refresh support with scan patterns Dec 17, 2020
@apoorvedave1 apoorvedave1 marked this pull request as ready for review December 17, 2020 00:25
Collaborator

@sezruby sezruby left a comment

Could you simply measure and share the performance described in "Success criteria" in #233 before moving forward?

I think the only gain is not performing listFiles for all source files (list files => toSet => calc appended/deleted files), but a user would have to specify the exact scan patterns for every update (if not, the index won't be applied).
If a user wants to run a query only for that portion of the data, then that makes sense, though it makes the index hard to maintain.

/** df representing the complete set of data files which will be indexed once this refresh action
 * finishes. */
override protected lazy val df = {
  val relation = previousIndexLogEntry.relations.head
Collaborator

Need to call val latestRelation = Hyperspace.getContext(spark).sourceProviderManager.refreshRelation(relations.head)

changedDf.queryExecution.optimizedPlan
  .collect {
    case LogicalRelation(
        HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
Collaborator

We need to use the SourceProvider API. Also, if we don't support Delta Lake or other sources, we need to check the source type and throw a proper exception.
Please create a function like "isScanPatternRefreshSupported" and throw an exception.
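
A rough sketch of the kind of guard being suggested (entirely hypothetical; the real check would go through the SourceProvider API rather than a hard-coded format list):

// Hypothetical sketch only: reject scan-pattern refresh for sources that are not supported yet.
def isScanPatternRefreshSupported(fileFormat: String): Boolean =
  Set("parquet", "csv", "json", "orc", "text").contains(fileFormat.toLowerCase)

def validateScanPatternRefresh(fileFormat: String): Unit =
  if (!isScanPatternRefreshSupported(fileFormat)) {
    throw new UnsupportedOperationException(
      s"Scan pattern based refresh is not supported for source format: $fileFormat")
  }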

@apoorvedave1
Contributor Author

apoorvedave1 commented Jan 11, 2021

Could you simply measure and share the performance described in "Success criteria" in #233 before moving forward?

I think the only gain is not performing listFiles for all source files (list files => toSet => calc appended/deleted files), but a user would have to specify the exact scan patterns for every update (if not, the index won't be applied).
If a user wants to run a query only for that portion of the data, then that makes sense, though it makes the index hard to maintain.

Yes, that's the only goal. If the user appended a small batch of files, we could avoid listing all files. I am working on the perf part, but other than that, could you please review the code for functionality? (Thanks for the other comments as well.)

(We are not aiming for this specialized scenario where the user has partial indexes and also wants to query the partial data set on which the index is created)

cc @rapoth, @imback82
Question to @rapoth: Does the customer see listing files during index creation as a bottleneck, or does the customer really want to refresh only a couple of partitions while data still comes in for other partitions?
As @sezruby pointed out above, if the customer only refreshes the index on partial data (which satisfies the scan pattern), the index will not be applied (without enabling Hybrid Scan).

@imback82
Contributor

Question to @rapoth: Does the customer see listing files during index creation as a bottleneck, or does the customer really want to refresh only a couple of partitions while data still comes in for other partitions?
As @sezruby pointed out above, if the customer only refreshes the index on partial data (which satisfies the scan pattern), the index will not be applied (without enabling Hybrid Scan).

We should be able to figure out whether the partition in question is up to date (instead of enabling hybrid scan), no?

@rapoth
Contributor

rapoth commented Jan 13, 2021

Could you simply measure and share the performance described in "Success criteria" in #233 before moving forward?
I think the only gain is not performing listFiles for all source files (list files => toSet => calc appended/deleted files), but a user would have to specify the exact scan patterns for every update (if not, the index won't be applied).
If a user wants to run a query only for that portion of the data, then that makes sense, though it makes the index hard to maintain.

Yes, that's the only goal. If the user appended a small batch of files, we could avoid listing all files. I am working on the perf part, but other than that, could you please review the code for functionality? (Thanks for the other comments as well.)

(We are not aiming for this specialized scenario where the user has partial indexes and also wants to query the partial data set on which the index is created)

cc @rapoth, @imback82
Question to @rapoth: Does the customer see listing files during index creation as a bottleneck, or does the customer really want to refresh only a couple of partitions while data still comes in for other partitions?
As @sezruby pointed out above, if the customer only refreshes the index on partial data (which satisfies the scan pattern), the index will not be applied (without enabling Hybrid Scan).

The specific ask has been only for the case where refresh is needed on a couple of partitions. How can we solve this: listing file during index creation? Isn't that mandatory? Maybe I am missing something subtle here.

@sezruby
Collaborator

sezruby commented Jan 13, 2021

Actually, I doubt the usability of this feature. What if a user tries to refresh the index with
hs.refreshIndex("index", "incremental", Some("data/y=2020/*")) after hs.refreshIndex("index", "incremental", Some("data/y=2021/*"))?

To apply the index, a user should use the exact file lists when writing the query, without Hybrid Scan. In other words, the relation in their query should contain only the original source file list + the refreshed source file lists (with scan patterns), historically.
In addition, partition filter pushdown might not work with partially refreshed indexes.

@rapoth @imback82 WDYT?

@imback82
Contributor

To apply the index, a user should use the exact file lists when writing the query, without Hybrid Scan. In other words, the relation in their query should contain only the original source file list + the refreshed source file lists (with scan patterns), historically.

I assume if the user is using this feature, they would select the right partition in their query.

I think we need to look at the partition columns in the user query and only consider source files under those partitions when applying the index & Hybrid Scan. (This should improve the Hybrid Scan scenario quite a bit.)


/** paths resolved with scan pattern.
* Paths merged With scan pattern to choose more selective option
* e.g. if rootPath
Collaborator

Could you update this?

} else {
  // New entry.
  entry
}
Collaborator

Could you try this?

val updatedEntry = super[RefreshIncrementalAction].logEntry()

entry
}

val relation = entry.source.plan.properties.relations.head
Collaborator

Suggested change:
- val relation = entry.source.plan.properties.relations.head
+ val relation = updatedEntry.source.plan.properties.relations.head

val relation = entry.source.plan.properties.relations.head
val updatedRelation =
  relation.copy(rootPaths = previousIndexLogEntry.relations.head.rootPaths)
updatedEntry.copy(
Collaborator

This seems unclear. Instead, how about passing an ORIGINAL_ROOT_PATHS option via df.options? We could add it in val df = and exclude it in createRelation.

// Root paths could be directories or leaf files. Make sure that all root paths either
// match the glob paths, in case of directories, or belong to glob paths, in case of
// files.
if (!location.rootPaths.forall(p =>
Collaborator

Is this relevant to this PR or a missing part in the previous PR?

// `InMemoryFileIndex` implementation returns `SerializableFileStatus` instead of the
// standard API's `FileStatus`.
index.allFiles.map(_.asInstanceOf[FileStatus])
index.allFiles
Collaborator

Please keep this implementation (see above comments)


import com.microsoft.hyperspace.index.IndexConstants.GLOBBING_PATTERN_KEY
import com.microsoft.hyperspace.util.PathUtils

class RefreshScanTestsGlobData extends QueryTest with HyperspaceSuite {
Collaborator

Could you add tests for the Delta Lake source relation? Since it may require additional code changes, we should add the tests in this change. Otherwise, we could open a new PR for Delta Lake tests based on this PR to check the behavior.

@clee704 clee704 removed this from the December 2020 milestone Jun 15, 2021
@clee704 clee704 added the stale There was no activity for this issue/PR for a while label Jun 15, 2021

Labels

enhancement (New feature or request), stale (There was no activity for this issue/PR for a while)

Projects

None yet

5 participants