This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@apoorvedave1
Contributor

@apoorvedave1 apoorvedave1 commented Dec 15, 2020

What is the context for this pull request?

Note to reviewers: Please follow the tracking issue for the general algorithm used.

What changes were proposed in this pull request?

This PR adds support for incremental index refresh when scan patterns are provided:

hyperspace.refreshIndex("indexName", scanPattern="<scan-pattern>")

Therefore, the user would use this API in one of the following ways:

Data change in a particular partition:
refreshIndex("idx1", "incremental", scanPattern=Some("data/year=2020/month=10/day=29/hour=03/*"))
Data change in a particular hierarchy:
refreshIndex("idx1", "incremental", scanPattern=Some("data/year=2020/month=07/*"))

Supported scenarios:

// Case 1: Hive Partitioned Data
val df = spark.read.format(...).load("/path/to/data/")
hs.createIndex(..., IndexConfig("index", ...))
hs.refreshIndex("index", "incremental", Some("data/*"))  // everything
hs.refreshIndex("index", "incremental", Some("data/y=2020/*")) // everything from y=2020
hs.refreshIndex("index", "incremental", Some("data/y=*/m=10")) // everything where y = anything, m = 10
hs.refreshIndex("index", "incremental", Some("data/y=20*/m=10/")) // everything where y starts with 20, m = 10

// Case 2: Globbing Pattern Data
val df = spark.read.format(...).option("spark.hyperspace.source.globbingPattern", "/path/to/data/*/*/*").load("/path/to/data/*/*/*") // Globbing pattern option as per #269 
hs.createIndex(..., IndexConfig("index", ...))
hs.refreshIndex("index", "incremental", Some("data/*"))  // everything
hs.refreshIndex("index", "incremental", Some("data/a/*")) // everything from data/a
hs.refreshIndex("index", "incremental", Some("data/a/b/*")) // everything from data/a/b
hs.refreshIndex("index", "incremental", Some("data/*/b/")) // every folder at depth 2 which ends with b

Does this PR introduce any user-facing change?

Yes, this PR introduces a new refresh API that can take a scan pattern.

Constraints:
The scan pattern MUST start with the data folder that was picked for indexing. In the above examples, the scan pattern starts with "data", which is the base path used for indexing.
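
For illustration (paths are hypothetical), given an index created over /path/to/data:

hs.refreshIndex("index", "incremental", Some("data/y=2020/*"))  // OK: pattern starts from the "data" folder
hs.refreshIndex("index", "incremental", Some("y=2020/*"))       // invalid: does not start from the "data" folder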

How was this patch tested?

Unit tests added.

@apoorvedave1 apoorvedave1 self-assigned this Dec 15, 2020
@apoorvedave1 apoorvedave1 added the enhancement New feature or request label Dec 15, 2020
@apoorvedave1 apoorvedave1 changed the title Incremental Refresh support with scan patterns [WIP] Incremental Refresh support with scan patterns Dec 16, 2020
@apoorvedave1
Contributor Author

Note to reviewers: Tests for delete scenarios have not been added yet. Please take a look at the general idea and share comments. Nit comments are okay but not necessary at this time. Thank you.

@apoorvedave1 apoorvedave1 marked this pull request as draft December 16, 2020 01:10

/** df representing the complete set of data files which will be indexed once this refresh action
 * finishes. */
override protected lazy val df = {
Contributor Author

Note: This df represents all the data that will be indexed after this Action finishes.
E.g., if the previously indexed files are f1, f2, f3,
the newly added files are f4, f5, f6,
but the scan pattern matches only f4,

then this df will represent f1, f2, f3, f4.

df is used in the creation of metadata to represent the current snapshot of indexed data files (as before).

Contributor Author

This is also required for correct signature calculation of the indexed data files.

Comment on lines 52 to 55
def isMatch(path: String, scanPattern: String): Boolean = {
  val scanSplits = scanPattern.split(Path.SEPARATOR)
  scanSplits.nonEmpty && path.split(Path.SEPARATOR).contains(scanSplits.head)
}
Contributor Author

Returns true if a path matches the scan pattern.
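
For illustration, a couple of hypothetical calls (not from the PR's tests). Note that this check only requires the first segment of the scan pattern ("data" here) to appear somewhere in the path:

isMatch("c:/root/data/y=2020/part-0.parquet", "data/y=20*")   // true: "data" is a segment of the path
isMatch("c:/root/other/y=2020/part-0.parquet", "data/y=20*")  // false: no "data" segment in the path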

Comment on lines 57 to 76
def resolve(path: String, scanPattern: String): String = {
  val scanSplits: Array[String] = scanPattern.split(Path.SEPARATOR)
  val pathSplits: Array[String] = path.split(Path.SEPARATOR)
  val splitPoint: Int = pathSplits.lastIndexOf(scanSplits.head)
  var (prefix, suffix) = pathSplits.splitAt(splitPoint)

  for (j <- 0 until math.max(scanSplits.length, suffix.length)) {
    val resolvedPart = (Try(suffix(j)), Try(scanSplits(j))) match {
      case (Success(path), Success(scan)) if FilenameUtils.wildcardMatch(path, scan) => path
      case (Success(path), Success(scan)) if FilenameUtils.wildcardMatch(scan, path) => scan
      case (Success(_), Success(_)) => throw new Exception("Incompatible scan pattern")
      case (Success(path), Failure(_)) => path
      case (Failure(_), Success(scan)) => scan
      case _ => throw new Exception("Unexpected Exception")
    }

    prefix :+= resolvedPart
  }
  prefix.mkString(Path.SEPARATOR)
}
Contributor Author

Resolves a path with a scan pattern. E.g.

path = "c:/data/"
scanPattern = "data/y=20*"
resolvedPath = "c:/data/y=20*"

Another example:
path = "c:/root/data/*/*/*"
scanPattern = "data/a*"
resolvedPath = "c:/root/data/a*/*/*"   // for every level, we pick the more restrictive part. e.g. from * and a* we choose a* because it's more restrictive.

Comment on lines 130 to 157
override protected lazy val currentFiles: Set[FileInfo] = {
  val relation = previousIndexLogEntry.relations.head
  val dataSchema = DataType.fromJson(relation.dataSchemaJson).asInstanceOf[StructType]
  val changedDf = spark.read
    .schema(dataSchema)
    .format(relation.fileFormat)
    .options(relation.options)
    .load(resolvedPaths: _*)
  changedDf.queryExecution.optimizedPlan
    .collect {
      case LogicalRelation(
          HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
          _,
          _,
          _) =>
        location
          .allFiles()
          .map { f =>
            // For each file, if it already has a file id, add that id to its corresponding
            // FileInfo. Note that if content of an existing file is changed, it is treated
            // as a new file (i.e. its current file id is no longer valid).
            val id = fileIdTracker.addFile(f)
            FileInfo(f, id, asFullPath = true)
          }
    }
    .flatten
    .toSet
}
Contributor Author

currentFiles represents ALL files which match the scan pattern. Some of these files may already be indexed.

Comment on lines +118 to +128
override protected lazy val deletedFiles: Seq[FileInfo] = {
  // Helper function to check if a file belongs to one of the resolved paths.
  def fromResolvedPaths(file: FileInfo): Boolean = {
    resolvedPaths.exists(p => FilenameUtils.wildcardMatch(file.name, p))
  }

  val originalFiles = previousIndexLogEntry.relations.head.data.properties.content.fileInfos
    .filter(fromResolvedPaths)

  (originalFiles -- currentFiles).toSeq
}
Contributor Author

Deleted files which match the scan pattern. We do not handle deleted files outside of the scan pattern, to avoid a full file listing.
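
Putting the two notes together, a hypothetical illustration of how the sets relate (plain strings stand in for FileInfo entries, and f1, f2, f3 are all assumed to match the scan pattern):

val previouslyIndexed = Set("f1", "f2", "f3")      // recorded in the previous index log entry
val currentFiles = Set("f1", "f2", "f4")           // all files currently matching the scan pattern
val appended = currentFiles -- previouslyIndexed   // Set("f4"): picked up by this refresh
val deleted = previouslyIndexed -- currentFiles    // Set("f3"): treated as deleted within the pattern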

Comment on lines 76 to 89
// Store basePath of hive-partitioned data sources, if applicable.
val basePath = options.get("basePath") match {
  case None => PathUtils.extractBasePath(location.partitionSpec())
  case p => p
}

// The "path" key in options can incur multiple data reads unexpectedly, so remove it.
// Since "options" is a case-insensitive map, it will change any previous entries of
// "basePath" to lowercase "basepath", making it unusable.
// Remove lowercase "basepath" and add the properly cased "basePath".
val opts = basePath match {
  case Some(path) => Map("basePath" -> path) ++ options - "path" - "basepath"
  case _ => options - "path" - "basepath"
}
Contributor Author

This is from another dependency PR, #281.

Comment on lines +31 to +51
/**
 * Extract the base data source path from a given partition spec.
 * @param partitionSpec PartitionSpec.
 * @return Optional base path if the partition spec is non-empty. Else, None.
 */
def extractBasePath(partitionSpec: PartitionSpec): Option[String] = {
  if (partitionSpec == PartitionSpec.emptySpec) {
    None
  } else {
    // For example, we could have the following in PartitionSpec:
    // - partition columns = "col1", "col2"
    // - partitions: "/path/col1=1/col2=1", "/path/col1=1/col2=2", etc.
    // Going up the same number of directory levels as the number of partition columns
    // will compute the base path. Note that PartitionSpec.partitions will always contain
    // all the partitions in the path, so "partitions.head" is taken as an initial value.
    val basePath = partitionSpec.partitionColumns
      .foldLeft(partitionSpec.partitions.head.path)((path, _) => path.getParent)
    Some(basePath.toString)
  }
}

Contributor Author

from dependency PR #281

Collaborator

Could you clean up extractBasePath impl?

@apoorvedave1 apoorvedave1 changed the title [WIP] Incremental Refresh support with scan patterns Incremental Refresh support with scan patterns Dec 17, 2020
@apoorvedave1 apoorvedave1 marked this pull request as ready for review December 17, 2020 00:25
Collaborator

@sezruby sezruby left a comment

Could you simply measure and share the performance described in "Success criteria" in #233 before moving forward?

I think the only gain is not performing listFiles for all source files (list files => toSet => calc appended/deleted files), but a user would have to specify the exact scan patterns for every update (if not, the index won't be applied).
If a user wants to run a query only for that portion of the data, then that makes sense, though it makes the index hard to maintain.

/** df representing the complete set of data files which will be indexed once this refresh action
 * finishes. */
override protected lazy val df = {
  val relation = previousIndexLogEntry.relations.head
Collaborator

Need to call val latestRelation = Hyperspace.getContext(spark).sourceProviderManager.refreshRelation(relations.head)

changedDf.queryExecution.optimizedPlan
  .collect {
    case LogicalRelation(
        HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
Collaborator

We need to use the SourceProvider API. Also, if we don't support Delta Lake or other sources, we need to check the source type and throw a proper exception.
Please create a function like "isScanPatternRefreshSupported" and throw an exception.
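
A rough sketch of the kind of guard being suggested (entirely hypothetical; the real check would go through the SourceProvider API rather than a hard-coded format list):

// Hypothetical sketch only: reject scan-pattern refresh for sources that are not supported yet.
def isScanPatternRefreshSupported(fileFormat: String): Boolean =
  Set("parquet", "csv", "json", "orc", "text").contains(fileFormat.toLowerCase)

def validateScanPatternRefresh(fileFormat: String): Unit =
  if (!isScanPatternRefreshSupported(fileFormat)) {
    throw new UnsupportedOperationException(
      s"Scan pattern based refresh is not supported for source format: $fileFormat")
  }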

@apoorvedave1
Contributor Author

apoorvedave1 commented Jan 11, 2021

Could you simply measure and share the performance described in "Success criteria" in #233 before moving forward?

I think the only gain is not performing listFiles for all source files (list files => toSet => calc appended/deleted files), but a user would have to specify the exact scan patterns for every update (if not, the index won't be applied).
If a user wants to run a query only for that portion of the data, then that makes sense, though it makes the index hard to maintain.

Yes, that's the only goal. If the user appended a small batch of files, we could avoid listing all files. I am working on the perf part, but other than that, could you please review the code for functionality? (Thanks for the other comments as well.)

(We are not aiming for this specialized scenario where the user has partial indexes and also wants to query the partial data set on which the index is created)

cc @rapoth, @imback82
Question to @rapoth: Does the customer see listing files during index creation as a bottleneck, or does the customer really want to refresh only a couple of partitions while data still comes in for other partitions?
As @sezruby pointed out above, if the customer only refreshes the index on partial data (which satisfies the scan pattern), the index will not be applied (without enabling Hybrid Scan).

@imback82
Contributor

Question to @rapoth: Does the customer see listing files during index creation as a bottleneck, or does the customer really want to refresh only a couple of partitions while data still comes in for other partitions?
As @sezruby pointed out above, if the customer only refreshes the index on partial data (which satisfies the scan pattern), the index will not be applied (without enabling Hybrid Scan).

We should be able to figure out whether the partition in question is up to date (instead of enabling hybrid scan), no?

@rapoth
Contributor

rapoth commented Jan 13, 2021

Could you simply measure and share the performance described in "Success criteria" in #233 before moving forward?
I think the only gain is not performing listFiles for all source files (list files => toSet => calc appended/deleted files), but a user would have to specify the exact scan patterns for every update (if not, the index won't be applied).
If a user wants to run a query only for that portion of the data, then that makes sense, though it makes the index hard to maintain.

Yes, that's the only goal. If the user appended a small batch of files, we could avoid listing all files. I am working on the perf part, but other than that, could you please review the code for functionality? (Thanks for the other comments as well.)

(We are not aiming for this specialized scenario where the user has partial indexes and also wants to query the partial data set on which the index is created)

cc @rapoth, @imback82
Question to @rapoth: Does the customer see listing files during index creation as a bottleneck, or does the customer really want to refresh only a couple of partitions while data still comes in for other partitions?
As @sezruby pointed out above, if the customer only refreshes the index on partial data (which satisfies the scan pattern), the index will not be applied (without enabling Hybrid Scan).

The specific ask has been only for the case where refresh is needed on a couple of partitions. How can we solve this: listing file during index creation? Isn't that mandatory? Maybe I am missing something subtle here.

@sezruby
Collaborator

sezruby commented Jan 13, 2021

Actually, I doubt the usability of this feature. What if a user tries to refresh the index with
hs.refreshIndex("index", "incremental", Some("data/y=2020/*")) after hs.refreshIndex("index", "incremental", Some("data/y=2021/*"))?

To apply the index, a user should use the exact file lists when writing the query, without Hybrid Scan. In other words, the relation in their query should contain only the original source file list + the refreshed source file lists (with scan patterns), historically.
In addition, partition filter pushdown might not work with partially refreshed indexes.

@rapoth @imback82 WDYT?

@imback82
Contributor

To apply the index, a user should use the exact file lists when writing the query, without Hybrid Scan. In other words, the relation in their query should contain only the original source file list + the refreshed source file lists (with scan patterns), historically.

I assume if the user is using this feature, they would select the right partition in their query.

I think we need to look at the partition columns in the user query and only consider source files under those partitions when applying the index & Hybrid Scan. (This should improve the Hybrid Scan scenario quite a bit.)


/** paths resolved with scan pattern.
* Paths merged With scan pattern to choose more selective option
* e.g. if rootPath
Collaborator

Could you update this?

} else {
  // New entry.
  entry
}
Collaborator

Could you try this?

val updatedEntry = super[RefreshIncrementalAction].logEntry()

entry
}

val relation = entry.source.plan.properties.relations.head
Collaborator

Suggested change:
- val relation = entry.source.plan.properties.relations.head
+ val relation = updatedEntry.source.plan.properties.relations.head

val relation = entry.source.plan.properties.relations.head
val updatedRelation =
  relation.copy(rootPaths = previousIndexLogEntry.relations.head.rootPaths)
updatedEntry.copy(
Collaborator

This seems unclear. Instead, how about passing an ORIGINAL_ROOT_PATHS option via df.options? We could add it in val df = and exclude it in createRelation.

// Root paths could be directories or leaf files. Make sure that all root paths either
// match the glob paths, in case of directories, or belong to glob paths, in case of
// files.
if (!location.rootPaths.forall(p =>
Collaborator

Is this relevant to this PR or a missing part in the previous PR?

// `InMemoryFileIndex` implementation returns `SerializableFileStatus` instead of the
// standard API's `FileStatus`.
index.allFiles.map(_.asInstanceOf[FileStatus])
index.allFiles
Collaborator

Please keep this implementation (see above comments)


import com.microsoft.hyperspace.index.IndexConstants.GLOBBING_PATTERN_KEY
import com.microsoft.hyperspace.util.PathUtils

class RefreshScanTestsGlobData extends QueryTest with HyperspaceSuite {
Collaborator

Could you add tests for the Delta Lake source relation? Since it may require additional code changes, we should add the tests in this change. Otherwise, we could open a new PR for Delta Lake tests based on this PR to check the behavior.

@clee704 clee704 removed this from the December 2020 milestone Jun 15, 2021
@clee704 clee704 added the stale There was no activity for this issue/PR for a while label Jun 15, 2021

Labels

enhancement (New feature or request), stale (There was no activity for this issue/PR for a while)

Projects

None yet

5 participants