-
Notifications
You must be signed in to change notification settings - Fork 116
Conversation
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
Outdated
Show resolved
Hide resolved
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did one round of review. Curious to see how @sezruby's comment will change.
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshIncrementalAction.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshIncrementalAction.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall approach seems fine to me.
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
| // For a given file, file id is only meaningful in the context of a given | ||
| // index. At this point, we do not know which index, if any, would be picked. | ||
| // Therefore, we simply set the file id to UNKNOWN_FILE_ID. | ||
| FileInfo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR, but question: @sezruby can we not use FileInfo here (so that we can remove UNKNOWN_FILE_ID)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 FileInfo now has id and id is meaningful in the context of a given index. At this point, we dont really know which index is gonna be picked, if any. So id is really unknown here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't isHybridScanCandidate updated to take in Seq[FileStatus] instead? Seems like we need to change only val commonCnt = inputSourceFiles.count(entry.sourceFileInfoSet.contains) to work with FileStatus? (btw, we don't need to address this in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's not necessarily FileInfo and also we can utilize tag feature to keep appended and deleted file info in getCandidateIndexes. I'll create an issue for this. Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala
Outdated
Show resolved
Hide resolved
|
@sezruby Any other comments? I plan to do a release on 11/17 PST. Thanks! |
sezruby
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM Thanks!
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
| // For a given file, file id is only meaningful in the context of a given | ||
| // index. At this point, we do not know which index, if any, would be picked. | ||
| // Therefore, we simply set the file id to UNKNOWN_FILE_ID. | ||
| FileInfo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it's not necessarily FileInfo and also we can utilize tag feature to keep appended and deleted file info in getCandidateIndexes. I'll create an issue for this. Thanks
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // For a given file, file id is only meaningful in the context of a given | ||
| // index. At this point, we do not know which index, if any, would be picked. | ||
| // Therefore, we simply set the file id to UNKNOWN_FILE_ID. | ||
| FileInfo( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/RefreshIndexTests.scala
Outdated
Show resolved
Hide resolved
| /** | ||
| * Provides functionality to generate unique file ids for files. | ||
| */ | ||
| class FileIdTracker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class has many logics to validate and I think we need unit tests for this class. I will create an issue to follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #258
What is the context for this pull request?
This PR makes two major changes:
For more details on the design and implementation, check below proposal.
What changes were proposed in this pull request?
Assign unique ids to source data files and index files
When creating/refreshing an index, Hyperspace generates a unique id per file and use it to track the files.
Combination of (filePath, size, modificationTime) is used as the key to identify a unique file for id generation. Therefore, if content of an existing file changes, Hyperspace treats it as a new file and assigns a new id to it.
This id is stored in
FileInfoand shows up in the index metadata:Change lineage from String to numeric id values
The data type for the lineage column is changed from String to an integral data type and instead of storing full path to source data file as lineage value, we store the id associated to that file as the lineage value.
Change create index plan for lineage
Instead of using a UDF to calculate/fill lineage values, the value of lineage is filled by doing a (broadcast) join between the index DF and a lineage DF that maps each source data file to its file id.
Here is a an example Create Index plan:
Performance evaluation
Robustness and performance of above changes is evaluated on a workload consisting of 37 indexes of various sizes on data generated via TPCDS dsdgen for SF=1000GB and Numeric lineage with Join showed 18% to 23% of performance improvement (depending on source data file physical layout) and 5X less storage bloat when creating indexes comparing to storing full file path as lineage values.
Does this PR introduce any user-facing change?
Yes, it affects index metadata and adds new fields to it.
How was this patch tested?
Existing test cases are modified to adopt the change and are used to verify it.