-
Notifications
You must be signed in to change notification settings - Fork 116
Update index log entry for enforce delete during read time #170
Conversation
| import com.microsoft.hyperspace.HyperspaceException | ||
| import com.microsoft.hyperspace.index.{Content, IndexDataManager, IndexLogManager} | ||
|
|
||
| private[actions] abstract class RefreshDeleteActionBase( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Reviewers:
This class is a simple code refactor. validate and deletedFiles defs are copied from RefreshDeleteAction class here as they are shared between RefreshDeleteAction and DeleteOnReadAction classes.
The code is the same as before, except for validate that now has an extra check.
| final override protected def event(appInfo: AppInfo, message: String): HyperspaceEvent = { | ||
| RefreshDeleteActionEvent(appInfo, logEntry.asInstanceOf[IndexLogEntry], message) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Reviewers: validate is moved to (new) class RefreshDeleteActionBase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for letting us know.
src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteAction.scala
Show resolved
Hide resolved
|
This PR is marked as WIP and I was requested to review. Usually, WIP means "not ready for review" yet. I will mark this PR as draft, and please ping me back when you convert this back to a regular PR (click "Ready for review"). |
@imback82 It is no longer a draft, ready to be reviewed. |
src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/DeleteOnReadAction.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteActionBase.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/DeleteOnReadAction.scala
Outdated
Show resolved
Hide resolved
src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteActionBase.scala
Outdated
Show resolved
Hide resolved
src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTests.scala
Outdated
Show resolved
Hide resolved
|
@pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens. |
@pirz This is the specific comment I was interested in seeing being reflected in the PR. Please refer to your previous PR where we wrote down step by step user actions and how this PR fits in. |
| import com.microsoft.hyperspace.HyperspaceException | ||
| import com.microsoft.hyperspace.index.{Content, IndexDataManager, IndexLogManager} | ||
|
|
||
| private[actions] abstract class RefreshDeleteActionBase( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To Reviewers This class needs a rename, as its children are no longer dealing only with delete-related items. Any suggestion?
PR description is now updated accordingly. Please kindly take a look. |
|
@pirz I spent some time thinking about this. What are your thoughts on this?
Please let me know if this makes sense. To complete the loop, for append, here's what would happen:
|
|
Sure @rapoth - Thanks. Here are some questions:
I assume you mean ".. for enabling Quick refresh" as metadata changes are gonna be used in quick mode.
I assume this does not include RefreshDeleteAction (which is already in master), as that is a standalone class and deals fully with delete (will be used in smart mode).
Sure, but can you clarify how that consolidated code path will be used? (Sorry, I am a bit confused for this part). |
|
@rapoth There is no use of "appended" file list in the description. If we always build the incremental index, it's not necessary to keep the appended file list. I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata') And for naming, I prefer |
Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct? If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like @sezruby Will this |
Yes, sorry my bad.
Yep - let's wait for @imback82 to make the final recommendation on the class names and hierarchies.
I think you and @sezruby should sync up to see how we can reuse that portion of the hybrid scan code path.
Yes, that's right. |
Sorry, but I am a bit confused here. PR description was updated and it was talking about "appended" and "deleted" files, before these comments. Can you please clarify if we are talking about this PR or some other PR? |
Thanks! I'm good from my side. Apparently, I was looking at an alternate version. @sezruby ? |
Sorry for the late response (busy at work 😄). The names sound reasonable, but how about Now that we are introducing metadata only action, I think we may need to revisit the class hierarchy for refresh. For example, making |
Yes, being explicit about Metadata in the Action name is a good call. Thanks! |
|
#188 addressed parts of the changes in this PR. New metadata action will be added as a separate PR. |
What is the context for this pull request?
This PR adds changes to index metadata for capturing list of source data files deleted or appended and updated index fingerprint accordingly.
List of deleted files are needed as part of adding support for enforcing delete during read time.
What changes were proposed in this pull request?
This PR adds changes for updating index metadata once some data files are deleted from or appended to an index source data files.
This is done by:
IndexLogEntrystructure to save:deleted, andappended.deletedand appended source data files toappendedin index metadata.deletedproperty is used to enforce delete during query time. Once index is leveraged. Index records coming from already deleted source data files, listed underdeleted, are excluded from contributing to query results. This is done via PR #175.Currently, this feature is protected under a Spark configuration flag:
spark.hyperspace.index.refresh.source.content.enabledand is disabled by default.Does this PR introduce any user-facing change?
Yes, this PR modifies index metadata and extends
IndexLogEntrystructure by adding a new fieldsdeleted: Seq[String]andappended: Seq[String]under:IndexLogEntry.source.plan.properties.relations.data.propertieswhich captures a list of source data files which are deleted and added to index's source data files.Old experience:
New experience:
Steps 1 - 4 remain the same.
spark.hyperspace.index.refresh.source.content.enabledthen Hyperspace experience remains similar to 5 and 6 above.spark.hyperspace.index.refresh.source.content.enabledand calls refresh then:Impact on
IndexLogEntrycontentExample of IndexLogEntry before this change:
New IndexLogEntry example (after this change):
(
deletedandappendedare added undersource.plan.properties.relations.data.properties).How was this patch tested?
New test cases added under
RefreshIndexTests.scalaandE2EHyperspaceRulesTests.scala.