Update index log entry for enforce delete during read time #170

pirz · 2020-09-19T03:19:19Z

What is the context for this pull request?

This PR changes index metadata to capture the list of source data files that were deleted or appended, and updates the index fingerprint accordingly.
The list of deleted files is needed to support enforcing deletes at read time.

Tracking Issue: Issue Add index metadata update in RefreshIndex for quick mode #169
Parent Issue: Issue Add support for deletes in RefreshIndex for quick mode #134

What changes were proposed in this pull request?

This PR updates index metadata once data files are deleted from or appended to an index's source data files.
This is done by:

Extending IndexLogEntry structure to save:
- a list of deleted source data files, called deleted, and
- a list of appended source data files, called appended.
Adding a new refresh action for creating a newer version of an existing index, by updating index metadata as:
- Detect deleted and appended source data files.
- Add deleted source data files to deleted and appended source data files to appended in index metadata.
- Update index fingerprint according to latest source data files. (This is required for Hyperspace to correctly match the index with a query written on the latest source data files.)
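
The three steps above can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: `SourceFilesMetadata`, `SourceFilesRefresh`, and the MD5-over-sorted-paths fingerprint are hypothetical stand-ins and are not taken from the Hyperspace code in this PR.

```scala
import java.security.MessageDigest

// Hypothetical, simplified stand-in for the source-file part of index metadata.
final case class SourceFilesMetadata(
    indexedFiles: Seq[String], // files the index data was built from
    deleted: Seq[String],      // indexed files that no longer exist in the source
    appended: Seq[String],     // source files added after the index was built
    fingerprint: String)

object SourceFilesRefresh {
  // Assumed fingerprint for illustration: MD5 over the sorted file paths.
  def fingerprint(files: Seq[String]): String = {
    val md = MessageDigest.getInstance("MD5")
    files.sorted.foreach(f => md.update(f.getBytes("UTF-8")))
    md.digest().map("%02x".format(_)).mkString
  }

  // Detect deleted/appended files and recompute the fingerprint from the latest files.
  def refresh(meta: SourceFilesMetadata, currentFiles: Seq[String]): SourceFilesMetadata = {
    val indexed = meta.indexedFiles.toSet
    val current = currentFiles.toSet
    meta.copy(
      deleted = meta.indexedFiles.filterNot(current.contains),
      appended = currentFiles.filterNot(indexed.contains),
      fingerprint = fingerprint(currentFiles))
  }
}
```

Note that the index data itself is untouched in this sketch; only the metadata changes, which is what makes the refresh cheap compared to a rebuild.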

The deleted property is used to enforce deletes at query time. Once an index is leveraged, index records coming from already deleted source data files, listed under deleted, are excluded from contributing to query results. This is done via PR #175.
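
As a rough illustration of the exclusion described above (not the optimizer-rule change in PR #175), the filter can be modeled on plain collections. `IndexRecord`, its `sourceFile` column, and `DeleteOnRead` are assumptions made for this sketch; it presumes each index record remembers which source file it came from.

```scala
// Hypothetical index record that carries the source file it was derived from.
final case class IndexRecord(sourceFile: String, key: Int, value: String)

object DeleteOnRead {
  // Keep only records whose source file is not listed under `deleted`.
  def visibleRecords(records: Seq[IndexRecord], deleted: Seq[String]): Seq[IndexRecord] = {
    val deletedSet = deleted.toSet
    records.filterNot(r => deletedSet.contains(r.sourceFile))
  }
}
```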

Currently, this feature is protected under a Spark configuration flag: spark.hyperspace.index.refresh.source.content.enabled and is disabled by default.

Does this PR introduce any user-facing change?

Yes, this PR modifies index metadata and extends the IndexLogEntry structure by adding two new fields, deleted: Seq[String] and appended: Seq[String], under
IndexLogEntry.source.plan.properties.relations.data.properties. They capture the lists of source data files that were deleted from and appended to the index's source data files.

Old experience:

1. User creates an index on some data e.g., "/path/to/dataset/".
2. User enables Hyperspace and issues a query. Hyperspace is able to use the index.
3. User deletes some files from the original data and/or adds some new files under "/path/to/dataset/".
4. User issues a query but Hyperspace detects data change and decides to disable index usage.
5. User invokes refresh to update the index.
6. Hyperspace does a full index rebuild.
7. Now, if a query is issued on latest source data files, Hyperspace can leverage index.

New experience:
Steps 1 - 4 remain the same.

If the user disables spark.hyperspace.index.refresh.source.content.enabled, the Hyperspace experience remains the same as steps 5 and 6 above.
If the user enables spark.hyperspace.index.refresh.source.content.enabled and calls refresh, then:
1. Hyperspace detects the deleted and appended files and computes the index signature according to the latest dataset files. It updates index metadata by adding the deleted and appended files and updating the index signature.
2. The user can now issue queries and Hyperspace will use the index.

Impact on IndexLogEntry content
Example of IndexLogEntry before this change:

{
  "name" : "filterIndex",
  "derivedDataset" : {...},
  "content" : {...},
  "source" : {
    "plan" : {
      "properties" : {
        "relations" : [ {
          "rootPaths" : [ "file:/C:/..." ],
          "data" : {
            "properties" : {
              "content" : {
                "root" : {
                  "name" : "file:/C:/",
                  "files" : [ ],
                  "subDirs" : [ {...} ]
                },
                "fingerprint" : {...}
              }
            },
            "kind" : "HDFS"
          },
          "dataSchemaJson" : "...",
          "fileFormat" : "parquet",
          "options" : { }
        } ],
        "rawPlan" : null,
        "sql" : null,
        "fingerprint" : {...}
      },
      "kind" : "Spark"
    }
  },
  "extra" : { },
  "version" : "0.1",
  "id" : 3,
  "state" : "ACTIVE",
  "timestamp" : ...,
  "enabled" : true
}

New IndexLogEntry example (after this change):
(deleted and appended are added under source.plan.properties.relations.data.properties).

{
  "name" : "filterIndex",
  "derivedDataset" : {...},
  "content" : {...},
  "source" : {
    "plan" : {
      "properties" : {
        "relations" : [ {
          "rootPaths" : [ "file:/C:/..." ],
          "data" : {
            "properties" : {
              "content" : {
                "root" : {
                  "name" : "file:/C:/",
                  "files" : [ ],
                  "subDirs" : [ {...} ]
                },
                "fingerprint" : {...}
              },
              "deleted" : [ "file:/C:/.../part-00000-8fc.parquet" ],
              "appended" : [ "file:/C:/.../part-00000-9ac.parquet" ]
            },
            "kind" : "HDFS"
          },
          "dataSchemaJson" : "...",
          "fileFormat" : "parquet",
          "options" : { }
        } ],
        "rawPlan" : null,
        "sql" : null,
        "fingerprint" : {...}
      },
      "kind" : "Spark"
    }
  },
  "extra" : { },
  "version" : "0.1",
  "id" : 3,
  "state" : "ACTIVE",
  "timestamp" : ...,
  "enabled" : true
}
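
To make the nesting in the example above easier to follow, here is a trimmed-down Scala mirror of the path source.plan.properties.relations.data.properties. All class names below are hypothetical and deliberately do not match the real Hyperspace classes; only the field names deleted and appended and the nesting order come from the JSON.

```scala
// Hypothetical, trimmed-down mirror of the JSON nesting; not the real Hyperspace classes.
final case class DataPropertiesSketch(deleted: Seq[String], appended: Seq[String])
final case class DataSketch(properties: DataPropertiesSketch, kind: String)
final case class RelationSketch(rootPaths: Seq[String], data: DataSketch, fileFormat: String)
final case class PlanPropertiesSketch(relations: Seq[RelationSketch])
final case class PlanSketch(properties: PlanPropertiesSketch, kind: String)
final case class SourceSketch(plan: PlanSketch)
final case class LogEntrySketch(name: String, source: SourceSketch)

object LogEntryPaths {
  // Collect the deleted files across all relations of an entry.
  def deletedFiles(entry: LogEntrySketch): Seq[String] =
    entry.source.plan.properties.relations.flatMap(_.data.properties.deleted)

  // Collect the appended files across all relations of an entry.
  def appendedFiles(entry: LogEntrySketch): Seq[String] =
    entry.source.plan.properties.relations.flatMap(_.data.properties.appended)
}
```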

How was this patch tested?

New test cases added under RefreshIndexTests.scala and E2EHyperspaceRulesTests.scala.

pirz · 2020-09-19T03:28:23Z

src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteActionBase.scala

+import com.microsoft.hyperspace.HyperspaceException
+import com.microsoft.hyperspace.index.{Content, IndexDataManager, IndexLogManager}
+
+private[actions] abstract class RefreshDeleteActionBase(


To Reviewers:
This class is a simple code refactor. validate and deletedFiles defs are copied from RefreshDeleteAction class here as they are shared between RefreshDeleteAction and DeleteOnReadAction classes.
The code is the same as before, except for validate that now has an extra check.

pirz · 2020-09-19T03:29:28Z

src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteAction.scala

  final override protected def event(appInfo: AppInfo, message: String): HyperspaceEvent = {
    RefreshDeleteActionEvent(appInfo, logEntry.asInstanceOf[IndexLogEntry], message)
  }



To Reviewers: validate is moved to (new) class RefreshDeleteActionBase.

Cool, thanks for letting us know.

imback82 · 2020-09-19T04:46:12Z

This PR is marked as WIP and I was requested to review. Usually, WIP means "not ready for review" yet.

I will mark this PR as draft, and please ping me back when you convert this back to a regular PR (click "Ready for review").

pirz · 2020-09-19T20:46:05Z

This PR is marked as WIP and I was requested to review. Usually, WIP means "not ready for review" yet.

I will mark this PR as draft, and please ping me back when you convert this back to a regular PR (click "Ready for review").

@imback82 It is no longer a draft, ready to be reviewed.

rapoth · 2020-09-30T02:22:30Z

@pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens.

rapoth · 2020-09-30T20:12:08Z

@pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens.

@pirz This is the specific comment I was interested in seeing being reflected in the PR. Please refer to your previous PR where we wrote down step by step user actions and how this PR fits in.

pirz · 2020-09-30T23:03:55Z

src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteActionBase.scala

+import com.microsoft.hyperspace.HyperspaceException
+import com.microsoft.hyperspace.index.{Content, IndexDataManager, IndexLogManager}
+
+private[actions] abstract class RefreshDeleteActionBase(


To Reviewers: This class needs a rename, as its children are no longer dealing only with delete-related items. Any suggestions?

pirz · 2020-09-30T23:11:47Z

@pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens.

@pirz This is the specific comment I was interested in seeing being reflected in the PR. Please refer to your previous PR where we wrote down step by step user actions and how this PR fits in.

PR description is now updated accordingly. Please kindly take a look.

rapoth · 2020-10-01T00:01:19Z

@pirz I spent some time thinking about this. What are your thoughts on this?

We change the PR title to say: Update index meta-data only for enabling Smart Refresh and disconnect this PR from the work pertaining to delete (we will still leverage this PR but this PR is more general IMHO). We handle capturing both deleted and appended files in the index meta-data as it's already implemented in this PR.
Rename the classes appropriately to remove the occurrence of the word Delete in any of the Refresh Actions. We should call these along the lines of RefreshMetadataOnlyAction if need be. @imback82 can help propose the correct names.
For your next PR on enforce-delete-on-read, there will be no special explicit flags. Let us aim on reusing the code paths from Hybrid scan (if we do it right, maybe we don't need another PR or we need a small PR to make this happen). Everything will be implicit and will be under mode='smart' or mode='quick'. Therefore, I propose the following:
1. When the user calls refreshIndex(mode='quick'), we update the meta-data accordingly. When the user runs a query, we can then enforce-delete-on-read automatically. It follows that we should get rid of this flag spark.hyperspace.index.refresh.source.content.enabled. If the user did not call this refreshIndex(mode='quick'), we can disable index usage.
2. When the user calls refreshIndex(mode='smart'), we go rewrite portions of the index - basically, the previous PR. If the user did not call this refreshIndex(mode='smart'), we can disable index usage.

Please let me know if this makes sense. To complete the loop, for append, here's what would happen:

When the user calls refreshIndex(mode='quick'), we update the meta-data accordingly AND start an incremental index job which indexes the new data. When the user runs a query, we can reuse the new incrementally indexed data. If the user did not call this refreshIndex(mode='quick'), we can disable index usage.
When the user calls refreshIndex(mode='smart'), we update the meta-data accordingly AND start an incremental index job which indexes the new data + optimize existing small files. If the user did not call this refreshIndex(mode='smart'), we can disable index usage.
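
A rough sketch of how the proposal above could map a refresh mode to work items, written only to make the four cases concrete. `RefreshMode`, `RefreshPlanner`, and the step descriptions are hypothetical; no mode parameter exists in this PR, and the strings merely paraphrase the proposal.

```scala
sealed trait RefreshMode

object RefreshMode {
  case object Quick extends RefreshMode
  case object Smart extends RefreshMode

  // Parse the user-facing mode string; anything else is rejected.
  def parse(mode: String): RefreshMode = mode.toLowerCase match {
    case "quick" => Quick
    case "smart" => Smart
    case other => throw new IllegalArgumentException(s"Unsupported refresh mode: $other")
  }
}

object RefreshPlanner {
  import RefreshMode._

  // Work implied by the proposal for a given mode and kind of source change.
  def steps(mode: RefreshMode, hasDeletes: Boolean, hasAppends: Boolean): Seq[String] = {
    val forDeletes = (mode, hasDeletes) match {
      case (_, false) => Seq.empty[String]
      case (Quick, true) => Seq("update metadata", "enforce delete on read")
      case (Smart, true) => Seq("rewrite affected index data")
    }
    val forAppends = (mode, hasAppends) match {
      case (_, false) => Seq.empty[String]
      case (Quick, true) => Seq("update metadata", "index appended files incrementally")
      case (Smart, true) =>
        Seq("update metadata", "index appended files incrementally", "optimize small index files")
    }
    (forDeletes ++ forAppends).distinct
  }
}
```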

CC: @imback82 @sezruby @apoorvedave1

pirz · 2020-10-01T00:49:44Z

Sure @rapoth - Thanks. Here are some questions:

We change the PR title to say: Update index meta-data only for enabling Smart Refresh and disconnect this PR from the work pertaining to delete

I assume you mean ".. for enabling Quick refresh" as metadata changes are gonna be used in quick mode.

Rename the classes appropriately to remove the occurrence of the word Delete in any of the Refresh Actions. We should call these along the lines of RefreshMetadataOnlyAction if need be. @imback82 can help propose the correct names.

I assume this does not include RefreshDeleteAction (which is already in master), as that is a standalone class and deals fully with delete (will be used in smart mode).

For your next PR on enforce-delete-on-read, there will be no special explicit flags. Let us aim on reusing the code paths from Hybrid scan (if we do it right, maybe we don't need another PR or we need a small PR to make this happen). Everything will be implicit and will be under mode='smart' or mode='quick'. Therefore, I propose the following:

Sure, but can you clarify how that consolidated code path will be used? (Sorry, I am a bit confused for this part).
With no flags, should it rely on mode. I assume mode will not be introduced in this PR, then the next PR will have dependency on both this PR (for metadata changes) and some other PR which introduces mode and API changes.

sezruby · 2020-10-01T00:50:27Z

@rapoth There is no use of "appended" file list in the description.

If we always build the incremental index, it's not necessary to keep the appended file list.

I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
(or 'metadataOnly', 'signature', 'forHybrid'..?)
In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
With this, we could suggest this until we optimize the rank algorithm properly. WDYT??

And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

rapoth · 2020-10-01T01:08:42Z

@rapoth There is no use of "appended" file list in the description.

If we always build the incremental index, it's not necessary to keep the appended file list.

I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
(or 'metadataOnly', 'signature', 'forHybrid'..?)
In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
With this, we could suggest this until we optimize the rank algorithm properly. WDYT??

And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct?

If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like refreshIndex(mode='metadata') - @imback82 / @pirz ?

@sezruby Will this mode='metadata' be in addition to smart and quick?

rapoth · 2020-10-01T01:12:20Z

Sure @rapoth - Thanks. Here are some questions:

We change the PR title to say: Update index meta-data only for enabling Smart Refresh and disconnect this PR from the work pertaining to delete

I assume you mean ".. for enabling Quick refresh" as metadata changes are gonna be used in quick mode.

Yes, sorry my bad.

Rename the classes appropriately to remove the occurrence of the word Delete in any of the Refresh Actions. We should call these along the lines of RefreshMetadataOnlyAction if need be. @imback82 can help propose the correct names.

I assume this does not include RefreshDeleteAction (which is already in master), as that is a standalone class and deals fully with delete (will be used in smart mode).

Yep - let's wait for @imback82 to make the final recommendation on the class names and hierarchies.

For your next PR on enforce-delete-on-read, there will be no special explicit flags. Let us aim on reusing the code paths from Hybrid scan (if we do it right, maybe we don't need another PR or we need a small PR to make this happen). Everything will be implicit and will be under mode='smart' or mode='quick'. Therefore, I propose the following:

Sure, but can you clarify how that consolidated code path will be used? (Sorry, I am a bit confused for this part).

I think you and @sezruby should sync up to see how we can reuse that portion of the hybrid scan code path.

With no flags, should it rely on mode. I assume mode will not be introduced in this PR, then the next PR will have dependency on both this PR (for metadata changes) and some other PR which introduces mode and API changes.

Yes, that's right.

rapoth · 2020-10-01T01:16:34Z

And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Sorry @sezruby, I missed this. @imback82 please consider this when you are thinking of other renames that may be needed.

pirz · 2020-10-01T01:34:57Z

@rapoth There is no use of "appended" file list in the description.
If we always build the incremental index, it's not necessary to keep the appended file list.
I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
(or 'metadataOnly', 'signature', 'forHybrid'..?)
In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
With this, we could suggest this until we optimize the rank algorithm properly. WDYT??
And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct?

If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like refreshIndex(mode='metadata') - @imback82 / @pirz ?

@sezruby Will this mode='metadata' be in addition to smart and quick?

Sorry, but I am a bit confused here. PR description was updated and it was talking about "appended" and "deleted" files, before these comments. Can you please clarify if we are talking about this PR or some other PR?

rapoth · 2020-10-01T01:37:00Z

@rapoth There is no use of "appended" file list in the description.
If we always build the incremental index, it's not necessary to keep the appended file list.
I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
(or 'metadataOnly', 'signature', 'forHybrid'..?)
In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
With this, we could suggest this until we optimize the rank algorithm properly. WDYT??
And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct?
If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like refreshIndex(mode='metadata') - @imback82 / @pirz ?
@sezruby Will this mode='metadata' be in addition to smart and quick?

Sorry, but I am a bit confused here. PR description was updated and it was talking about "appended" and "deleted" files, before these comments. Can you please clarify if we are talking about this PR or some other PR?

Thanks! I'm good from my side. Apparently, I was looking at an alternate version. @sezruby ?

sezruby · 2020-10-01T01:49:26Z

@pirz @rapoth Sorry the description I mentioned is the one in comment #170 (comment)

Yes, 'metadata' is an additional refresh mode to use hybrid scan w/o query time detection of source file list.
It will also allow selective hybrid scan for each index

rapoth · 2020-10-01T02:21:02Z

@pirz @rapoth Sorry the description I mentioned is the one in comment #170 (comment)

Ok, last question. I re-read my comment and I did mention appended files.

We handle capturing both deleted and appended files in the index meta-data as it's already implemented in this PR.

imback82 · 2020-10-01T02:32:31Z

And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Sorry @sezruby, I missed this. @imback82 please consider this when you are thinking of other renames that may be needed.

Sorry for the late response (busy at work 😄). The names sound reasonable, but how about RefreshSourceMetadataAction (RefreshSourceFilesAction sounds like we are actually refreshing source files - whatever that is).

Now that we are introducing metadata only action, I think we may need to revisit the class hierarchy for refresh. For example, making MetadataUpdateActionBase extend RefreshActionBase (which extends CreateActionBase) doesn't seem like a good hierarchy. @pirz Can you give a shot at this? If not, I can take a look on Friday. I think we would know better what the name of MetadataUpdateActionBase would be when we rethink on the hierarchy (it could become MetadataUpdateAction as a trait, for example.)
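
One possible shape of the trait idea mentioned above, sketched with minimal placeholder types. Only the names MetadataUpdateAction and RefreshSourceMetadataAction come from this thread; `EntrySketch`, `ActionSketch`, and every method signature are assumptions, and the real action classes in Hyperspace look different.

```scala
// Hypothetical minimal log entry; the real IndexLogEntry carries far more.
final case class EntrySketch(
    id: Int,
    indexedFiles: Seq[String],
    deleted: Seq[String],
    appended: Seq[String])

// Hypothetical base contract for an action that produces a new log entry.
trait ActionSketch {
  def previous: EntrySketch
  def run(): EntrySketch
}

// Metadata-only behavior as a trait: no index data is written, only a new entry version.
trait MetadataUpdateAction extends ActionSketch {
  protected def updateMetadata(entry: EntrySketch): EntrySketch

  final override def run(): EntrySketch =
    updateMetadata(previous).copy(id = previous.id + 1)
}

final class RefreshSourceMetadataAction(val previous: EntrySketch, currentFiles: Seq[String])
    extends MetadataUpdateAction {
  override protected def updateMetadata(entry: EntrySketch): EntrySketch = {
    val indexed = entry.indexedFiles.toSet
    val current = currentFiles.toSet
    entry.copy(
      deleted = entry.indexedFiles.filterNot(current),
      appended = currentFiles.filterNot(indexed))
  }
}
```

With a trait, the metadata-only action does not have to inherit from the create/rebuild hierarchy, which is the concern raised in this comment.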

rapoth · 2020-10-01T02:37:25Z

And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Sorry @sezruby, I missed this. @imback82 please consider this when you are thinking of other renames that may be needed.

Sorry for the late response (busy at work 😄). The names sound reasonable, but how about RefreshSourceMetadataAction (RefreshSourceFilesAction sounds like we are actually refreshing source files - whatever that is).

Yes, being explicit about Metadata in the Action name is a good call. Thanks!

pirz · 2020-10-27T17:20:35Z

#188 addressed parts of the changes in this PR. New metadata action will be added as a separate PR.

Metadata changes for enforce delete on read

55f146b

pirz requested review from apoorvedave1, imback82, rapoth and sezruby September 19, 2020 03:20

pirz added this to the 0.4.0 milestone Sep 19, 2020

pirz commented Sep 19, 2020

rapoth linked an issue Sep 19, 2020 that may be closed by this pull request

Add index metadata update in RefreshIndex for quick mode #169

Closed

imback82 marked this pull request as draft September 19, 2020 04:46

Pouria Pirzadeh added 2 commits September 19, 2020 12:12

fix IndexLogEntry test

e9e6e33

add signature update to enforce delete on read

777b57e

pirz changed the title ~~[WIP] Update index log entry to enforce delete during read time~~ Update index log entry to enforce delete during read time Sep 19, 2020

pirz changed the title ~~Update index log entry to enforce delete during read time~~ Update index log entry for enforce delete during read time Sep 19, 2020

fix test cleanup

fcc3952

pirz marked this pull request as ready for review September 19, 2020 20:45

rapoth added the advanced issue and enhancement labels Sep 21, 2020

rapoth assigned pirz Sep 21, 2020

Pouria Pirzadeh added 2 commits September 21, 2020 11:06

changes in excluded files update

8a36e60

changes in refresh delete validate

821ccd5

pirz commented Sep 21, 2020

sezruby reviewed Sep 22, 2020

pirz mentioned this pull request Sep 23, 2020

[WIP] Modify optimizer rules to leverage an index with deleted source data file(s) #175

Closed

Pouria Pirzadeh added 2 commits September 24, 2020 10:52

minor code refactor in IndexLogEntry

f27a8a7

add code comments

bf21c8d

sezruby reviewed Sep 25, 2020

add RefreshLogEntryAction

91a136f

Action and config renames

a6f6fc0

pirz commented Sep 30, 2020

pirz mentioned this pull request Oct 2, 2020

RefreshDelete should update appended files list in metadata #179

Closed

This was referenced Oct 7, 2020

RefreshAppend should update appended/deleted files list in metadata #183

Closed

Merge append and delete actions on indexes on modified source data #187

Merged

Add "appended" and "deleted" source files in index metadata #188

Merged

imback82 mentioned this pull request Oct 10, 2020

Disable indexes with non-empty "appended" or "deleted" files if hybrid-scan is disabled #194

Merged

imback82 modified the milestones: 0.4.0, 0.5.0, October 2020, November 2020 Oct 13, 2020

pirz closed this Oct 27, 2020

rapoth removed a link to an issue Oct 29, 2020

Add index metadata update in RefreshIndex for quick mode #169

Closed
