This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Contributor

@pirz pirz commented Sep 19, 2020

What is the context for this pull request?

This PR changes the index metadata to capture the lists of source data files that were deleted or appended, and updates the index fingerprint accordingly.
The list of deleted files is needed to support enforcing deletes at read time.

What changes were proposed in this pull request?

This PR adds changes for updating index metadata once some data files are deleted from or appended to an index's source data files.
This is done by:

  1. Extending IndexLogEntry structure to save:
    • a list of deleted source data files, called deleted, and
    • a list of appended source data files, called appended.
  2. Adding a new refresh action that creates a newer version of an existing index by updating its metadata as follows:
    • Detect deleted and appended source data files.
    • Add deleted source data files to deleted and appended source data files to appended in the index metadata.
    • Update the index fingerprint according to the latest source data files. (This is required for Hyperspace to correctly match the index with a query written on the latest source data files.)
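
The detection step can be sketched with plain collections. This is a minimal illustration, not the PR's implementation: SourceFileDiff, FileChanges, and detect are hypothetical names, and the sketch compares file paths only, so a file rewritten in place under the same path would not be noticed.

```scala
// Hypothetical helper, not part of Hyperspace: diffs two file listings by path.
object SourceFileDiff {
  final case class FileChanges(deleted: Seq[String], appended: Seq[String])

  // `indexed` holds the files recorded when the index was last built or refreshed;
  // `current` holds the files found under the source root now.
  def detect(indexed: Seq[String], current: Seq[String]): FileChanges = {
    val indexedSet = indexed.toSet
    val currentSet = current.toSet
    FileChanges(
      // Recorded before but no longer present.
      deleted = indexed.filterNot(currentSet.contains).distinct,
      // Present now but never recorded.
      appended = current.filterNot(indexedSet.contains).distinct)
  }
}
```

The two resulting lists correspond to what would be stored under deleted and appended in the index metadata.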

The deleted property is used to enforce deletes at query time: once an index is leveraged, index records coming from already deleted source data files, listed under deleted, are excluded from contributing to query results. This is done via PR #175.
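
As a rough sketch of that exclusion, assuming each index record carries the path of the source file it came from (the IndexRecord shape and excludeDeleted are hypothetical and only illustrate the idea, not the code in PR #175):

```scala
// Hypothetical illustration of delete enforcement at read time.
object DeleteOnRead {
  // Assumed shape: every index record remembers its source data file.
  final case class IndexRecord(sourceFile: String, key: Int, value: String)

  // Drops records whose source file is listed under `deleted` in the index metadata.
  def excludeDeleted(records: Seq[IndexRecord], deleted: Seq[String]): Seq[IndexRecord] = {
    val deletedSet = deleted.toSet
    records.filterNot(r => deletedSet.contains(r.sourceFile))
  }
}
```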

Currently, this feature is protected under a Spark configuration flag: spark.hyperspace.index.refresh.source.content.enabled and is disabled by default.
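
For reference, a flag like this would typically be set on the Spark session before calling refresh. The fragment below is a sketch that needs Spark on the classpath; the flag name is the one proposed in this PR, while the object name, app name, and local master are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object EnableRefreshSourceContent {
  // Flag name as proposed in this PR; it stays off unless set explicitly.
  val FlagKey = "spark.hyperspace.index.refresh.source.content.enabled"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("refresh-flag-sketch")
      .getOrCreate()

    // Opt in for this session.
    spark.conf.set(FlagKey, "true")
    // Read it back, using "false" as the fallback to mirror the disabled-by-default behavior.
    println(spark.conf.get(FlagKey, "false"))

    spark.stop()
  }
}
```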

Does this PR introduce any user-facing change?

Yes, this PR modifies index metadata and extends the IndexLogEntry structure by adding two new fields, deleted: Seq[String] and appended: Seq[String], under:
IndexLogEntry.source.plan.properties.relations.data.properties. They capture the lists of source data files that were deleted from and appended to the index's source data files.

Old experience:

  1. User creates an index on some data e.g., "/path/to/dataset/".
  2. User enables Hyperspace and issues a query. Hyperspace is able to use the index.
  3. User deletes some files from the original data and/or adds some new files under "/path/to/dataset/".
  4. User issues a query but Hyperspace detects data change and decides to disable index usage.
  5. User invokes refresh to update the index.
  6. Hyperspace does a full index rebuild.
  7. Now, if a query is issued on latest source data files, Hyperspace can leverage index.

New experience:
Steps 1 - 4 remain the same.

  • If the user leaves spark.hyperspace.index.refresh.source.content.enabled disabled, the Hyperspace experience remains the same as steps 5 and 6 above.
  • If the user enables spark.hyperspace.index.refresh.source.content.enabled and calls refresh, then:
    1. Hyperspace detects the deleted and appended files and computes the index signature according to the latest dataset files. It updates the index metadata by adding the deleted and appended files and updating the index signature.
    2. The user can now issue queries and Hyperspace will use the index.
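
To illustrate why the signature has to be recomputed, here is a minimal sketch of a fingerprint over the current source files. It assumes the fingerprint is an MD5 digest over each file's path, size, and modification time; FileInfo, compute, and matches are hypothetical names, and the attributes Hyperspace actually hashes may differ.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical fingerprint sketch, not Hyperspace's actual signature provider.
object SourceFingerprint {
  final case class FileInfo(path: String, size: Long, modifiedTime: Long)

  // Folds every file's attributes, in path order, into one MD5 hex digest.
  def compute(files: Seq[FileInfo]): String = {
    val md = MessageDigest.getInstance("MD5")
    files.sortBy(_.path).foreach { file =>
      val line = s"${file.path}|${file.size}|${file.modifiedTime}\n"
      md.update(line.getBytes(StandardCharsets.UTF_8))
    }
    md.digest().map("%02x".format(_)).mkString
  }

  // An index is only a candidate when its stored fingerprint matches the current files.
  def matches(storedFingerprint: String, currentFiles: Seq[FileInfo]): Boolean =
    storedFingerprint == compute(currentFiles)
}
```

Under this sketch, deleting or appending a file changes the digest, so the stored value stops matching until refresh writes the new one.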

Impact on IndexLogEntry content
Example of IndexLogEntry before this change:

{
  "name" : "filterIndex",
  "derivedDataset" : {...},
  "content" : {...},
  "source" : {
    "plan" : {
      "properties" : {
        "relations" : [ {
          "rootPaths" : [ "file:/C:/..." ],
          "data" : {
            "properties" : {
              "content" : {
                "root" : {
                  "name" : "file:/C:/",
                  "files" : [ ],
                  "subDirs" : [ {...} ]
                },
                "fingerprint" : {...}
              }
            },
            "kind" : "HDFS"
          },
          "dataSchemaJson" : "...",
          "fileFormat" : "parquet",
          "options" : { }
        } ],
        "rawPlan" : null,
        "sql" : null,
        "fingerprint" : {...}
      },
      "kind" : "Spark"
    }
  },
  "extra" : { },
  "version" : "0.1",
  "id" : 3,
  "state" : "ACTIVE",
  "timestamp" : ...,
  "enabled" : true
}

New IndexLogEntry example (after this change):
(deleted and appended are added under source.plan.properties.relations.data.properties).

{
  "name" : "filterIndex",
  "derivedDataset" : {...},
  "content" : {...},
  "source" : {
    "plan" : {
      "properties" : {
        "relations" : [ {
          "rootPaths" : [ "file:/C:/..." ],
          "data" : {
            "properties" : {
              "content" : {
                "root" : {
                  "name" : "file:/C:/",
                  "files" : [ ],
                  "subDirs" : [ {...} ]
                },
                "fingerprint" : {...}
              },
              "deleted" : [ "file:/C:/.../part-00000-8fc.parquet" ],
              "appended" : [ "file:/C:/.../part-00000-9ac.parquet" ]
            },
            "kind" : "HDFS"
          },
          "dataSchemaJson" : "...",
          "fileFormat" : "parquet",
          "options" : { }
        } ],
        "rawPlan" : null,
        "sql" : null,
        "fingerprint" : {...}
      },
      "kind" : "Spark"
    }
  },
  "extra" : { },
  "version" : "0.1",
  "id" : 3,
  "state" : "ACTIVE",
  "timestamp" : ...,
  "enabled" : true
}
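
In Scala terms, the extended data.properties node might look roughly like the case class below. This is a simplified sketch with hypothetical names (DataProperties, withChanges), and content is reduced to a placeholder string. Defaulting the new fields to empty lists is an assumption made here so that entries written before this change would still fit the same shape.

```scala
// Hypothetical, simplified mirror of source.plan.properties.relations[i].data.properties.
object IndexLogEntrySketch {
  final case class DataProperties(
      content: String, // placeholder for the nested content and fingerprint structure
      deleted: Seq[String] = Nil, // source files removed since the index was built
      appended: Seq[String] = Nil) // source files added since the index was built

  // Returns a copy of the properties with the detected changes recorded.
  def withChanges(
      props: DataProperties,
      deleted: Seq[String],
      appended: Seq[String]): DataProperties =
    props.copy(deleted = deleted, appended = appended)
}
```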

How was this patch tested?

New test cases added under RefreshIndexTests.scala and E2EHyperspaceRulesTests.scala.

@pirz pirz added this to the 0.4.0 milestone Sep 19, 2020
import com.microsoft.hyperspace.HyperspaceException
import com.microsoft.hyperspace.index.{Content, IndexDataManager, IndexLogManager}

private[actions] abstract class RefreshDeleteActionBase(
Contributor Author

@pirz pirz Sep 19, 2020

To Reviewers:
This class is a simple code refactor. validate and deletedFiles defs are copied from RefreshDeleteAction class here as they are shared between RefreshDeleteAction and DeleteOnReadAction classes.
The code is the same as before, except for validate that now has an extra check.

final override protected def event(appInfo: AppInfo, message: String): HyperspaceEvent = {
  RefreshDeleteActionEvent(appInfo, logEntry.asInstanceOf[IndexLogEntry], message)
}

Contributor Author

@pirz pirz Sep 19, 2020

To Reviewers: validate is moved to (new) class RefreshDeleteActionBase.

Contributor

Cool, thanks for letting us know.

@rapoth rapoth linked an issue Sep 19, 2020 that may be closed by this pull request
Contributor

imback82 commented Sep 19, 2020

This PR is marked as WIP and I was requested to review. Usually, WIP means "not ready for review" yet.

I will mark this PR as draft, and please ping me back when you convert this back to a regular PR (click "Ready for review").

@imback82 imback82 marked this pull request as draft September 19, 2020 04:46
@pirz pirz changed the title from "[WIP] Update index log entry to enforce delete during read time" to "Update index log entry to enforce delete during read time" Sep 19, 2020
@pirz pirz changed the title from "Update index log entry to enforce delete during read time" to "Update index log entry for enforce delete during read time" Sep 19, 2020
@pirz pirz marked this pull request as ready for review September 19, 2020 20:45
Contributor Author

pirz commented Sep 19, 2020

> This PR is marked as WIP and I was requested to review. Usually, WIP means "not ready for review" yet.

> I will mark this PR as draft, and please ping me back when you convert this back to a regular PR (click "Ready for review").

@imback82 It is no longer a draft, ready to be reviewed.

@rapoth rapoth added the advanced issue and enhancement labels Sep 21, 2020
Contributor

rapoth commented Sep 30, 2020

@pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens.

Contributor

rapoth commented Sep 30, 2020

> @pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens.

@pirz This is the specific comment I was interested in seeing being reflected in the PR. Please refer to your previous PR where we wrote down step by step user actions and how this PR fits in.

import com.microsoft.hyperspace.HyperspaceException
import com.microsoft.hyperspace.index.{Content, IndexDataManager, IndexLogManager}

private[actions] abstract class RefreshDeleteActionBase(
Contributor Author

To Reviewers: This class needs a rename, as its children no longer deal only with delete-related items. Any suggestions?

Contributor Author

pirz commented Sep 30, 2020

>> @pirz In your PR description, it'd be nice to also capture the details surrounding the user experience that this PR is enabling? For instance, user enables this flag and this is what happens.

> @pirz This is the specific comment I was interested in seeing being reflected in the PR. Please refer to your previous PR where we wrote down step by step user actions and how this PR fits in.

PR description is now updated accordingly. Please kindly take a look.

Contributor

rapoth commented Oct 1, 2020

@pirz I spent some time thinking about this. What are your thoughts on this?

  1. We change the PR title to say: Update index meta-data only for enabling Smart Refresh and disconnect this PR from the work pertaining to delete (we will still leverage this PR but this PR is more general IMHO). We handle capturing both deleted and appended files in the index meta-data as it's already implemented in this PR.
  2. Rename the classes appropriately to remove the occurrence of the word Delete in any of the Refresh Actions. We should call these along the lines of RefreshMetadataOnlyAction if need be. @imback82 can help propose the correct names.
  3. For your next PR on enforce-delete-on-read, there will be no special explicit flags. Let us aim on reusing the code paths from Hybrid scan (if we do it right, maybe we don't need another PR or we need a small PR to make this happen). Everything will be implicit and will be under mode='smart' or mode='quick'. Therefore, I propose the following:
    1. When the user calls refreshIndex(mode='quick'), we update the meta-data accordingly. When the user runs a query, we can then enforce-delete-on-read automatically. It follows that we should get rid of this flag spark.hyperspace.index.refresh.source.content.enabled. If the user did not call this refreshIndex(mode='quick'), we can disable index usage.
    2. When the user calls refreshIndex(mode='smart'), we go rewrite portions of the index - basically, the previous PR. If the user did not call this refreshIndex(mode='smart'), we can disable index usage.

Please let me know if this makes sense. To complete the loop, for append, here's what would happen:

  1. When the user calls refreshIndex(mode='quick'), we update the meta-data accordingly AND start an incremental index job which indexes the new data. When the user runs a query, we can reuse the new incrementally indexed data. If the user did not call this refreshIndex(mode='quick'), we can disable index usage.
  2. When the user calls refreshIndex(mode='smart'), we update the meta-data accordingly AND start an incremental index job which indexes the new data + optimize existing small files. If the user did not call this refreshIndex(mode='smart'), we can disable index usage.
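
The quick and smart modes proposed above could be modeled roughly as follows. This is only a sketch: RefreshMode, parse, and plannedSteps are hypothetical names, no mode parameter exists in this PR, and the step descriptions simply restate the proposal.

```scala
// Hypothetical model of the proposed refresh modes; not an existing Hyperspace API.
object RefreshModes {
  sealed trait RefreshMode
  case object Quick extends RefreshMode
  case object Smart extends RefreshMode

  // Maps the mode string from refreshIndex(mode = ...) to a known mode, if any.
  def parse(mode: String): Option[RefreshMode] = mode.toLowerCase match {
    case "quick" => Some(Quick)
    case "smart" => Some(Smart)
    case _ => None
  }

  // Work each mode would trigger, as described in the proposal.
  def plannedSteps(mode: RefreshMode): Seq[String] = mode match {
    case Quick =>
      Seq(
        "update index metadata",
        "enforce delete on read at query time",
        "incrementally index appended data")
    case Smart =>
      Seq(
        "update index metadata",
        "rewrite index portions affected by deletes",
        "incrementally index appended data",
        "optimize existing small files")
  }
}
```

Modeling the modes as a sealed trait lets the compiler flag any refresh path that forgets to handle a newly added mode.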

CC: @imback82 @sezruby @apoorvedave1

Contributor Author

pirz commented Oct 1, 2020

Sure @rapoth - Thanks. Here are some questions:

> We change the PR title to say: Update index meta-data only for enabling Smart Refresh and disconnect this PR from the work pertaining to delete

I assume you mean ".. for enabling Quick refresh" as metadata changes are gonna be used in quick mode.

> Rename the classes appropriately to remove the occurrence of the word Delete in any of the Refresh Actions. We should call these along the lines of RefreshMetadataOnlyAction if need be. @imback82 can help propose the correct names.

I assume this does not include RefreshDeleteAction (which is already in master), as that is a standalone class and deals fully with delete (will be used in smart mode).

> For your next PR on enforce-delete-on-read, there will be no special explicit flags. Let us aim on reusing the code paths from Hybrid scan (if we do it right, maybe we don't need another PR or we need a small PR to make this happen). Everything will be implicit and will be under mode='smart' or mode='quick'. Therefore, I propose the following:

Sure, but can you clarify how that consolidated code path will be used? (Sorry, I am a bit confused for this part).
With no flags, should it rely on mode. I assume mode will not be introduced in this PR, then the next PR will have dependency on both this PR (for metadata changes) and some other PR which introduces mode and API changes.

Collaborator

sezruby commented Oct 1, 2020

@rapoth There is no use of "appended" file list in the description.

If we always build the incremental index, it's not necessary to keep the appended file list.

I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
(or 'metadataOnly', 'signature', 'forHybrid'..?)
In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
With this, we could suggest this until we optimize the rank algorithm properly. WDYT??

And for naming, I prefer
"RefreshSourceFilesAction"
and
"MetadataUpdateActionBase"

Contributor

rapoth commented Oct 1, 2020

> @rapoth There is no use of "appended" file list in the description.

> If we always build the incremental index, it's not necessary to keep the appended file list.

> I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
> (or 'metadataOnly', 'signature', 'forHybrid'..?)
> In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
> With this, we could suggest this until we optimize the rank algorithm properly. WDYT??

> And for naming, I prefer
> "RefreshSourceFilesAction"
> and
> "MetadataUpdateActionBase"

Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct?

If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like refreshIndex(mode='metadata') - @imback82 / @pirz ?

@sezruby Will this mode='metadata' be in addition to smart and quick?

Contributor

rapoth commented Oct 1, 2020

> Sure @rapoth - Thanks. Here are some questions:

>> • We change the PR title to say: Update index meta-data only for enabling Smart Refresh and disconnect this PR from the work pertaining to delete

> I assume you mean ".. for enabling Quick refresh" as metadata changes are gonna be used in quick mode.

Yes, sorry my bad.

>> • Rename the classes appropriately to remove the occurrence of the word Delete in any of the Refresh Actions. We should call these along the lines of RefreshMetadataOnlyAction if need be. @imback82 can help propose the correct names.

> I assume this does not include RefreshDeleteAction (which is already in master), as that is a standalone class and deals fully with delete (will be used in smart mode).

Yep - let's wait for @imback82 to make the final recommendation on the class names and hierarchies.

>> • For your next PR on enforce-delete-on-read, there will be no special explicit flags. Let us aim on reusing the code paths from Hybrid scan (if we do it right, maybe we don't need another PR or we need a small PR to make this happen). Everything will be implicit and will be under mode='smart' or mode='quick'. Therefore, I propose the following:

> Sure, but can you clarify how that consolidated code path will be used? (Sorry, I am a bit confused for this part).

I think you and @sezruby should sync up to see how we can reuse that portion of the hybrid scan code path.

> With no flags, should it rely on mode. I assume mode will not be introduced in this PR, then the next PR will have dependency on both this PR (for metadata changes) and some other PR which introduces mode and API changes.

Yes, that's right.

Contributor

rapoth commented Oct 1, 2020

> And for naming, I prefer
> "RefreshSourceFilesAction"
> and
> "MetadataUpdateActionBase"

Sorry @sezruby, I missed this. @imback82 please consider this when you are thinking of other renames that may be needed.

Contributor Author

pirz commented Oct 1, 2020

>> @rapoth There is no use of "appended" file list in the description.
>> If we always build the incremental index, it's not necessary to keep the appended file list.
>> I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
>> (or 'metadataOnly', 'signature', 'forHybrid'..?)
>> In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
>> With this, we could suggest this until we optimize the rank algorithm properly. WDYT??
>> And for naming, I prefer
>> "RefreshSourceFilesAction"
>> and
>> "MetadataUpdateActionBase"

> Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct?

> If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like refreshIndex(mode='metadata') - @imback82 / @pirz ?

> @sezruby Will this mode='metadata' be in addition to smart and quick?

Sorry, but I am a bit confused here. PR description was updated and it was talking about "appended" and "deleted" files, before these comments. Can you please clarify if we are talking about this PR or some other PR?

Contributor

rapoth commented Oct 1, 2020

>>> @rapoth There is no use of "appended" file list in the description.
>>> If we always build the incremental index, it's not necessary to keep the appended file list.
>>> I think we could have another option (for better hybrid scan) to use both list - refreshIndex(mode='metadata')
>>> (or 'metadataOnly', 'signature', 'forHybrid'..?)
>>> In this way, the user can clearly see which index could be reused at query time. (in case hybrid scan config is disabled)
>>> With this, we could suggest this until we optimize the rank algorithm properly. WDYT??
>>> And for naming, I prefer
>>> "RefreshSourceFilesAction"
>>> and
>>> "MetadataUpdateActionBase"

>> Yes, I requested @pirz to update the description appropriately since the code is out-of-sync. Long story short, appended files is not going to be useful for incremental index for appends. It will only be useful for hybrid scan - is this correct?
>> If the list of appended files goes in as part of this PR, can Hybrid scan utilize it? I like refreshIndex(mode='metadata') - @imback82 / @pirz ?
>> @sezruby Will this mode='metadata' be in addition to smart and quick?

> Sorry, but I am a bit confused here. PR description was updated and it was talking about "appended" and "deleted" files, before these comments. Can you please clarify if we are talking about this PR or some other PR?

Thanks! I'm good from my side. Apparently, I was looking at an alternate version. @sezruby ?

Collaborator

sezruby commented Oct 1, 2020

@pirz @rapoth Sorry the description I mentioned is the one in comment #170 (comment)

Yes, 'metadata' is an additional refresh mode to use hybrid scan w/o query time detection of source file list.
It will also allow selective hybrid scan for each index

Contributor

rapoth commented Oct 1, 2020

> @pirz @rapoth Sorry the description I mentioned is the one in comment #170 (comment)

Ok, last question. I re-read my comment and I did mention appended files.

> We handle capturing both deleted and appended files in the index meta-data as it's already implemented in this PR.

Contributor

imback82 commented Oct 1, 2020

>> And for naming, I prefer
>> "RefreshSourceFilesAction"
>> and
>> "MetadataUpdateActionBase"

> Sorry @sezruby, I missed this. @imback82 please consider this when you are thinking of other renames that may be needed.

Sorry for the late response (busy at work 😄). The names sound reasonable, but how about RefreshSourceMetadataAction (RefreshSourceFilesAction sounds like we are actually refreshing source files - whatever that is).

Now that we are introducing a metadata-only action, I think we may need to revisit the class hierarchy for refresh. For example, making MetadataUpdateActionBase extend RefreshActionBase (which extends CreateActionBase) doesn't seem like a good hierarchy. @pirz Can you give this a shot? If not, I can take a look on Friday. I think we would know better what the name of MetadataUpdateActionBase would be when we rethink the hierarchy (it could become MetadataUpdateAction as a trait, for example).
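
One way to read the trait suggestion is sketched below. Only the class and trait names come from this thread; every member and body is a stand-in invented for illustration, and the real RefreshActionBase takes constructor arguments that are omitted here.

```scala
// Hypothetical hierarchy sketch; members and bodies are placeholders.
object RefreshHierarchySketch {
  // Stand-in for the existing base class whose subclasses rewrite index data.
  abstract class RefreshActionBase {
    def name: String
    def rewritesIndexData: Boolean = true
  }

  // Metadata-only behavior as a trait, so it does not inherit from RefreshActionBase.
  trait MetadataUpdateAction {
    def name: String
    final def rewritesIndexData: Boolean = false
  }

  final class RefreshDeleteAction extends RefreshActionBase {
    override def name: String = "RefreshDeleteAction"
  }

  final class RefreshSourceMetadataAction extends MetadataUpdateAction {
    override def name: String = "RefreshSourceMetadataAction"
  }
}
```

Keeping the metadata-only action outside the RefreshActionBase chain avoids inheriting index-building logic that it would never use.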

Contributor

rapoth commented Oct 1, 2020

>>> And for naming, I prefer
>>> "RefreshSourceFilesAction"
>>> and
>>> "MetadataUpdateActionBase"

>> Sorry @sezruby, I missed this. @imback82 please consider this when you are thinking of other renames that may be needed.

> Sorry for the late response (busy at work 😄). The names sound reasonable, but how about RefreshSourceMetadataAction (RefreshSourceFilesAction sounds like we are actually refreshing source files - whatever that is).

Yes, being explicit about Metadata in the Action name is a good call. Thanks!

Contributor Author

pirz commented Oct 27, 2020

#188 addressed parts of the changes in this PR. New metadata action will be added as a separate PR.

