Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@apoorvedave1
Copy link
Contributor

@apoorvedave1 apoorvedave1 commented Oct 7, 2020

NOTES TO REVIEWERS:

  1. Index metadata changes from another PR Update index log entry for enforce delete during read time #170 by @pirz have been handpicked without which correctness cannot be guaranteed for this PR.
    Changes include:

What is the context for this pull request?

Hyperspace currently handles append support and delete support on modifiable source data separately. Users can either allow deletion of some source files and refresh their index, or they can allow adding files to data and refresh the index.

This PR merges these operations which will enable users to have a fully modifyable data source. Users can delete some of the indexed source files and also append some new source files. A refreshIndex(indexName, mode = "incremental") operation will allow users to update their indexes to accommodate for all these changes.

What changes were proposed in this pull request?

1

Merge the behavior of RefreshDeleteAction and RefreshAppendAction in a single user facing api refreshIndex(indexName, mode = "incremental"). When the user calls this api, hyperspace should update the index by removing records from all the deleted source files, and adding records from the newly added source files.

2 PR #188 MERGED

IndexLogEntry will support storing "appended" and "deleted" files (these changes were picked from #170 for correctness).

3 PR #191 MERGED

Update RefreshDeleteAction and RefreshAppendAction to also update the "appended" and "deleted" files list in index metadata.

4 PR #189 MERGED

If no updates to data source are found, it will NOT throw an exception anymore. This means:

  • If no deleted files are found, log a warning message and exit the RefreshDeleteAction without throwing exception.
  • If no appended files are found, log a warning message and exit the RefreshAppendAction without throwing exception.

5 PR #194 MERGED

Bug fix: if 'appended' or 'deleted' files are present in index metadata and hybrid scan is disabled, do NOT use those indexes. Otherwise this will lead to incorrect results.

Does this PR introduce any user-facing change?

Yes.

  • This PR introduces refreshIndex(indexName, mode = "incremental") for the users. Users can choose this over mode="full" for a faster refresh of index files.
  • index metadata files now contain "appended" and "deleted" list of files from original source data. These files are those for which index has not been updated.

How was this patch tested?

added unit tests where we delete some files from original indexed data, followed by appending new data to original data. A refresh(indexName, mode = "incremental") will update the indexes such that newly created index files will reflect the most recent version of the data.

Pouria Pirzadeh and others added 30 commits August 25, 2020 22:07
Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>
RefreshIncremental.scala is for the same purpose.
# Conflicts:
#	src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
#	src/test/scala/com/microsoft/hyperspace/index/IndexLogEntryTest.scala
# Conflicts:
#	src/main/scala/com/microsoft/hyperspace/actions/RefreshAction.scala
#	src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
#	src/main/scala/com/microsoft/hyperspace/actions/RefreshDeleteAction.scala
#	src/main/scala/com/microsoft/hyperspace/index/IndexCollectionManager.scala
#	src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala
#	src/test/scala/com/microsoft/hyperspace/index/E2EHyperspaceRulesTests.scala
…s.scala

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>
…s.scala

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>
…s.scala

Co-authored-by: Rahul Potharaju <rapoth@microsoft.com>
Copy link
Collaborator

@sezruby sezruby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't forget to handle this in 0.4: #187 (comment)

@apoorvedave1
Copy link
Contributor Author

Please don't forget to handle this in 0.4: #187 (comment)

Thanks for the reminders. I prefer not to take this in this PR

@apoorvedave1
Copy link
Contributor Author

apoorvedave1 commented Oct 10, 2020

Please don't forget to handle this in 0.4: #187 (comment)

Thanks for the reminders. I prefer not to take this in this PR

Could you please create an issue?
Nvm. I just realized this is already known. I created #196 for tracking and added lineage requirement to the issue but please check:
https://github.com/microsoft/hyperspace/blob/master/src/main/scala/com/microsoft/hyperspace/actions/RefreshAction.scala#L40-L42
and
https://github.com/microsoft/hyperspace/blob/master/src/main/scala/com/microsoft/hyperspace/actions/RefreshAppendAction.scala#L44-L48

}

test("Verify Join Index Rule utilizes indexes correctly after incremental refresh.") {
test("Verify JoinIndexRule utilizes indexes correctly after incremental refresh (append-only).") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: this is a preexsisting test with only a REFRESH_APPEND_ENABLED flag removed

@apoorvedave1 apoorvedave1 requested a review from sezruby October 10, 2020 20:24
@imback82
Copy link
Contributor

Can you or @thrajput also update the Python binding?

@AFFogarty can help on the C# side once 0.3 is released. (I will ping you @AFFogarty 😄 when it's ready)

@apoorvedave1 apoorvedave1 requested a review from imback82 October 12, 2020 19:38
Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @apoorvedave1!

I will wait others to approve as well before merging.

@imback82
Copy link
Contributor

Merging this as no responses from @sezruby and @pirz. Feel free to make comments if you have any.

@imback82 imback82 merged commit 5f9a3e5 into microsoft:master Oct 13, 2020
@apoorvedave1 apoorvedave1 deleted the merge branch March 16, 2021 22:40
@apoorvedave1 apoorvedave1 restored the merge branch March 16, 2021 22:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

advanced issue This is the tag for advanced issues which involve major design changes or introduction enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RefreshDelete should update appended files list in metadata

3 participants