This repository was archived by the owner on Jun 14, 2024. It is now read-only.


@apoorvedave1 apoorvedave1 commented Sep 14, 2020

Implement incremental indexing support for append-only data.

NOTE TO REVIEWERS

Uber issue: please follow #136 for complete details.

This PR depends on PR #142 (Delete support on source data) and #162 (Merge support for directory objects) for major changes. Until those PRs are merged, this PR will be kept as WIP. It can still be reviewed for a general idea of how the classes will evolve.

Most files from the dependency PRs can be ignored in this PR if the reviewer is already familiar with them. The new, reviewable files with new functionality are:
Functionality: RefreshIncremental.scala
Tests: IndexManagerTests.scala

What this PR does

This feature allows Hyperspace to create indexes on newly arrived data. If the user appends new data to existing, pre-indexed data, they can use the refresh API to generate index data only for the appended data.
This is faster than a full refresh because it processes only the appended data, whereas a full refresh rebuilds the index from scratch over the full dataset.

Algorithm Outline:

  1. identify newly added data files
  2. create new index version on these files
  3. update metadata to reflect the latest snapshot of the index. The latest snapshot describes the complete index, including both the old and the newly created index files. The same is true for the data files.
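
The outline above can be sketched in plain Scala as follows. This is a simplified model for illustration only, not the code in RefreshIncremental.scala: `IndexSnapshot`, `IncrementalRefreshSketch`, and the `buildIndex` callback are hypothetical names, and the sketch assumes append-only data (deleted source files are not handled).

```scala
// Hypothetical, simplified model of an index snapshot; not a Hyperspace class.
final case class IndexSnapshot(
    version: Int,
    sourceFiles: Set[String], // data files covered by the index
    indexFiles: Set[String])  // index files that make up the index

object IncrementalRefreshSketch {
  // Step 1: identify data files that the previous snapshot does not cover.
  def appendedFiles(previous: IndexSnapshot, currentSourceFiles: Set[String]): Set[String] =
    currentSourceFiles.diff(previous.sourceFiles)

  // Steps 2 and 3: index only the appended files, then produce a snapshot
  // that lists old and new files together. `buildIndex` is a stand-in for
  // the real index-writing job; it returns the paths of the files it wrote.
  def refresh(
      previous: IndexSnapshot,
      currentSourceFiles: Set[String],
      buildIndex: (Int, Set[String]) => Set[String]): IndexSnapshot = {
    val appended = appendedFiles(previous, currentSourceFiles)
    if (appended.isEmpty) {
      previous // nothing new to index
    } else {
      val nextVersion = previous.version + 1
      val newIndexFiles = buildIndex(nextVersion, appended)
      IndexSnapshot(
        nextVersion,
        previous.sourceFiles ++ appended,
        previous.indexFiles ++ newIndexFiles)
    }
  }
}
```

Calling `refresh` with a snapshot that covers one data file and a source listing that contains two files would index only the second file and return a snapshot listing both data files and both index files.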

What changes were proposed in this pull request?

A new RefreshIncremental.scala action class, built on the RefreshActionBase class. Reviewers can start from this class to understand what data is being indexed and how the new metadata is generated to reflect the latest state of the index.

Why are the changes needed?

To support incremental indexing on just the unindexed data.

Does this PR introduce any user-facing change?

Yes, this feature introduces support for updating an index after new data is appended, by building index data only for the new data.

How to enable this feature

This feature is currently hidden behind the flag "spark.hyperspace.index.refresh.append.enabled", which defaults to false. It will later be exposed through the API `refreshIndex(indexName, mode="quick")`, along with support for other features (e.g. delete) within that API. Please follow #136 for complete details.
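
A minimal sketch of turning the flag on before a refresh, assuming Spark and Hyperspace are on the classpath. The flag name is the one given above; the application name, the local master setting, and the index name "myIndex" are placeholders, and the sketch assumes the existing single-argument `refreshIndex(indexName)` call is what picks up the flag.

```scala
import org.apache.spark.sql.SparkSession
import com.microsoft.hyperspace.Hyperspace

// Placeholder session settings for a local run.
val spark = SparkSession.builder()
  .appName("incremental-refresh-sketch")
  .master("local[*]")
  .getOrCreate()

// Opt in to incremental refresh for appended data (the flag defaults to false).
spark.conf.set("spark.hyperspace.index.refresh.append.enabled", "true")

// "myIndex" is a placeholder for an index created earlier on the source data.
val hyperspace = new Hyperspace(spark)
hyperspace.refreshIndex("myIndex")
```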

How was this patch tested?

@apoorvedave1 apoorvedave1 self-assigned this Sep 14, 2020
@apoorvedave1 apoorvedave1 changed the title from "[WIP] Incremental Refresh index for append-only dataset" to "[WIP] Add RefreshIncrementalAction to support index creation on newly appended data" Sep 15, 2020
rapoth commented Sep 15, 2020

@apoorvedave1 Can you look at the suggestions I made for #142 regarding the section "Are there any user-facing changes?" and update your PR description as appropriate?

@rapoth rapoth added this to the 0.4.0 milestone Sep 15, 2020
@rapoth rapoth added the advanced issue and enhancement labels Sep 15, 2020
@apoorvedave1 apoorvedave1 marked this pull request as ready for review October 1, 2020 20:51
@apoorvedave1 apoorvedave1 requested review from rapoth and sezruby October 5, 2020 19:14
@apoorvedave1 apoorvedave1 requested a review from sezruby October 6, 2020 18:24
@apoorvedave1 apoorvedave1 requested a review from imback82 October 7, 2020 00:55
@imback82 imback82 left a comment

LGTM, thanks @apoorvedave1!

@sezruby sezruby left a comment

LGTM, thanks @apoorvedave1!

@imback82 imback82 merged commit 52eaf32 into microsoft:master Oct 7, 2020
@apoorvedave1 apoorvedave1 deleted the refreshAppend branch October 7, 2020 06:17

Development

Successfully merging this pull request may close these issues.

Add RefreshIncrementalAction class to support index creation on newly appended data

4 participants