Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby
Copy link
Collaborator

@sezruby sezruby commented Oct 13, 2020

What is the context for this pull request?

Fixes #182

What changes were proposed in this pull request?

Some changes from #192. (created this new PR as lack of permission to push to the branch)

This PR allows to do refresh "incremental" for modified files (same name, but different modified time or size).

RefreshDeleteAction handles the modified files as deleted files, and also as appended files.
RefreshAppendAction handles the modified files as appended files.

This PR also fixes a bug in RefreshDeleteAction; use the source file list of refreshed df to calculate the deleted files, instead of looking up filesystem directly. 2 different lookup points of "refreshed df" and "deleted files" can cause some correctness issue if a file is deleted between the times.

Does this PR introduce any user-facing change?

Yes, a user can refresh with "incremental" mode for modified files which was not allowed before.

How was this patch tested?

Unit test

@sezruby sezruby added bug Something isn't working enhancement New feature or request labels Oct 13, 2020
@sezruby sezruby added this to the October 2020 milestone Oct 13, 2020
@sezruby sezruby changed the title Get appended and deleted files based on the metadata for incremental refresh mode Support modified source files for refresh incremental mode Oct 13, 2020
@imback82
Copy link
Contributor

imback82 commented Oct 13, 2020

RefreshDeleteAction handles the modified files as deleted files, and also as appended files.

Does it make sense to update appendedFiles / deletedFiles to be Seq[FileInfo] in index log entry for completeness (esp. if someone is reading the index log entry)?

@pirz
Copy link
Contributor

pirz commented Oct 13, 2020

LGTM, Thanks @sezruby

@sezruby
Copy link
Collaborator Author

sezruby commented Oct 13, 2020

RefreshDeleteAction handles the modified files as deleted files, and also as appended files.

Does it make sense to update appendedFiles / deletedFiles to be Seq[FileInfo] in index log entry for completeness (esp. if someone is reading the index log entry)?

RefreshDeleteAction only "write" the appendedFile to index log entry. So that would be ok.

@imback82
Copy link
Contributor

RefreshDeleteAction only "write" the appendedFile to index log entry. So that would be ok.

What is "ok" here?

I know the current code works, but I am talking in terms of the metadata spec.

@sezruby
Copy link
Collaborator Author

sezruby commented Oct 13, 2020

RefreshDeleteAction only "write" the appendedFile to index log entry. So that would be ok.

What is "ok" here?

I know the current code works, but I am talking in terms of the metadata spec.

Oh I misread your comment. Yes Seq[FileInfo] instead of Seq[String] would be clearer and usable later.

val currentFiles = rels.head.rootPaths
.flatMap { p =>
Content
.fromDirectory(path = new Path(p), throwIfNotExists = true)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 BTW this change is required for this release because these lines look up filesystem directly, not from the refresh df. (though the index log entry & source.Content are generated based on the refresh df)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let me check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refreshed df also does the filessystem look up. And since both are doing it on the same rootPaths, wouldn't the result be the same (for non-partitioned data)?

Copy link
Collaborator Author

@sezruby sezruby Oct 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but a file can be deleted between 2 look up points.
It may be less chance but possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Could you update the PR description with this bug fix as well for the record? Thanks!

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few minor comments, but generally looking good to me.

@apoorvedave1 Can you take a look at this one as well?

val currentFiles = rels.head.rootPaths
.flatMap { p =>
Content
.fromDirectory(path = new Path(p), throwIfNotExists = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Could you update the PR description with this bug fix as well for the record? Thanks!


// TODO: Add test for the scenario where existing deletedFiles and newly deleted
// files are updated. https://github.com/microsoft/hyperspace/issues/195.
delFiles ++ previousIndexLogEntry.deletedFiles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we now need to dedupe now that we are considering size, etc. If we move to FileInfo, it will be most likely OK. Same applies appendedFiles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be addressed as a follow-up PR if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #211 if we need to follow up on this.

@imback82
Copy link
Contributor

RefreshDeleteAction only "write" the appendedFile to index log entry. So that would be ok.

What is "ok" here?
I know the current code works, but I am talking in terms of the metadata spec.

Oh I misread your comment. Yes Seq[FileInfo] instead of Seq[String] would be clearer and usable later.

Thanks, created #210.

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (few follow up items), thanks @sezruby!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

bug Something isn't working enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve RefreshIncremental logic to take last modified time of files and file size into account.

4 participants