Support modified source files for refresh incremental mode #207
Does it make sense to update
LGTM, Thanks @sezruby
RefreshDeleteAction only "writes" the appendedFile to the index log entry, so that would be OK.
What is "ok" here? I know the current code works, but I am talking in terms of the metadata spec.
Oh, I misread your comment. Yes, Seq[FileInfo] instead of Seq[String] would be clearer and usable later.
    val currentFiles = rels.head.rootPaths
      .flatMap { p =>
        Content
          .fromDirectory(path = new Path(p), throwIfNotExists = true)
@imback82 BTW, this change is required for this release because these lines look up the filesystem directly instead of using the refreshed df (though the index log entry and source.Content are generated based on the refreshed df).
OK, let me check
The refreshed df also does the filesystem lookup. And since both are doing it on the same rootPaths, wouldn't the result be the same (for non-partitioned data)?
Yes, but a file can be deleted between the two lookup points.
It is unlikely, but possible.
Got it. Could you update the PR description with this bug fix as well for the record? Thanks!
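For the record, the race discussed in this thread can be illustrated with a minimal sketch. It is not the Hyperspace code: `SnapshotSketch`, `fromSingleListing`, and the `listFiles` function are hypothetical names, and plain file names stand in for the real source file list. The point is only that the source file list and the deleted files are both derived from one listing, so a file removed between two lookups cannot make them disagree.

```scala
object SnapshotSketch {
  // Returns (sourceFiles, deletedFiles), both computed from a single listing.
  // `listFiles` is a hypothetical stand-in for whatever enumerates the source files.
  def fromSingleListing(
      indexedFiles: Seq[String],
      listFiles: () => Seq[String]): (Seq[String], Seq[String]) = {
    // Exactly one lookup: every later computation reuses this snapshot.
    val listing = listFiles()
    val present = listing.toSet
    // A file is "deleted" if the index knows it but the snapshot does not contain it.
    val deleted = indexedFiles.filterNot(present.contains)
    (listing, deleted)
  }
}
```

With a single snapshot, every indexed file ends up either in the returned source list or in the deleted list, never in neither.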
imback82
left a comment
A few minor comments, but generally looking good to me.
@apoorvedave1 Can you take a look at this one as well?
src/test/scala/com/microsoft/hyperspace/index/RefreshIndexTests.scala
src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala
    // TODO: Add test for the scenario where existing deletedFiles and newly deleted
    // files are updated. https://github.com/microsoft/hyperspace/issues/195.
    delFiles ++ previousIndexLogEntry.deletedFiles
I think we need to dedupe now that we are considering size, etc. If we move to FileInfo, it will most likely be OK. The same applies to appendedFiles.
This can be addressed as a follow-up PR if needed.
Created #211 if we need to follow up on this.
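As a rough illustration of the dedupe concern in this thread, the sketch below merges newly deleted files with previously recorded ones and drops exact duplicates. It is an assumption-laden example, not the change tracked in the follow-up: `FileInfo` here is a simplified, hypothetical case class (the real class may carry different fields), and `MergeSketch.merge` is a made-up helper.

```scala
// Hypothetical, simplified stand-in for a file metadata record.
final case class FileInfo(name: String, size: Long, modifiedTime: Long)

object MergeSketch {
  // Concatenates the two lists and removes exact duplicates, keeping first-seen order.
  // Works for Seq[String] (names only) and Seq[FileInfo] (name + size + modified time).
  def merge[A](newlyDeleted: Seq[A], previouslyDeleted: Seq[A]): Seq[A] =
    (newlyDeleted ++ previouslyDeleted).distinct
}
```

With plain names, a file that is deleted (or modified) across two refreshes collapses into one entry. With the case class, two entries that share a name but differ in size or modified time are kept as separate records, because equality covers all fields.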
Thanks, created #210.
imback82
left a comment
LGTM (a few follow-up items), thanks @sezruby!
What is the context for this pull request?
Fixes #182
What changes were proposed in this pull request?
Includes some changes from #192. (Created this new PR due to lack of permission to push to that branch.)
This PR allows refreshing in "incremental" mode for modified files (same name, but different modified time or size).
RefreshDeleteAction handles the modified files as deleted files, and also as appended files.
RefreshAppendAction handles the modified files as appended files.
This PR also fixes a bug in RefreshDeleteAction: it now uses the source file list of the refreshed df to calculate the deleted files, instead of looking up the filesystem directly. Having two different lookup points for the "refreshed df" and the "deleted files" can cause correctness issues if a file is deleted between the two lookups.
Does this PR introduce any user-facing change?
Yes, a user can now refresh in "incremental" mode for modified files, which was not allowed before.
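To make the handling of modified files concrete, here is a minimal sketch of how such a file can show up as both deleted and appended. It assumes a simplified, hypothetical `FileInfo` case class and a made-up `IncrementalDiff` helper; the actual actions in this PR may compute these lists differently.

```scala
// Hypothetical, simplified stand-in: a file is identified by name, size, and modified time.
final case class FileInfo(name: String, size: Long, modifiedTime: Long)

object IncrementalDiff {
  // Files recorded in the index whose exact metadata no longer exists in the source.
  def deletedFiles(indexed: Seq[FileInfo], current: Seq[FileInfo]): Seq[FileInfo] = {
    val currentSet = current.toSet
    indexed.filterNot(currentSet.contains)
  }

  // Files in the source whose exact metadata is not recorded in the index.
  def appendedFiles(indexed: Seq[FileInfo], current: Seq[FileInfo]): Seq[FileInfo] = {
    val indexedSet = indexed.toSet
    current.filterNot(indexedSet.contains)
  }
}
```

Because equality covers size and modified time, a file that keeps its name but changes either field appears in `deletedFiles` with its old metadata and in `appendedFiles` with its new metadata, which matches the "deleted and also appended" treatment described above.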
How was this patch tested?
Unit test