Merge RefreshAppendAction and RefreshDeleteAction #232
Conversation
import com.microsoft.hyperspace.telemetry.{AppInfo, HyperspaceEvent, RefreshIncrementalActionEvent}

/**
 * Action to create indexes on newly arrived data. If the user appends new data to existing,
To reviewers:
- This file is copied from RefreshAppendAction.scala and I haven't revised the overall comments yet. I'll update them later.
- I haven't removed RefreshAppendAction.scala and RefreshDeleteAction.scala yet, for reference. I'll remove them and their events later.
@apoorvedave1 @pirz @imback82
Please leave a quick comment if the approach is not desirable. Thanks!
This approach looks good. Let's move forward with this!
Force-pushed from ddd4320 to 1a1db3b
Could you please add an example that clarifies why it is difficult?
pirz left a comment
Thanks @sezruby - the overall approach looks fine to me. I left a few comments, mostly questions/clarifications.
| "Refresh index is updating index by removing index entries " + | ||
| s"corresponding to ${deletedFiles.length} deleted source data files.") | ||
|
|
||
| if (appendedFiles.nonEmpty) { |
Question: In our current implementation of incremental refresh, we first handle deletes and then extend the index by indexing the appended files. Here you have reversed the order. Is there a specific reason for that?
If it is possible to handle deletes first and then append, do we still need to add mode: SaveMode to DataFrameWriterExtensions.saveWithBuckets?
Good point. Previously, with only file names instead of fileInfos, RefreshDeleteAction had to run first, because if we did RefreshAppendAction first we could not distinguish which rows should be removed during RefreshDeleteAction.
Now we don't need to care about the order, because we know the list of index data files that excludes the appended data.
I changed the order only because the append refresh uses a write function that writes the data in overwrite mode (I tried to mostly keep the previous implementation).
Either way, mode: SaveMode is required because there are two data writes, one for "delete" and one for "append".
I added some code to use "overwrite" mode when there are no appended files, to make sure everything is cleared before the write.
Thanks :)
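To make the discussion above concrete, here is a minimal sketch of the two writes as described in this thread; `writeIndex` and `lineageColumn` are hypothetical stand-ins, and the actual RefreshIncrementalAction goes through Hyperspace's bucketed write path, so details may differ:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

// Sketch only: the appended data is written first in Overwrite mode; the rows
// surviving the delete filter are then appended. If nothing was appended, the
// delete result itself is written with Overwrite so the previous index data is
// still replaced.
def refreshIncrementally(
    spark: SparkSession,
    previousIndexFiles: Seq[String],
    deletedFileNames: Seq[String],
    appendedDataDf: => DataFrame,
    hasAppendedFiles: Boolean,
    lineageColumn: String,
    writeIndex: (DataFrame, SaveMode) => Unit): Unit = {
  if (hasAppendedFiles) {
    // First write: index entries built from the newly appended source files.
    writeIndex(appendedDataDf, SaveMode.Overwrite)
  }
  if (deletedFileNames.nonEmpty) {
    // Second write: keep only entries whose lineage column does not point at a
    // deleted source data file.
    val survivingRows = spark.read
      .parquet(previousIndexFiles: _*)
      .filter(!col(lineageColumn).isin(deletedFileNames: _*))
    writeIndex(survivingRows, if (hasAppendedFiles) SaveMode.Append else SaveMode.Overwrite)
  }
}
```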
imback82 left a comment
LGTM, thanks @sezruby!
spark.read
  .parquet(previousIndexLogEntry.content.files.map(_.toString): _*)
  .filter(
    !col(s"${IndexConstants.DATA_FILE_NAME_COLUMN}").isin(deletedFiles.map(_.name): _*))
| !col(s"${IndexConstants.DATA_FILE_NAME_COLUMN}").isin(deletedFiles.map(_.name): _*)) | |
| !col(IndexConstants.DATA_FILE_NAME_COLUMN).isin(deletedFiles.map(_.name): _*)) |
     numBuckets: Int,
-    bucketByColNames: Seq[String]): Unit = {
+    bucketByColNames: Seq[String],
+    mode: SaveMode = SaveMode.Overwrite): Unit = {
Let's not use default value for the mode and be explicit.
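For illustration, a sketch of what the explicit-mode signature and its call sites could look like; the parameters beyond what the diff shows are assumed, and the real saveWithBuckets writes bucketed index data via internal Spark APIs rather than the plain parquet write used here:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object BucketedWriteSketch {
  // No default for `mode`: every caller must spell out whether it overwrites
  // the index contents or appends to them.
  def saveWithBuckets(
      df: DataFrame,
      path: String,
      numBuckets: Int,
      bucketByColNames: Seq[String],
      mode: SaveMode): Unit = {
    // Simplified stand-in for the bucketed write; only the mode handling matters here.
    df.write.mode(mode).parquet(path)
  }
}

// Hypothetical call sites, one per write in the incremental refresh:
// BucketedWriteSketch.saveWithBuckets(appendedRows, indexPath, numBuckets, indexedCols, SaveMode.Overwrite)
// BucketedWriteSketch.saveWithBuckets(survivingRows, indexPath, numBuckets, indexedCols, SaveMode.Append)
```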
  /**
-  * Index Refresh Event for appended source files. Emitted when refresh is called on an index
+  * Index Refresh Event for incremental mode. Emitted when refresh is called on an index
   * with config flag set to create index for appended source data files.
This needs to be updated.
private def queryPlanHasExpectedRootPaths(
    optimizedPlan: LogicalPlan,
    expectedPaths: Seq[Path]): Boolean = {
  assert(getAllRootPaths(optimizedPlan) === expectedPaths)
Why do we need this check when the caller is doing assert? For printing diff using ===?
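For context on the === question, a small ScalaTest sketch (names and values are made up): keeping the assert with === inside the helper yields a descriptive "did not equal" failure message, whereas returning only a Boolean leaves the outer assert reporting just that the condition was false.

```scala
import org.scalatest.funsuite.AnyFunSuite

class RootPathsSketchTest extends AnyFunSuite {
  // Hypothetical helper mirroring the pattern in the diff: the inner assert
  // prints both sequences when they differ, which the caller's plain
  // assert(helper(...)) could not do on its own.
  private def hasExpectedRootPaths(actual: Seq[String], expected: Seq[String]): Boolean = {
    assert(actual === expected)
    true
  }

  test("root paths match the expected locations") {
    assert(hasExpectedRootPaths(Seq("/data/part-0"), Seq("/data/part-0")))
  }
}
```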
What is the context for this pull request?
What changes were proposed in this pull request?
This PR merges RefreshAppendAction and RefreshDeleteAction as RefreshIncremental, which is required for #198. Without this, it is hard to calculate the index signature value considering deletedFiles and appendedFiles, as the spark.read API doesn't allow creating a DataFrame with non-existing file paths.
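As a small illustration of that constraint (the path below is hypothetical), Spark validates source paths when the DataFrame is created, so deleted files cannot simply be read back:

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("path-check").getOrCreate()
try {
  // Fails at DataFrame creation time with "Path does not exist: ..."
  spark.read.parquet("/tmp/hyperspace-demo/deleted-part-00000.parquet")
} catch {
  case e: AnalysisException => println(s"Read failed as expected: ${e.getMessage}")
}
```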
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test