
Conversation

@sezruby (Collaborator) commented Oct 28, 2020

What is the context for this pull request?

  • Tracking Issue: n/a
  • Parent Issue: n/a
  • Dependencies: n/a

What changes were proposed in this pull request?

This PR merges RefreshAppendAction and RefreshDeleteAction into a single RefreshIncrementalAction, which is required for #198.

Without this change, it is hard to compute the index signature while accounting for deletedFiles and appendedFiles, because the spark.read API does not allow creating a DataFrame from non-existent file paths.
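
To make the difficulty concrete, here is a minimal sketch (hypothetical paths, not code from this PR) of what goes wrong when the original file list contains deleted files:

```scala
// Minimal sketch (hypothetical paths, not code from this PR): once a source
// file is deleted, the original file set can no longer be loaded, so the
// index signature cannot be recomputed over it via spark.read.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// The index was built over both files, but part1 has since been deleted.
val originalFiles = Seq("/data/part0.parquet", "/data/part1.parquet")

// Throws org.apache.spark.sql.AnalysisException ("Path does not exist: ..."),
// so no DataFrame over the original file set can be created.
val df = spark.read.parquet(originalFiles: _*)
```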

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

import com.microsoft.hyperspace.telemetry.{AppInfo, HyperspaceEvent, RefreshIncrementalActionEvent}

/**
* Action to create indexes on newly arrived data. If the user appends new data to existing,
@sezruby (Collaborator, Author) Oct 28, 2020

To reviewers:

  1. This file is copied from RefreshAppendAction.scala and I haven't revised the overall comments yet; I'll update them later.
  2. I haven't removed RefreshAppendAction.scala and RefreshDeleteAction.scala yet, for reference.
    I'll remove them and their events later.

@apoorvedave1 @pirz @imback82
Please leave a quick comment if the approach is not desirable. Thanks!

Contributor

This approach looks good. Let's move forward with this!

@sezruby sezruby added this to the October 2020 milestone Oct 29, 2020
@rapoth (Contributor) commented Oct 29, 2020

> Without this change, it is hard to compute the index signature while accounting for deletedFiles and appendedFiles, because the spark.read API does not allow creating a DataFrame from non-existent file paths.

Could you please add an example that clarifies why it is difficult?

@pirz (Contributor) left a comment

Thanks @sezruby - the overall approach looks fine to me. I left a few comments, mostly questions and clarifications.

"Refresh index is updating index by removing index entries " +
s"corresponding to ${deletedFiles.length} deleted source data files.")

if (appendedFiles.nonEmpty) {
Contributor

Question: In our current implementation of incremental refresh, we first handle deletes and then extend the index by indexing the appended files. Here you have reversed the order; is there a specific reason for that?

If we can first handle deletes and then appends, do we still need to add mode: SaveMode to DataFrameWriterExtensions.saveWithBuckets?

@sezruby (Collaborator, Author)

Good point. Previously, when we tracked only filenames instead of fileInfos, RefreshDeleteAction had to run first: if RefreshAppendAction ran first, we could not tell which rows should be removed during RefreshDeleteAction.

Now the order no longer matters, because we know the list of index data files that excludes the appended data. I changed the order only because the append refresh uses a write function that writes in overwrite mode. (I tried to keep the previous implementation mostly intact.)

Either way, mode: SaveMode is required because there are two data writes, one for "delete" and one for "append".
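
For illustration, a simplified sketch of the two writes (hypothetical paths and column name; the actual implementation goes through saveWithBuckets rather than plain parquet writes):

```scala
// Simplified sketch (hypothetical paths/column name, not the PR's exact code)
// of why the write path needs a SaveMode parameter: there are two writes.
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// 1) Handle deletes: keep only index rows whose source file still exists,
//    replacing the previous index data (Overwrite).
spark.read
  .parquet("/index/v1")                                  // existing index data
  .filter(!col("_data_file_name").isin("/data/part1.parquet"))
  .write
  .mode(SaveMode.Overwrite)
  .parquet("/index/v2")

// 2) Handle appends: index the newly arrived source files and add the result
//    to what was just written (Append).
spark.read
  .parquet("/data/part2.parquet")                        // appended source data
  .write
  .mode(SaveMode.Append)
  .parquet("/index/v2")
```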

@sezruby (Collaborator, Author)

I added some code to use "overwrite" mode in case there are no appended files, to make sure everything is cleared before the write.
Thanks :)

@sezruby sezruby self-assigned this Oct 29, 2020
@sezruby sezruby added the enhancement New feature or request label Oct 29, 2020
@imback82 (Contributor) left a comment

LGTM, thanks @sezruby!

spark.read
.parquet(previousIndexLogEntry.content.files.map(_.toString): _*)
.filter(
!col(s"${IndexConstants.DATA_FILE_NAME_COLUMN}").isin(deletedFiles.map(_.name): _*))
Contributor

Suggested change
- !col(s"${IndexConstants.DATA_FILE_NAME_COLUMN}").isin(deletedFiles.map(_.name): _*))
+ !col(IndexConstants.DATA_FILE_NAME_COLUMN).isin(deletedFiles.map(_.name): _*))
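
(For context: DATA_FILE_NAME_COLUMN is already a String constant, so wrapping it in s"${...}" is a redundant interpolation; passing the constant directly is equivalent and cleaner.)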

  numBuckets: Int,
- bucketByColNames: Seq[String]): Unit = {
+ bucketByColNames: Seq[String],
+ mode: SaveMode = SaveMode.Overwrite): Unit = {
Contributor

Let's not use a default value for the mode and be explicit.
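
A generic Scala sketch (not code from this PR) of why an explicit mode is safer than a default argument:

```scala
// General Scala note (not code from this PR): a default argument lets call
// sites compile without stating intent, which can hide the write semantics.
import org.apache.spark.sql.SaveMode

def save(path: String, mode: SaveMode = SaveMode.Overwrite): Unit =
  println(s"writing $path with mode $mode")

save("/index/v1")                   // silently overwrites
save("/index/v1", SaveMode.Append)  // intent is visible at the call site
```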

  /**
-  * Index Refresh Event for appended source files. Emitted when refresh is called on an index
+  * Index Refresh Event for incremental mode. Emitted when refresh is called on an index
   * with config flag set to create index for appended source data files.
Contributor

This needs to be updated.

private def queryPlanHasExpectedRootPaths(
optimizedPlan: LogicalPlan,
expectedPaths: Seq[Path]): Boolean = {
assert(getAllRootPaths(optimizedPlan) === expectedPaths)
Contributor

Why do we need this check when the caller is already doing an assert? For printing the diff using ===?
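
For reference, a minimal illustration (not PR code) of the difference: ScalaTest's assert with === reports both sides on failure, whereas a bare Boolean returned to the caller's assert gives no detail:

```scala
// Minimal illustration (not PR code): ScalaTest's assert macro with ===
// prints both values on failure, unlike returning a bare Boolean.
import org.scalatest.Assertions._

val actual = Seq("/data/a")
val expected = Seq("/data/b")

assert(actual === expected)
// Fails with: List(/data/a) did not equal List(/data/b)
```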

@imback82 (Contributor) commented
@sezruby I pushed 4f29cda to unblock #234. Feel free to create a separate PR if you don't agree with the changes I pushed. Thanks!

@imback82 imback82 merged commit fa94967 into microsoft:master Oct 29, 2020
@sezruby sezruby deleted the mergeappenddelete branch November 24, 2020 02:47