Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby
Copy link
Collaborator

@sezruby sezruby commented Oct 29, 2020

What is the context for this pull request?

This change is also required for #169, refresh quick mode.
Fixes #169

What changes were proposed in this pull request?

For #198, we need to keep the latest signature value leveraging appended and deleted files.
This PR defines Update class for newly updated data.

case class Update(
  appendedFiles: Option[Content] = None,
  deletedFiles: Option[Content] = None)

The latest signature value will be replaced with outdated one in RefreshQuickAction.
You can check the use case in #238

Does this PR introduce any user-facing change?

Yes, this is a breakage change, metadata change, so need to rebuild their index.

How was this patch tested?

@sezruby sezruby changed the title Add Update structure for appended and deleted files and latest signature. Add Update structure for appended and deleted files and its signature. Oct 29, 2020
@sezruby sezruby changed the title Add Update structure for appended and deleted files and its signature. Add Update structure for appended and deleted files and its signature Oct 29, 2020
@sezruby sezruby added this to the October 2020 milestone Oct 29, 2020
@sezruby sezruby self-assigned this Oct 29, 2020
}
object LogicalPlanFingerprint {
case class Properties(signatures: Seq[Signature])
case class Properties(signatures: Seq[Signature], planSignature: String)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 @pirz @apoorvedave1
I wonder this is good approach or no. We need to keep planSignature separately for signature calculation at refreshing - though only logical relation plan is allowed for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this seems a bit weird. :) Can you point me where this is being used? I couldn't find.

One way to get around (depending on the use case), we can get rid of IndexSignatureProvider and just use two separate signatures: FileBasedSignatureProvider and PlanSignatureProvider. And we can have a utils class that combines the signature values on the rule path.

Copy link
Collaborator Author

@sezruby sezruby Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 Outdated but for use case: https://github.com/sezruby/hyperspace/blob/metaupdate/src/main/scala/com/microsoft/hyperspace/actions/RefreshAppendAction.scala#L100
https://github.com/sezruby/hyperspace/blob/metaupdate/src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala#L70

signatures defined as Seq[Signature] for multiple relations. So do you suggest Seq[Seq[Signature]]?

This is for quick refresh, so no use case for now. Can I create RefreshQuickAction.scala for this w/o exposing API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signatures defined as Seq[Signature] for multiple relations. So do you suggest Seq[Seq[Signature]]?

Does it make sense to move fingerprint: LogicalPlanFingerprint from SparkPlan to Relation so that the fingerprint always goes with the corresponding relation. (And we can break up IndexSignatureProvider). WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@apoorvedave1 / @pirz Any suggestions?

Copy link
Collaborator Author

@sezruby sezruby Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refresh Quick mode

  • collect appendedFiles, deletedFiles, latestSignature
  • latestSignature can be calculated by planSignature value + updated file based signature value
  • current behavior at refresh (full and incremental):
    • use reconstructed df to get planSignature

Seems we could use the reconstructed df in the same way of the other refresh modes, instead of storing planSignature value separately.
Do you think it is okay to keep the current structure for signature values in IndexLogEntry?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this PR w/o planSignature change and added simple RefreshQuickAction impl for use case.

Copy link
Collaborator Author

@sezruby sezruby Oct 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imback82 @priz @apoorvedave1 Could you give any quick comment on this?

BTW how about adding stats: Map[String, String] in CoveringIndex.properties that @imback82 mentioned in the previous discussion for hasLineage config? it's just to avoid breaking changes in the next release expecting more statistics / usability changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sezruby for RefreshQuickAction. The change looks good to me. I left a comment whether we can use LogicalPlanFingerprint. If it works out, could you separate out RefreshQuickAction to a separate PR to better track?

BTW how about adding stats: Map[String, String] in CoveringIndex.properties that @imback82 mentioned in the previous discussion for hasLineage config? it's just to avoid breaking changes in the next release expecting more statistics / usability changes.

There may be an option to update ObjectMapper to allow adding new fields while maintaining compatibility? I am also fine with adding one now. Would this be a "stat" or more like "properties"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be an option to update ObjectMapper to allow adding new fields while maintaining compatibility? I am also fine with adding one now.

Seems no problem to add new fields later. Let's handle it later and wrap up this release 👍

Would this be a "stat" or more like "properties"?

I suggested stats as it will be added under CoveringIndex.Properties. How about extra or aux ..?

@sezruby sezruby changed the title Add Update structure for appended and deleted files and its signature Add Update structure for appended files and deleted files Oct 29, 2020
}

/**
* Updated source data since lst time derived dataset was updated. Note that index data does not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Updated source data since lst time derived dataset was updated. Note that index data does not
* Updated source data since last time derived dataset was updated. Note that index data does not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, Update is used under Hdfs which is a generic representation of file system. We can make a generic comment here and move this comment to Relation to explain data: Hdfs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment. Thanks!

// FileInfo's 'name' contains the full path to the file.
def deletedFiles: Set[FileInfo] = {
sourceUpdate.flatMap(_.deletedFiles).map(_.fileInfos).getOrElse(Set())
}
Copy link
Contributor

@imback82 imback82 Nov 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the following be updated or not (can be done as a follow up PR)?

// index.deletedFiles and index.appendedFiles should be non-empty until Hybrid Scan
// handles the lists properly. Otherwise, as the source file list of each index entry
// (entry.allSourceFileInfo) also contains the appended and deleted files, we cannot
// get the actual appended files and deleted files correctly.
indexes.filter(
index =>
index.created && index.deletedFiles.isEmpty && index.appendedFiles.isEmpty &&
isHybridScanCandidate(index, filesByRelations.flatten))

Copy link
Collaborator Author

@sezruby sezruby Nov 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's required until

  • RefreshQuickAction & refresh mode api
  • getCandidateIndex fix
    • compare with the latest signature if exists
    • if so, tag hybrid scan required = true
  • transformPlanToUseIndex & hybrid scan fix
    • use appendedFiles and deletedFiles if exist

* @param latestSignatures Latest signatures including appendedFiles and deletedFiles.
*/
case class Update(
latestSignatures: Seq[Signature],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this field, and can we make this as fingerprint: LogicalPlanFingerprint? This goes against my previous comment about making this Update generic. Let me explain.

Whenever you see "kind" and "properties", that structure is supposed to be a "plug-in" point. For example, the "Source" has "SparkPlan", but it can be replaced with different sources "SqlDW", etc. It's easier to see if you look at the JSON representation. We never got a chance to work on the parser to support these scenarios and resorted to the case class to make reading/writing JSON easy for the time being.

Back to Update, I think the spec owner can decide what to do with the fingerprint for Update. In our case, it will be a LogicalPlanFingerprint. This also aligns with Content having a fingerprint. Does this make sense? We can update the comment as follows:

/**
 * Captures any HDFS updates.
 *
 * @note 'fingerprint' shouldn't be tied to [[LogicalPlanFingerprint]], but it's a free-form
 *       class, meaning it has "kind" and "properties" so that different classes can be plugged in.
 *       Thus, 'fingerprint' is still a generic field in terms of the metadata log specification.
 *
 * @param fingerprint Fingerprint of the update.
 * @param appendedFiles Appended files.
 * @param deletedFiles Deleted files.
 */
case class Update(
  fingerprint: LogicalPlanFingerprint,
  appendedFiles: Option[Content] = None,
  deletedFiles: Option[Content] = None)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Thanks for the suggestion :)

* IndexLogEntry-specific Relation that represents the source relation.
*
* @param rootPaths List of root paths for the source relation.
* @param data Source data since last time derived dataset was updated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be mentioning the Update field?
How about

@param data Source data for the relation.
            Hdfs.properties.content captures source data which derived dataset was created from.
            Hdfs.properties.update captures any updates since the derived dataset was created. 
            Note that Hdfs.properties.update.fingerprint is a fingerprint created from the latest source data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sezruby Writing this comment, Update.fingerprint still feels out of place. :(

What if we update SparkPlan.Properties.fingerprint to always point to the latest fingerprint (instead of having fingerprint inside Update)?

Copy link
Contributor

@imback82 imback82 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @sezruby! Could you update the PR description?

@imback82 imback82 merged commit 048d875 into microsoft:master Nov 2, 2020
@sezruby sezruby deleted the metaupdate2 branch November 24, 2020 02:47
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add index metadata update in RefreshIndex for quick mode

2 participants