-
Notifications
You must be signed in to change notification settings - Fork 116
Add Update structure for appended files and deleted files #235
Conversation
| } | ||
| object LogicalPlanFingerprint { | ||
| case class Properties(signatures: Seq[Signature]) | ||
| case class Properties(signatures: Seq[Signature], planSignature: String) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 @pirz @apoorvedave1
I wonder this is good approach or no. We need to keep planSignature separately for signature calculation at refreshing - though only logical relation plan is allowed for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this seems a bit weird. :) Can you point me where this is being used? I couldn't find.
One way to get around (depending on the use case), we can get rid of IndexSignatureProvider and just use two separate signatures: FileBasedSignatureProvider and PlanSignatureProvider. And we can have a utils class that combines the signature values on the rule path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 Outdated but for use case: https://github.com/sezruby/hyperspace/blob/metaupdate/src/main/scala/com/microsoft/hyperspace/actions/RefreshAppendAction.scala#L100
https://github.com/sezruby/hyperspace/blob/metaupdate/src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala#L70
So do you suggest signatures defined as Seq[Signature] for multiple relations.Seq[Seq[Signature]]?
This is for quick refresh, so no use case for now. Can I create RefreshQuickAction.scala for this w/o exposing API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signaturesdefined asSeq[Signature]for multiple relations. So do you suggestSeq[Seq[Signature]]?
Does it make sense to move fingerprint: LogicalPlanFingerprint from SparkPlan to Relation so that the fingerprint always goes with the corresponding relation. (And we can break up IndexSignatureProvider). WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@apoorvedave1 / @pirz Any suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refresh Quick mode
- collect appendedFiles, deletedFiles, latestSignature
- latestSignature can be calculated by planSignature value + updated file based signature value
- current behavior at refresh (full and incremental):
- use reconstructed
dfto get planSignature
- use reconstructed
Seems we could use the reconstructed df in the same way of the other refresh modes, instead of storing planSignature value separately.
Do you think it is okay to keep the current structure for signature values in IndexLogEntry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this PR w/o planSignature change and added simple RefreshQuickAction impl for use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@imback82 @priz @apoorvedave1 Could you give any quick comment on this?
BTW how about adding stats: Map[String, String] in CoveringIndex.properties that @imback82 mentioned in the previous discussion for hasLineage config? it's just to avoid breaking changes in the next release expecting more statistics / usability changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sezruby for RefreshQuickAction. The change looks good to me. I left a comment whether we can use LogicalPlanFingerprint. If it works out, could you separate out RefreshQuickAction to a separate PR to better track?
BTW how about adding
stats: Map[String, String]in CoveringIndex.properties that @imback82 mentioned in the previous discussion forhasLineageconfig? it's just to avoid breaking changes in the next release expecting more statistics / usability changes.
There may be an option to update ObjectMapper to allow adding new fields while maintaining compatibility? I am also fine with adding one now. Would this be a "stat" or more like "properties"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be an option to update ObjectMapper to allow adding new fields while maintaining compatibility? I am also fine with adding one now.
Seems no problem to add new fields later. Let's handle it later and wrap up this release 👍
Would this be a "stat" or more like "properties"?
I suggested stats as it will be added under CoveringIndex.Properties. How about extra or aux ..?
| } | ||
|
|
||
| /** | ||
| * Updated source data since lst time derived dataset was updated. Note that index data does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * Updated source data since lst time derived dataset was updated. Note that index data does not | |
| * Updated source data since last time derived dataset was updated. Note that index data does not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, Update is used under Hdfs which is a generic representation of file system. We can make a generic comment here and move this comment to Relation to explain data: Hdfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the comment. Thanks!
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
| // FileInfo's 'name' contains the full path to the file. | ||
| def deletedFiles: Set[FileInfo] = { | ||
| sourceUpdate.flatMap(_.deletedFiles).map(_.fileInfos).getOrElse(Set()) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the following be updated or not (can be done as a follow up PR)?
hyperspace/src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala
Lines 108 to 115 in 86174a9
| // index.deletedFiles and index.appendedFiles should be non-empty until Hybrid Scan | |
| // handles the lists properly. Otherwise, as the source file list of each index entry | |
| // (entry.allSourceFileInfo) also contains the appended and deleted files, we cannot | |
| // get the actual appended files and deleted files correctly. | |
| indexes.filter( | |
| index => | |
| index.created && index.deletedFiles.isEmpty && index.appendedFiles.isEmpty && | |
| isHybridScanCandidate(index, filesByRelations.flatten)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's required until
- RefreshQuickAction & refresh mode api
- getCandidateIndex fix
- compare with the latest signature if exists
if so, tag hybrid scan required = true
- transformPlanToUseIndex & hybrid scan fix
- use appendedFiles and deletedFiles if exist
src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala
Outdated
Show resolved
Hide resolved
| * @param latestSignatures Latest signatures including appendedFiles and deletedFiles. | ||
| */ | ||
| case class Update( | ||
| latestSignatures: Seq[Signature], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this field, and can we make this as fingerprint: LogicalPlanFingerprint? This goes against my previous comment about making this Update generic. Let me explain.
Whenever you see "kind" and "properties", that structure is supposed to be a "plug-in" point. For example, the "Source" has "SparkPlan", but it can be replaced with different sources "SqlDW", etc. It's easier to see if you look at the JSON representation. We never got a chance to work on the parser to support these scenarios and resorted to the case class to make reading/writing JSON easy for the time being.
Back to Update, I think the spec owner can decide what to do with the fingerprint for Update. In our case, it will be a LogicalPlanFingerprint. This also aligns with Content having a fingerprint. Does this make sense? We can update the comment as follows:
/**
* Captures any HDFS updates.
*
* @note 'fingerprint' shouldn't be tied to [[LogicalPlanFingerprint]], but it's a free-form
* class, meaning it has "kind" and "properties" so that different classes can be plugged in.
* Thus, 'fingerprint' is still a generic field in terms of the metadata log specification.
*
* @param fingerprint Fingerprint of the update.
* @param appendedFiles Appended files.
* @param deletedFiles Deleted files.
*/
case class Update(
fingerprint: LogicalPlanFingerprint,
appendedFiles: Option[Content] = None,
deletedFiles: Option[Content] = None)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Thanks for the suggestion :)
| * IndexLogEntry-specific Relation that represents the source relation. | ||
| * | ||
| * @param rootPaths List of root paths for the source relation. | ||
| * @param data Source data since last time derived dataset was updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be mentioning the Update field?
How about
@param data Source data for the relation.
Hdfs.properties.content captures source data which derived dataset was created from.
Hdfs.properties.update captures any updates since the derived dataset was created.
Note that Hdfs.properties.update.fingerprint is a fingerprint created from the latest source data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sezruby Writing this comment, Update.fingerprint still feels out of place. :(
What if we update SparkPlan.Properties.fingerprint to always point to the latest fingerprint (instead of having fingerprint inside Update)?
imback82
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @sezruby! Could you update the PR description?
What is the context for this pull request?
This change is also required for #169, refresh quick mode.
Fixes #169
What changes were proposed in this pull request?
For #198, we need to keep the latest signature value leveraging appended and deleted files.
This PR defines
Updateclass for newly updated data.The latest signature value will be replaced with outdated one in RefreshQuickAction.
You can check the use case in #238
Does this PR introduce any user-facing change?
Yes, this is a breakage change, metadata change, so need to rebuild their index.
How was this patch tested?