Numeric lineage implementation #234

pirz · 2020-10-28T21:11:52Z

What is the context for this pull request?

This PR makes two major changes:

Adds id generation for each source data file and index file.
Changes the lineage value from the file's full path (stored as a String) to a numeric id (stored as a Long).

For more details on the design and implementation, see the proposal below.

Proposal: [PROPOSAL]: Numeric id-based lineage implementation #220
Tracking Issue: Numeric id-based lineage column #200

What changes were proposed in this pull request?

Assign unique ids to source data files and index files

When creating or refreshing an index, Hyperspace generates a unique id per file and uses it to track that file.
The combination of (filePath, size, modificationTime) is the key that identifies a unique file for id generation. Therefore, if the content of an existing file changes, Hyperspace treats it as a new file and assigns it a new id.
This id is stored in FileInfo and shows up in the index metadata.

The id is for internal use only. Hyperspace tracks the id values for each index and generates them automatically.
Within a given index, the id is unique per file. If the same file appears in the source content of multiple indexes, it may have a different id in each of those indexes.
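
To make the id assignment concrete, here is a minimal sketch. It is not the code in this PR: the class name FileIdTrackerSketch, its method names, and the choice to number ids from 0 are assumptions for illustration. Only the (filePath, size, modificationTime) key comes from the description above.

```scala
import scala.collection.mutable

// Hypothetical sketch of per-index id generation; not the Hyperspace FileIdTracker.
final class FileIdTrackerSketch {
  // Key: (filePath, size, modificationTime), as described in the PR text.
  private val ids = mutable.HashMap.empty[(String, Long, Long), Long]

  // Largest id handed out so far; -1 means no id has been assigned yet (assumption).
  private var maxId: Long = -1L

  // Returns the existing id for this key, or assigns the next unused id.
  def addFile(filePath: String, size: Long, modificationTime: Long): Long =
    ids.getOrElseUpdate((filePath, size, modificationTime), { maxId += 1; maxId })

  // Looks up an id without assigning a new one.
  def getFileId(filePath: String, size: Long, modificationTime: Long): Option[Long] =
    ids.get((filePath, size, modificationTime))
}
```

In this sketch, adding the same (filePath, size, modificationTime) twice returns the same id, while a changed size or modification time yields a new id. Using one tracker instance per index would give the per-index uniqueness described above.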

Change lineage from String to numeric id values

The data type of the lineage column changes from String to an integral type. Instead of storing the full path of the source data file as the lineage value, we store the id associated with that file.

Change create index plan for lineage

Instead of using a UDF to compute the lineage values, lineage is filled in by a (broadcast) join between the index DF and a lineage DF that maps each source data file to its file id.

Here is an example Create Index plan:

InsertIntoHadoopFsRelationCommand PATH_TO_IX_FOLDER, ..., bucket columns: [ColName], sort columns: [ColName], Parquet, ...
+- RepartitionByExpression [ColName], ...
   +- Project [...]
      +- Join Inner, (_data_path#10656 = _data_path#10652)
         :- Project [...]
         :  +- Relation[..., _data_path]
         +- ResolvedHint (broadcast)
            +- LocalRelation [_data_path#10652, _data_file_id#10653L]
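
As a rough analogy for what the join in this plan computes, the sketch below uses plain Scala collections instead of Spark. The types SourceRow and IndexRow and the function attachLineage are hypothetical names for illustration. The Map stands in for the small broadcast side (the LocalRelation of path and file id), and the lookup mirrors the inner join on the data path.

```scala
// Hypothetical row before the join: an indexed value plus the path of its source file.
final case class SourceRow(key: String, dataPath: String)

// Hypothetical row after the join: the path is replaced by the numeric file id.
final case class IndexRow(key: String, dataFileId: Long)

object LineageJoinSketch {
  // `fileIds` plays the role of the small lineage table that is broadcast.
  def attachLineage(rows: Seq[SourceRow], fileIds: Map[String, Long]): Seq[IndexRow] =
    rows.flatMap { row =>
      // Inner-join semantics: a row whose path has no id in the table is dropped.
      fileIds.get(row.dataPath).map(id => IndexRow(row.key, id))
    }
}
```

Each output row carries a Long instead of a full path string, which is where the storage saving reported in this PR would come from.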

Performance evaluation

The robustness and performance of the above changes were evaluated on a workload of 37 indexes of various sizes over data generated with TPCDS dsdgen at SF=1000GB. Compared with storing the full file path as the lineage value, numeric lineage with a join showed an 18% to 23% performance improvement when creating indexes (depending on the physical layout of the source data files) and 5X less storage bloat.

Does this PR introduce any user-facing change?

Yes, it affects index metadata and adds new fields to it.

How was this patch tested?

Existing test cases are adapted to the change and are used to verify it.

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala

imback82

Did one round of review. Curious to see how @sezruby's comment will change.

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

pirz · 2020-10-31T00:25:32Z

> Did one round of review. Curious to see how @sezruby's comment will change.

@imback82 I am not sure I get this part. Can you please explain which comment you are referring to and what you expect? Thanks.

src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

src/main/scala/com/microsoft/hyperspace/actions/RefreshIncrementalAction.scala

src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala

src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

src/main/scala/com/microsoft/hyperspace/actions/RefreshIncrementalAction.scala

src/test/scala/com/microsoft/hyperspace/TestUtils.scala

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

imback82

Overall approach seems fine to me.

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

imback82 · 2020-11-16T20:20:27Z

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala

+                // For a given file, file id is only meaningful in the context of a given
+                // index. At this point, we do not know which index, if any, would be picked.
+                // Therefore, we simply set the file id to UNKNOWN_FILE_ID.
+                FileInfo(


Not related to this PR, but a question: @sezruby, can we avoid using FileInfo here (so that we can remove UNKNOWN_FILE_ID)?

@imback82 FileInfo now has an id, and the id is only meaningful in the context of a given index. At this point, we don't know which index, if any, will be picked, so the id really is unknown here.

Can't isHybridScanCandidate be updated to take in Seq[FileStatus] instead? It seems we would only need to change val commonCnt = inputSourceFiles.count(entry.sourceFileInfoSet.contains) to work with FileStatus. (By the way, we don't need to address this in this PR.)

Yes, it doesn't necessarily have to be FileInfo, and we can also use the tag feature to keep appended and deleted file info in getCandidateIndexes. I'll create an issue for this. Thanks
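
A small sketch of why a placeholder id can work for this comparison. It assumes a FileInfo-like type whose equality ignores the id; that assumption, the sentinel value -1, and the names HybridScanSketch and commonFileCount are all illustrative and are not taken from the Hyperspace code.

```scala
object HybridScanSketch {
  // Assumed sentinel value; the thread names the constant but does not show its value.
  val UNKNOWN_FILE_ID: Long = -1L

  // Hypothetical stand-in for FileInfo: equality deliberately ignores `id`.
  final case class FileInfo(name: String, size: Long, modifiedTime: Long, id: Long) {
    override def equals(other: Any): Boolean = other match {
      case that: FileInfo =>
        name == that.name && size == that.size && modifiedTime == that.modifiedTime
      case _ => false
    }
    override def hashCode(): Int = (name, size, modifiedTime).hashCode()
  }

  // Counts query-time source files that the index already covers,
  // in the spirit of the commonCnt expression quoted above.
  def commonFileCount(inputSourceFiles: Seq[FileInfo], sourceFileInfoSet: Set[FileInfo]): Int =
    inputSourceFiles.count(sourceFileInfoSet.contains)
}
```

Under these assumptions, files built at query time with UNKNOWN_FILE_ID still match the indexed entries by name, size, and modification time, which is also why a FileStatus-based comparison, as suggested in the thread, would not need an id at all.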

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala

imback82 · 2020-11-17T02:11:58Z

@sezruby Any other comments? I plan to do a release on 11/17 PST. Thanks!

sezruby

LGTM Thanks!

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

sezruby · 2020-11-17T03:35:40Z

Yes, it doesn't necessarily have to be FileInfo, and we can also use the tag feature to keep appended and deleted file info in getCandidateIndexes. I'll create an issue for this. Thanks

imback82

LGTM (a couple minor comments), thanks @pirz (and @sezruby for review)!

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

imback82 · 2020-11-17T18:51:38Z

src/test/scala/com/microsoft/hyperspace/index/IndexManagerTests.scala

src/test/scala/com/microsoft/hyperspace/index/RefreshIndexTests.scala

imback82 · 2020-11-17T19:08:26Z

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala

+/**
+ * Provides functionality to generate unique file ids for files.
+ */
+class FileIdTracker {


This class has a lot of logic to validate, and I think we need unit tests for it. I will create an issue to follow up.

Created #258

Pouria Pirzadeh added 8 commits October 23, 2020 19:03

Change lineage to Numeric

63c6085

numeric lineage

a456e71

Merge branch 'master' into pouriap/NumericLineage

b716129

Merge branch 'master' into pouriap/NumericLineageWithAppDelFilesAsCon…

9b356c9

…tent

removed debug trace

eae5f9f

Add numerical lineage

1ce8b43

rename lineage column

3b7853c

minor format fix

e688286

pirz requested review from apoorvedave1, imback82 and sezruby October 28, 2020 21:12

pirz self-assigned this Oct 28, 2020

pirz added the breaking changes label Oct 28, 2020

pirz added this to the October 2020 milestone Oct 28, 2020

pirz added the enhancement label and removed the breaking changes label Oct 28, 2020

Pouria Pirzadeh added 2 commits October 28, 2020 15:27

fix test

fe2c69a

fix index manager tests

33ccca8

sezruby reviewed Oct 29, 2020

src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala (outdated, resolved)

imback82 mentioned this pull request Oct 29, 2020

Merge RefreshAppendAction and RefreshDeleteAction #232

Merged

Merge branch 'master' into pouriap/NumericLineageForMerge

e251160

sezruby reviewed Oct 30, 2020

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala (outdated, resolved)

src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala (outdated, resolved)

Merge branch 'master' into pouriap/NumericLineage

20e30f5

imback82 reviewed Oct 30, 2020

src/main/scala/com/microsoft/hyperspace/actions/CreateActionBase.scala (outdated, resolved)

Pouria Pirzadeh added 3 commits October 30, 2020 17:04

Code changes for numeric lineage

4a78f1e

fix whitespace

1c48805

fix style issue

af8933e

sezruby reviewed Oct 31, 2020

src/main/scala/com/microsoft/hyperspace/actions/RefreshActionBase.scala (outdated, resolved)

rapoth added this to the October 2020 milestone Nov 12, 2020

imback82 reviewed Nov 12, 2020

Pouria Pirzadeh added 2 commits November 12, 2020 11:00

Merge branch 'master' into pouriap/NumericLineage

f633408

misc changes

271111a

imback82 reviewed Nov 13, 2020

src/main/scala/com/microsoft/hyperspace/index/IndexLogEntry.scala (outdated, resolved)

Pouria Pirzadeh added 5 commits November 13, 2020 11:36

Merge branch 'master' into pouriap/NumericLineage

d99677b

FileIdTracker changes

ca55c53

change fileIdTracker key and remove MaxFileId from IndexLogEntry

d7ccc04

remove MAX_FILE_ID

4d1d689

Make id a required field in FileInfo

ecded32

sezruby reviewed Nov 14, 2020

imback82 reviewed Nov 16, 2020

misc changes

43c6a11

imback82 reviewed Nov 16, 2020

misc changes

9de7c2f

imback82 reviewed Nov 16, 2020

Pouria Pirzadeh added 2 commits November 16, 2020 17:21

misc fix

bf91963

drop unnecessary assert

1806194

sezruby reviewed Nov 17, 2020

nit fix

073c2ba

imback82 reviewed Nov 17, 2020

imback82 mentioned this pull request Nov 17, 2020

Remove duplicate getFileIdTracker() in tests #257

Closed

imback82 reviewed Nov 17, 2020

imback82 mentioned this pull request Nov 17, 2020

Write unit tests for FileIdTracker #258

Closed

nit fixes

4a9c708

imback82 approved these changes Nov 17, 2020

imback82 merged commit 14d773d into microsoft:master Nov 17, 2020

This was referenced Nov 17, 2020

Numeric id-based lineage column #200

Closed

[PROPOSAL]: Numeric id-based lineage implementation #220

Closed
