This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Contributor

@pirz pirz commented Oct 28, 2020

What is the context for this pull request?

This PR makes two major changes:

  • Adds id generation for each source data file and index file.
  • Changes lineage from the full file path (stored as String) to a numeric value (stored as Long).

For more details on the design and implementation, see the proposal below.

What changes were proposed in this pull request?

Assign unique ids to source data files and index files

When creating or refreshing an index, Hyperspace generates a unique id per file and uses it to track the files.
The combination of (filePath, size, modificationTime) is used as the key that identifies a unique file for id generation. Therefore, if the content of an existing file changes, Hyperspace treats it as a new file and assigns a new id to it.
This id is stored in FileInfo and shows up in the index metadata:

  • The id is for internal-use only. Hyperspace tracks the id values for each index and generates them automatically.
  • For a given index, it is unique per file. If a given file is referred to in the source content of multiple indexes, it could have a different id in each of those indexes.
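
The id assignment described above can be sketched as follows. This is a minimal illustration, not the FileIdTracker added in this PR: the names FileKey, SketchFileIdTracker, addFile, and getFileId are hypothetical, and starting ids at 0 is an assumption.

```scala
import scala.collection.mutable

// Hypothetical key: a file is identified by its path, size, and modification time.
final case class FileKey(path: String, size: Long, modifiedTime: Long)

// Hypothetical tracker for illustration; one instance would exist per index,
// which is why the same file can get different ids in different indexes.
final class SketchFileIdTracker {
  private val ids = mutable.HashMap.empty[FileKey, Long]
  private var maxId: Long = -1L // assumption: ids start at 0

  /** Returns the existing id for this key, or assigns the next unused one. */
  def addFile(path: String, size: Long, modifiedTime: Long): Long = {
    val key = FileKey(path, size, modifiedTime)
    ids.getOrElseUpdate(key, {
      maxId += 1
      maxId
    })
  }

  /** Looks up an id without assigning a new one. */
  def getFileId(path: String, size: Long, modifiedTime: Long): Option[Long] =
    ids.get(FileKey(path, size, modifiedTime))
}
```

Because size and modification time are part of the key, a file rewritten in place at the same path gets a fresh id, while re-adding an unchanged file returns the id it already has.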

Change lineage from String to numeric id values

The data type of the lineage column is changed from String to an integral data type. Instead of storing the full path to the source data file as the lineage value, we store the id associated with that file.

Change create index plan for lineage

Instead of using a UDF to calculate/fill lineage values, the value of lineage is filled by doing a (broadcast) join between the index DF and a lineage DF that maps each source data file to its file id.

Here is an example Create Index plan:

InsertIntoHadoopFsRelationCommand PATH_TO_IX_FOLDER, ..., bucket columns: [ColName], sort columns: [ColName], Parquet, ...
+- RepartitionByExpression [ColName], ...
   +- Project [...]
      +- Join Inner, (_data_path#10656 = _data_path#10652)
         :- Project [...]
         :  +- Relation[..., _data_path]
         +- ResolvedHint (broadcast)
            +- LocalRelation [_data_path#10652, _data_file_id#10653L]
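
The join in the plan above can be pictured with plain Scala collections. This is only a sketch of the idea, not the PR's code: IndexRow, IndexRowWithLineage, and LineageJoinSketch are hypothetical names, and the in-memory Map stands in for the small broadcast LocalRelation of (_data_path, _data_file_id) pairs. In Spark itself, a broadcast hint is typically expressed with org.apache.spark.sql.functions.broadcast on the smaller DataFrame.

```scala
// Hypothetical row shapes, for illustration only.
final case class IndexRow(key: String, dataPath: String)
final case class IndexRowWithLineage(key: String, dataFileId: Long)

object LineageJoinSketch {
  /**
   * Inner join on the data path: each row's path is replaced by the file id
   * found in the small lookup table. Rows whose path has no id are dropped,
   * matching inner-join semantics.
   */
  def attachLineage(
      rows: Seq[IndexRow],
      pathToId: Map[String, Long]): Seq[IndexRowWithLineage] =
    rows.flatMap { r =>
      pathToId.get(r.dataPath).map(id => IndexRowWithLineage(r.key, id))
    }
}
```

The lookup table has one entry per source data file, so it stays small relative to the index data, which is what makes broadcasting it to every task reasonable.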

Performance evaluation

The robustness and performance of the above changes were evaluated on a workload consisting of 37 indexes of various sizes, built on data generated via TPCDS dsdgen for SF=1000GB. Compared to storing the full file path as the lineage value, numeric lineage with the join showed an 18% to 23% performance improvement when creating indexes (depending on the physical layout of the source data files) and 5X less storage bloat.

Does this PR introduce any user-facing change?

Yes, it affects index metadata and adds new fields to it.

How was this patch tested?

Existing test cases are adapted to the change and are used to verify it.

@pirz pirz self-assigned this Oct 28, 2020
@pirz pirz added this to the October 2020 milestone Oct 28, 2020
@pirz pirz added enhancement New feature or request and removed breaking changes labels Oct 28, 2020
Contributor

@imback82 imback82 left a comment

Did one round of review. Curious to see how @sezruby's comment will change.

Contributor Author

pirz commented Oct 31, 2020

Did one round of review. Curious to see how @sezruby's comment will change.

@imback82 I am not sure I get this part. Can you plz explain which comment you are referring to and what do you expect? thnx

@rapoth rapoth added this to the October 2020 milestone Nov 12, 2020
Contributor

@imback82 imback82 left a comment

Overall approach seems fine to me.

// For a given file, file id is only meaningful in the context of a given
// index. At this point, we do not know which index, if any, would be picked.
// Therefore, we simply set the file id to UNKNOWN_FILE_ID.
FileInfo(
Contributor

Not related to this PR, but question: @sezruby can we not use FileInfo here (so that we can remove UNKNOWN_FILE_ID)?

Contributor Author

@imback82 FileInfo now has an id, and the id is only meaningful in the context of a given index. At this point, we don't know which index is going to be picked, if any, so the id is really unknown here.

Contributor

Can't isHybridScanCandidate be updated to take in Seq[FileStatus] instead? It seems we only need to change val commonCnt = inputSourceFiles.count(entry.sourceFileInfoSet.contains) to work with FileStatus. (By the way, we don't need to address this in this PR.)
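
A rough sketch of that suggestion, under stated assumptions: SketchFileStatus and SketchFileInfo are hypothetical stand-ins for Hadoop's FileStatus and Hyperspace's FileInfo, and commonFileCount is an illustrative helper, not an existing method. The point is that matching on (path, size, modification time) needs no file id, so no placeholder such as UNKNOWN_FILE_ID would be required on the input side.

```scala
// Hypothetical stand-ins for the real FileStatus and FileInfo types.
final case class SketchFileStatus(path: String, length: Long, modificationTime: Long)
final case class SketchFileInfo(name: String, size: Long, modifiedTime: Long, id: Long)

object HybridScanSketch {
  /** Counts input files already covered by the index, ignoring file ids. */
  def commonFileCount(
      inputFiles: Seq[SketchFileStatus],
      indexedFiles: Set[SketchFileInfo]): Int = {
    // Project the indexed files down to the id-free identity of each file.
    val indexedKeys = indexedFiles.map(f => (f.name, f.size, f.modifiedTime))
    inputFiles.count(s => indexedKeys.contains((s.path, s.length, s.modificationTime)))
  }
}
```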

Collaborator

Yes, it doesn't necessarily have to be FileInfo, and we can also use the tag feature to keep appended and deleted file info in getCandidateIndexes. I'll create an issue for this. Thanks

Contributor

Thanks!

@imback82
Contributor

@sezruby Any other comments? I plan to do a release on 11/17 PST. Thanks!

Collaborator

@sezruby sezruby left a comment

LGTM Thanks!

Contributor

@imback82 imback82 left a comment

LGTM (a couple minor comments), thanks @pirz (and @sezruby for review)!

/**
* Provides functionality to generate unique file ids for files.
*/
class FileIdTracker {
Contributor

This class has a lot of logic to validate, and I think we need unit tests for it. I will create an issue to follow up.

Contributor

Created #258

@imback82 imback82 merged commit 14d773d into microsoft:master Nov 17, 2020

Labels

enhancement New feature or request

4 participants