This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Collaborator

@sezruby sezruby commented Oct 12, 2020

What is the context for this pull request?

What changes were proposed in this pull request?

Support Delta Lake relations for "immutable" datasets.
This PR covers index creation, refresh (full mode), and application using FilterIndexRule and JoinIndexRule.

Because Delta Lake already tracks data changes and versions, the version info and path of a Delta Lake relation can be used as the signature in FileBasedSignatureProvider, instead of computing a hash over all source files:

...
case LogicalRelation(
      HadoopFsRelation(location: PartitioningAwareFileIndex, _, _, _, _, _),
      _,
      _,
      _) =>
  fingerprint ++= location.allFiles.foldLeft("")(
    (accumulate: String, fileStatus: FileStatus) =>
      HashingUtils.md5Hex(accumulate + getFingerprint(fileStatus)))
case LogicalRelation(
      HadoopFsRelation(location: TahoeLogFileIndex, _, _, _, _, _),
      _,
      _,
      _) =>
  fingerprint ++= location.tableVersion + location.path.toString
...
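
To make the difference between the two branches concrete, here is a minimal, Spark-free sketch of both signature strategies. It is not the Hyperspace implementation: `SignatureSketch`, `SourceFile`, `fileFingerprint`, `fileBasedSignature`, and `deltaSignature` are hypothetical names, and the per-file fingerprint (length, modification time, path) is an assumption about what `getFingerprint` combines.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object SignatureSketch {
  // Hypothetical stand-in for org.apache.hadoop.fs.FileStatus.
  final case class SourceFile(path: String, length: Long, modificationTime: Long)

  // MD5 of a string, rendered as lowercase hex.
  def md5Hex(s: String): String =
    MessageDigest
      .getInstance("MD5")
      .digest(s.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Assumed per-file fingerprint: length, modification time, and path.
  def fileFingerprint(f: SourceFile): String =
    f.length.toString + f.modificationTime.toString + f.path

  // File-based strategy: chain an MD5 over every source file.
  def fileBasedSignature(files: Seq[SourceFile]): String =
    files.foldLeft("")((acc, f) => md5Hex(acc + fileFingerprint(f)))

  // Delta-style strategy: table version plus table path, no file listing.
  def deltaSignature(tableVersion: Long, tablePath: String): String =
    tableVersion.toString + tablePath
}
```

In this sketch the file-based signature touches every file and depends on the order in which files are listed, while the Delta-style signature is a constant-time string built from metadata that the table log already maintains.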

Does this PR introduce any user-facing change?

Yes, a user can create, refresh, and apply indexes on a Delta Lake relation.

How was this patch tested?

Unit tests.

@sezruby sezruby self-assigned this Oct 12, 2020
@sezruby sezruby added this to the 0.5.0 milestone Oct 12, 2020
@imback82 imback82 modified the milestones: October 2020, November 2020 Oct 13, 2020
@sezruby sezruby added the enhancement New feature or request label Oct 19, 2020
@imback82 imback82 modified the milestones: November 2020, October 2020 Oct 22, 2020
build.sbt Outdated
if (scalaVersion.value == scala211 || sparkVersion.value == "2.4.2") {
"io.delta" %% "delta-core" % "0.6.1"
} else {
"io.delta" %% "delta-core" % "0.7.0"
Contributor

I think we should mark this dependency as "provided", since a few cloud providers ship with the Delta JARs built in.

Contributor

Also, let's not worry about 0.7.0, since it is for Spark 3.0.
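
For illustration, the two suggestions above could combine into a single dependency line like the following. This is only a sketch of the idea, not the change that was eventually merged; it assumes the 0.6.1 version from the snippet under review is kept for all build variants.

```scala
// build.sbt (sketch): Delta Lake stays on the compile and test classpaths,
// but is not pulled in as a runtime dependency of the published artifact.
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1" % "provided"
```

With the "provided" scope, the cluster or cloud runtime is expected to supply the Delta JARs at run time.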

Comment on lines 130 to 131
.filesForScan(Seq(), Seq(), false)
.files
Contributor

Can we use matchingFiles here instead?

Can you add a comment explaining why Seq() is fine for the partition/data filters?

Collaborator Author

matchingFiles seems to be used when there are additional data filters or partition filters, so I don't think we need it for now.

I also added partitionFilter.

dataSchema.json,
fileFormatName,
opts)
case LogicalRelation(
Contributor

Question: Is it possible that the location is TahoeBatchFileIndex?

Collaborator Author

TahoeBatchFileIndex is for a version range and is used internally by the merge/update commands. I think it would be better not to allow index creation with it, to avoid complexity. WDYT? @imback82

@imback82
Contributor

Btw, I am working on the extension model so that you can plug in new source formats. I will have the PR up soon so that we can test it using Delta first.

@rapoth rapoth modified the milestones: October 2020, November 2020 Oct 29, 2020
Contributor

@apoorvedave1 apoorvedave1 left a comment

The approach looks fine to me. Thanks @sezruby

@sezruby
Collaborator Author

sezruby commented Nov 25, 2020

This PR will be delivered by #265.

@sezruby
Collaborator Author

sezruby commented Dec 9, 2020

Closed by #265

@sezruby sezruby closed this Dec 9, 2020
@sezruby sezruby deleted the delta branch April 30, 2021 03:25