This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@sezruby (Collaborator) commented Nov 23, 2020

What is the context for this pull request?

What changes were proposed in this pull request?

Refactors #197 and #224 (except for the Hybrid Scan test refactoring), based on #227.

This PR introduces DeltaLakeFileBasedSource to support indexes on Delta Lake sources.

Does this PR introduce any user-facing change?

Yes. Users can create/refresh indexes on Delta Lake tables and also utilize Hybrid Scan.

How was this patch tested?

Unit test

@sezruby (Collaborator, Author) commented Nov 23, 2020

@imback82 Need to add some comments but could you review this PR?
I'll handle Hybrid Scan test refactoring in #224 with another PR.

@imback82 (Contributor) commented:

> @imback82 Need to add some comments but could you review this PR?

Cool! I will take a look.

> I'll handle Hybrid Scan test refactoring in #224 with another PR.

Yes, I believe Hybrid Scan requires new APIs, so let's handle it separately. Thanks!

@imback82 (Contributor) commented:

DeltaLakeFileBasedSource looks good to me. If you are OK with the overall approach in #227, I can merge it and resolve the remaining comments (unit tests) as follow-up PRs.

Btw, can you check the build failure?

@sezruby (Collaborator, Author) commented Nov 23, 2020

> Yes, I believe Hybrid Scan requires new APIs, so let's handle it separately. Thanks!

@imback82 I added an allFiles API for Hybrid Scan, and that seems to be all that Hybrid Scan needs.

BTW, #227 also has the same test failure, and it's not reproducible on my local machine. Let me check further.

@imback82 (Contributor) commented:

#227 has been merged. Thanks!

```scala
    .format(latestRelation.fileFormat)
    .options(latestRelation.options)
    .load(latestRelation.rootPaths: _*)
if (latestRelation.rootPaths.size == 1) {
```
@sezruby (Collaborator, Author): Delta Lake only allows one path in load().

@imback82 (Contributor): Yeah, we can do something like:

```scala
val df = spark.read
  .schema(dataSchema)
  .format(latestRelation.fileFormat)
  .options(latestRelation.options)
// load() with a single parameter needs to be handled differently, due to the
// difference in how the "path" option is set:
// https://github.com/apache/spark/blob/ef1441b56c5cab02335d8d2e4ff95cf7e9c9b9ca/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L197
if (latestRelation.rootPaths.size == 1) {
  df.load(latestRelation.rootPaths.head)
} else {
  df.load(latestRelation.rootPaths: _*)
}
```

@imback82 (Contributor): Btw, what happens if the Delta Lake implementation adds the "path" option in latestRelation.options?


@sezruby changed the title from "[WIP] Pluggable source provider with Delta Lake support" to "Support Delta Lake file-based source provider" on Nov 25, 2020
```scala
import spark.implicits._
val dataPathColumn = "_data_path"
val lineageDF = fileIdTracker.getFileToIdMap.toSeq
val isDeltaLakeSource = DeltaLakeRuleUtils.isDeltaLakeSource(df.queryExecution.optimizedPlan)
```
@imback82 (Contributor): Can we move this to the source provider somehow? Basically, we don't want source-specific implementation outside the provider.

@imback82 (Contributor): Question: why does Delta not require the replace?

@sezruby (Collaborator, Author) commented Nov 25, 2020:

The input_file_name() function of Delta returns "file:/", not "file:///". I think we need to add an assert for this; the join result will be empty if the prefixes don't match.
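To illustrate the mismatch, here is a minimal plain-Scala sketch with hypothetical paths (not Hyperspace code): the same local file can be spelled with either prefix, so raw string equality silently matches nothing, while normalizing through java.net.URI makes the two spellings comparable.

```scala
// Hypothetical example paths: "file:///" is the spelling a FileIndex
// typically reports, while Delta's input_file_name() can return "file:/".
object PathPrefixDemo extends App {
  val fromFileIndex = "file:///tmp/t/part-0.parquet"
  val fromInputFile = "file:/tmp/t/part-0.parquet"

  // Joining on the raw strings fails even though both name the same file:
  println(fromFileIndex == fromInputFile) // false

  // java.net.URI resolves both spellings to the same filesystem path:
  val left = new java.net.URI(fromFileIndex).getPath
  val right = new java.net.URI(fromInputFile).getPath
  println(left == right) // true
}
```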

@imback82 (Contributor): I see. How about we normalize the output of input_file_name by removing "file:/" or "file:///"? (And the same for the getFileToIdMap keys.)

@sezruby (Collaborator, Author): Replacing the filename-to-id map is cheaper than normalizing the file paths of all rows. I'll add an API for this.
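As a rough sketch of that trade-off (plain Scala with hypothetical data, not the actual Hyperspace API): the file-to-id map holds one entry per file, while the lineage column holds one path per row, so rewriting the map keys once touches far less data than normalizing every row.

```scala
// Hypothetical file-to-id map with "file:///"-style keys.
val fileToId: Map[String, Long] = Map(
  "file:///tmp/t/part-0.parquet" -> 0L,
  "file:///tmp/t/part-1.parquet" -> 1L)

// One pass over |files| map entries, instead of a transformation
// applied to |rows| values of input_file_name() output.
val normalized: Map[String, Long] = fileToId.map { case (path, id) =>
  path.replaceFirst("^file:///", "file:/") -> id
}
// The keys now use Delta's "file:/" spelling, so the lineage join matches.
```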

@sezruby sezruby self-assigned this Nov 25, 2020
@sezruby sezruby added this to the November 2020 milestone Nov 25, 2020
```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, In, Literal, Not}
import org.apache.spark.sql.catalyst.optimizer.OptimizeIn
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.delta.files.TahoeLogFileIndex
```

@imback82 (Contributor) left a comment:

Generally, the approach/integration looks good! I will do a detailed review this week.

@pirz @apoorvedave1 Could you take a look as well?

@apoorvedave1 (Contributor): LGTM, thanks @sezruby

@sezruby (Collaborator, Author): Note to reviewers: more Hybrid Scan + Delta Lake test cases are in #274.

@imback82 (Contributor): Is the result comparison tested in #274?

@sezruby (Collaborator, Author): Yes, checkAnswer is used to compare the results in #274.

@pirz (Contributor) left a comment:

LGTM, thanks @sezruby

```diff
   useBucketSpec: Boolean): LogicalPlan = {
-  val isParquetSourceFormat = index.relations.head.fileFormat.equals("parquet")
+  val fileFormat = index.relations.head.fileFormat
+  val isParquetSourceFormat = fileFormat.equals("parquet") || fileFormat.equals("delta")
```
@imback82 (Contributor): Hmm, does this mean adding a new source provider is not enough?

Can we introduce hasParquetAsSourceFormat to the provider API and record this info in the metadata?

@imback82 (Contributor): You can do this as a separate PR if that is preferred. Please create an issue in that case.

@sezruby (Collaborator, Author): I think this case is too specific to add an API to the source provider; it also refers to the file format string in the index metadata, not the relation. So it's better to create the function in IndexLogEntry or some Utils class if needed. WDYT? @imback82

@imback82 (Contributor) commented Dec 8, 2020:

You should be able to plug in a source provider defined externally without changing the Hyperspace codebase. For example, say I have a source format "blah" that uses Parquet internally; how can I plug it in without modifying Hyperspace? One easy way to think about it is whether you could implement the Delta source provider outside Hyperspace.

> I think this case is too specific to create an API to source provider; and it refers fileformat string in index metadata, not relation.

You can do this in the create path and record it in the metadata.
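One possible shape for that hook (a hypothetical sketch with illustrative names, not the actual Hyperspace API): the provider answers the question once, at index-creation time, and the answer is recorded in the index metadata so the transformation rule never inspects format strings directly.

```scala
// Hypothetical provider API; trait and method names are illustrative only.
trait FileBasedSourceProvider {
  // True when the underlying data files are Parquet, even if the
  // logical format string differs (e.g. "delta" stores Parquet files).
  def hasParquetAsSourceFormat: Boolean
}

class DeltaLakeFileBasedSource extends FileBasedSourceProvider {
  override def hasParquetAsSourceFormat: Boolean = true
}

// At index-creation time the flag would be stored in the index metadata,
// so the rule can read it back instead of hard-coding format names:
//   val isParquetSourceFormat = index.relations.head.hasParquetAsSourceFormat
```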

@sezruby (Collaborator, Author): Got your point; it should be possible to add a new source provider externally. Let me handle this with a new PR & issue. Thanks!


@sezruby sezruby closed this Dec 8, 2020
@sezruby sezruby deleted the source_extension branch December 8, 2020 05:45
@sezruby sezruby restored the source_extension branch December 8, 2020 05:47
@sezruby sezruby reopened this Dec 8, 2020
@imback82 (Contributor) left a comment:

LGTM except for one pending comment about introducing hasParquetAsSourceFormat.

@imback82 imback82 merged commit 8fdf28b into microsoft:master Dec 9, 2020
@sezruby sezruby mentioned this pull request Dec 22, 2020
@imback82 imback82 modified the milestones: November 2020, January 2021 Jan 29, 2021
@imback82 imback82 added the enhancement New feature or request label Jan 29, 2021
@sezruby sezruby deleted the source_extension branch April 30, 2021 03:26