This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Conversation

@imback82
Contributor

@imback82 imback82 commented Oct 25, 2020

What is the context for this pull request?

Introduce pluggable source provider.

What changes were proposed in this pull request?

This PR proposes to introduce pluggable source provider APIs so that new sources such as Delta Lake can be easily plugged in.

Does this PR introduce any user-facing change?

Yes. This PR exposes a new way to plug in a source from which an index can be created.

How was this patch tested?

Existing tests
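The pluggable-provider idea described in this PR can be sketched as a chain of providers, where the first provider that recognizes a source handles it. This is an illustrative, Spark-free sketch under assumed names (`SourceProvider`, `SourceProviderManager`, the path-suffix checks), not Hyperspace's actual API:

```scala
// Hypothetical sketch of a pluggable source-provider API, loosely modeled on
// this PR's idea; all names below are illustrative, not Hyperspace's actual API.

// A provider answers "do I support this source?" for a given path.
trait SourceProvider {
  // Returns Some(signature) if this provider supports the given path,
  // or None so the manager can fall through to the next provider.
  def signature(path: String): Option[String]
}

// Default provider: handles plain file-based paths.
class DefaultFileSource extends SourceProvider {
  def signature(path: String): Option[String] =
    if (path.endsWith(".parquet")) Some(s"parquet:$path") else None
}

// A Delta-Lake-style provider can be plugged in without touching the default.
class DeltaLikeSource extends SourceProvider {
  def signature(path: String): Option[String] =
    if (path.endsWith("/_delta_log")) Some(s"delta:$path") else None
}

// The manager chains providers and uses the first one that claims the source.
class SourceProviderManager(providers: Seq[SourceProvider]) {
  def signature(path: String): String =
    providers.flatMap(_.signature(path)).headOption
      .getOrElse(throw new IllegalArgumentException(s"No provider for $path"))
}

object Demo extends App {
  val mgr = new SourceProviderManager(Seq(new DeltaLikeSource, new DefaultFileSource))
  println(mgr.signature("/data/t.parquet"))    // parquet:/data/t.parquet
  println(mgr.signature("/data/t/_delta_log")) // delta:/data/t/_delta_log
}
```

The design point is that adding a new source is a matter of registering one more `SourceProvider`-like implementation, rather than modifying the default path.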

@imback82 imback82 marked this pull request as draft October 25, 2020 07:27
@imback82 imback82 self-assigned this Oct 25, 2020
@imback82 imback82 added the enhancement New feature or request label Oct 25, 2020
@imback82 imback82 added this to the October 2020 milestone Oct 25, 2020
@imback82
Contributor Author

@sezruby Did you get a chance to implement #197 using this PR?

* @tparam A Type of the initial value.
* @tparam B Type of the final value.
*/
class CacheWithTransform[A, B](val init: () => A, val transform: A => B) {
Contributor Author

I need to add a test for this class.
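Based on the constructor signature shown above, a plausible behavior for `CacheWithTransform[A, B]` is to cache `transform(init())` and recompute only when `init()` yields a new value. The body below is a hedged sketch of that behavior, not the exact Hyperspace implementation:

```scala
// Sketch of one plausible implementation of CacheWithTransform, inferred from
// its signature: cache the result of transform(init()) and recompute only when
// init() returns a different value. Illustrative, not the actual Hyperspace code.
class CacheWithTransform[A, B](val init: () => A, val transform: A => B) {
  private var cachedInit: Option[A] = None
  private var cachedValue: Option[B] = None

  def load(): B = {
    val cur = init()
    if (!cachedInit.contains(cur)) {
      // Initial value changed (or first call): refresh the cached transform.
      cachedInit = Some(cur)
      cachedValue = Some(transform(cur))
    }
    cachedValue.get
  }
}

object CacheDemo extends App {
  var calls = 0
  var source = "a"
  val cache = new CacheWithTransform[String, String](
    () => source,
    s => { calls += 1; s.toUpperCase })
  cache.load(); cache.load()
  println(calls) // 1: second load() hits the cache
  source = "b"
  cache.load()
  println(calls) // 2: changed init value forces a recompute
}
```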

* - The relation is [[HadoopFsRelation]] with [[PartitioningAwareFileIndex]] as file index.
* - Its file format implements [[DataSourceRegister]].
*/
class DefaultFileBasedSource(private val spark: SparkSession) extends FileBasedSourceProvider {
Contributor Author

I need to create a test for this class.

*
* @param spark Spark session.
*/
class FileBasedSourceProviderManager(spark: SparkSession) {
Contributor Author

I need to create a test for this.

spark.sessionState.conf
.getConfString(
"spark.hyperspace.index.sources.fileBasedBuilders",
"com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder")
Collaborator

Why this config is necessary? Can we just create all builders?

Contributor Author

To control which builders are loaded, similar to how plugins, listeners, etc. are loaded in SparkContext. What would be the benefit of creating all available builders? The only benefit I can think of is not having to specify this config, but having tight control seems better to me.
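The config-driven approach discussed here (the `spark.hyperspace.index.sources.fileBasedBuilders` key shown above) amounts to instantiating each fully-qualified class name from a comma-separated list via reflection. A simplified, hedged sketch of that loading step, with an illustrative `BuilderLoader` name:

```scala
// Sketch of loading builders from a comma-separated config value via
// reflection, similar in spirit to how SparkContext loads plugins and
// listeners. The config key in this PR is
// "spark.hyperspace.index.sources.fileBasedBuilders"; this loader is
// simplified and illustrative.
object BuilderLoader {
  // Instantiate each fully-qualified class name via its no-arg constructor.
  def load(confValue: String): Seq[AnyRef] =
    confValue
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { className =>
        Class.forName(className)
          .getDeclaredConstructor()
          .newInstance()
          .asInstanceOf[AnyRef]
      }
      .toSeq
}
```

Only the classes named in the config are ever constructed, which is the "tight control" argument: an operator opts in to each builder explicitly rather than having every builder on the classpath activated.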

@imback82 imback82 changed the title from "[WIP] Pluggable source provider" to "Pluggable source provider" Nov 25, 2020
Contributor Author

@imback82 imback82 left a comment

I had an offline chat with @sezruby and will merge this PR to unblock other PRs related to Delta Lake. I will create issues for follow-up PRs to address the comments in this PR.

cc: @pirz @apoorvedave1 @rapoth
