This repository was archived by the owner on Jun 14, 2024. It is now read-only.


@imback82
Contributor

@imback82 imback82 commented Feb 11, 2021

What is the context for this pull request?

#321 and #320 require support for DataSourceV2Relation, but the current provider was designed to support only LogicalRelation, so adding a new provider that supports a different relation type requires many code changes across actions/rules.

What changes were proposed in this pull request?

This PR introduces one more abstraction to FileBasedSourceProvider: each provider now needs to implement FileBasedRelation. For example, DeltaLakeFileBasedSource needs to implement DeltaRelation, which extends FileBasedRelation, to handle a Delta Lake-specific relation.

By decoupling this, actions/rules do not depend on LogicalRelation directly.
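
As a rough illustration of the proposed layering, the sketch below uses simplified stand-ins rather than the real Spark or Hyperspace types. `Plan`, `ParquetScan`, `DeltaScan`, `countSourceFiles`, and all method signatures are assumptions made for this example; only the names FileBasedRelation, DefaultFileBasedRelation, DeltaRelation, and allFiles come from this PR.

```scala
object RelationSketch {
  // Hypothetical stand-ins for Spark logical plan nodes.
  sealed trait Plan
  final case class ParquetScan(files: Seq[String]) extends Plan
  final case class DeltaScan(snapshotFiles: Seq[String]) extends Plan

  // The abstraction that actions/rules depend on instead of a concrete plan type.
  trait FileBasedRelation {
    def plan: Plan
    def allFiles: Seq[String]
  }

  final class DefaultFileBasedRelation(val plan: ParquetScan) extends FileBasedRelation {
    def allFiles: Seq[String] = plan.files
  }

  final class DeltaRelation(val plan: DeltaScan) extends FileBasedRelation {
    def allFiles: Seq[String] = plan.snapshotFiles
  }

  // Shaped like an action or rule: it never matches on the plan type.
  def countSourceFiles(relation: FileBasedRelation): Int = relation.allFiles.size
}
```

In this shape, supporting another source would mean adding one more `FileBasedRelation` implementation while `countSourceFiles` stays unchanged.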

Does this PR introduce any user-facing change?

No

How was this patch tested?

API refactoring. Existing tests should be enough.

@imback82 imback82 self-assigned this Feb 11, 2021
@imback82 imback82 added the enhancement New feature or request label Feb 11, 2021
@imback82 imback82 added this to the February 2021 (v0.5.0) milestone Feb 11, 2021
@andrei-ionescu
Contributor

@imback82 Just some early feedback...

I've seen that you added the FileBasedRelation trait. A DataSourceV2Relation is not necessarily a file-based relation. I would suggest defining this trait as a more generic relation, not only for file-based ones. As you suggested on my #321 PR, SourceRelation fits better.

@imback82
Contributor Author

We need to have file-related APIs such as allFiles on FileBasedRelation (I thought about having a SupportsFiles trait, but didn't want to complicate the APIs at this stage). Until we have a good use case/example of supporting non-file-based relations, making it explicit seems reasonable (similar to how we store FileBasedSourceProviderManager in the Hyperspace context).

Similar to SourceProvider, I can have a top-level trait SourceRelation, but will still use FileBasedRelation in FileBasedSourceProviderManager.getRelation.
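
A minimal sketch of the two-level hierarchy described in this comment, under stated assumptions: the `name` member, the `Map`-backed manager, and the string key are hypothetical and exist only to make the example self-contained. The point it tries to show is that a generic SourceRelation can sit on top while the manager's signature keeps the narrower, file-aware type.

```scala
object SourceRelationSketch {
  // Top-level trait: it says nothing about files.
  trait SourceRelation {
    def name: String // hypothetical member, for illustration only
  }

  // File-specific APIs live one level down.
  trait FileBasedRelation extends SourceRelation {
    def allFiles: Seq[String]
  }

  // Hypothetical manager: its signature keeps the narrower type, so callers
  // can reach allFiles without a cast.
  final class FileBasedSourceProviderManager(relations: Map[String, FileBasedRelation]) {
    def getRelation(key: String): FileBasedRelation = relations(key)
  }
}
```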

@andrei-ionescu
Contributor

First, regarding Iceberg and DataSourceV2...

The Iceberg source does not use an index (like the InMemoryFileIndex), and the files attached to an Iceberg table are not attached to a relation. To retrieve the files of an Iceberg table you need to use the table file scan API.

See https://github.com/microsoft/hyperspace/pull/320/files#diff-01b72d2de6f62e2696e47b850eb7e40039fa85120d40ef392df1cf7b44aff9f8R246-R259

Second, it does NOT look right to have the DataSourceV2Relation hard-linked to the FileBasedRelation, and I don't know if this refactor makes the code clearer in the case of DataSourceV2 sources like Iceberg.

In the end, if you still consider that this will bring clarity, please complete this PR ASAP and merge it, so that I could modify my PRs to fit these changes.

Thank you.

@imback82
Contributor Author

Second, it does NOT look right to have the DataSourceV2Relation hard-linked to the FileBasedRelation, and I don't know if this refactor makes the code clearer in the case of DataSourceV2 sources like Iceberg.

Why not? I see that you are implementing allFiles. Your provider for Iceberg can just check whether the given DataSourceV2Relation is applicable.

I will get this ready for review by eod today as promised.
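
The "provider checks whether the relation is applicable" idea from this reply might look roughly like the sketch below. It does not use the real Spark or Iceberg classes: `Leaf`, `V1Scan`, `V2Scan`, the `tableType` field, and the `Provider` trait are hypothetical stand-ins. In particular, the files are carried on the node here for simplicity, whereas a real Iceberg relation would obtain them through the table scan API mentioned earlier in the thread.

```scala
object ProviderSketch {
  // Hypothetical stand-ins for plan leaf nodes.
  sealed trait Leaf
  final case class V1Scan(format: String, files: Seq[String]) extends Leaf
  final case class V2Scan(tableType: String, files: Seq[String]) extends Leaf

  trait FileBasedRelation { def allFiles: Seq[String] }

  trait Provider {
    def isSupportedRelation(leaf: Leaf): Boolean
    def getRelation(leaf: Leaf): FileBasedRelation
  }

  object IcebergProvider extends Provider {
    // Accept only V2 scans that this provider recognizes as Iceberg tables.
    def isSupportedRelation(leaf: Leaf): Boolean = leaf match {
      case V2Scan("iceberg", _) => true
      case _ => false
    }

    // How the files are found stays hidden behind FileBasedRelation.
    def getRelation(leaf: Leaf): FileBasedRelation = leaf match {
      case V2Scan("iceberg", files) =>
        new FileBasedRelation { def allFiles: Seq[String] = files }
      case other =>
        throw new IllegalArgumentException(s"Unsupported relation: $other")
    }
  }
}
```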

@imback82 imback82 changed the title from "[WIP] Introduce FileBasedRelation trait to decouple relation types from source provider." to "Introduce" Feb 12, 2021
@imback82 imback82 changed the title from "Introduce" to "Introduce SourceRelation/FileBasedRelation traits to remove direct dependency on LogicalRelation from actions/rules" Feb 12, 2021
}
}

protected def sourceRelations(spark: SparkSession, df: DataFrame): Seq[Relation] =
Contributor Author

This is no longer needed as creating Relation has moved to FileBasedRelation.

(indexDF, resolvedIndexedColumns, resolvedIncludedColumns)
}

private def getPartitionColumns(df: DataFrame): Seq[String] = {
Contributor Author

This is no longer needed as FileBasedRelation supports getting the partition schema.

Hyperspace
.getContext(spark)
.sourceProviderManager
.allFiles(relation)
Contributor Author

allFiles has moved to FileBasedRelation.

RuleUtils.getLogicalRelation(l).isDefined && RuleUtils.getLogicalRelation(r).isDefined &&
isPlanLinear(l) && isPlanLinear(r) && !isPlanModified(l) && !isPlanModified(r) &&
ensureAttributeRequirements(l, r, condition)
isPlanLinear(l) && isPlanLinear(r) &&
Contributor Author

@apoorvedave1 We don't need isPlanLinear if getRelation is implemented as follows (collecting the leaves and making sure there is only one), right?

  def getRelation(spark: SparkSession, plan: LogicalPlan): Option[FileBasedRelation] = {
    val provider = Hyperspace.getContext(spark).sourceProviderManager
    val leaves = plan.collectLeaves()
    if (leaves.size == 1 && provider.isSupportedRelation(leaves.head)) {
      Some(provider.getRelation(leaves.head))
    } else {
      None
    }
  }

Contributor

@apoorvedave1 apoorvedave1 Feb 12, 2021

Yeah, that makes sense; we can eliminate the isPlanLinear check.
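
To see why the single-leaf check can stand in for a linearity check, here is a small sketch on a toy plan tree. `Plan`, `Scan`, `Filter`, `Join`, and `singleLeaf` are hypothetical and are not Spark's LogicalPlan classes. Every subtree has at least one leaf, so in this toy model a tree with exactly one leaf cannot contain a node with two children.

```scala
object LeafCheckSketch {
  // Hypothetical miniature plan tree, not Spark's LogicalPlan.
  sealed trait Plan {
    def children: Seq[Plan]
    // A node without children is itself a leaf.
    def collectLeaves(): Seq[Plan] =
      if (children.isEmpty) Seq(this) else children.flatMap(_.collectLeaves())
  }
  final case class Scan(table: String) extends Plan {
    val children: Seq[Plan] = Nil
  }
  final case class Filter(child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  final case class Join(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // Returns the only leaf of a linear plan, or None when the plan branches.
  def singleLeaf(plan: Plan): Option[Plan] = {
    val leaves = plan.collectLeaves()
    if (leaves.size == 1) Some(leaves.head) else None
  }
}
```

A chain such as `Filter(Filter(Scan("t")))` yields its scan, while `Join(Scan("a"), Scan("b"))` has two leaves and is rejected.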

* @param plan Logical plan.
* @return true if the relation in the plan is modified by Hyperspace.
*/
private def isPlanModified(plan: LogicalPlan): Boolean = {
Contributor Author

Removed because this is directly tested in isApplicable above.

Hyperspace
.getContext(spark)
.sourceProviderManager
.allFiles(rel)
Contributor Author

Again, allFiles has moved to FileBasedRelation (see below).

.partitionBasePath(location)
case l: LeafNode if provider.isSupportedRelation(l) =>
val relation = provider.getRelation(l)
val options = relation.partitionBasePath
Contributor Author

@sezruby Question: below line 482, do we need to create the tag on originalPlan, or can it be just the logical plan for the relation?

  val newLocation = index.withCachedTag(
    originalPlan,
    IndexLogEntryTags.INMEMORYFILEINDEX_HYBRID_SCAN_APPENDED) {
    new InMemoryFileIndex(spark, filesAppended, options, None)
  }

getCandidateIndexes uses the relation plan, so I was curious.

Collaborator

Yes, getCandidateIndexes uses the relation plan, but here I used originalPlan.
It might be better to use the relation plan here as it's more reusable.

Contributor Author

Thanks, we can do that as a follow-up (I wanted to minimize the behavior change in this PR).
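
One way to picture the "more reusable" argument in this thread is a cache keyed by (plan, tag). The sketch below is an assumption-laden model, not Hyperspace's implementation: plans and tags are plain strings, and `TagCache`, its `computations` counter, and this `withCachedTag` signature are hypothetical. Two different original plans that share one relation plan would each miss a cache keyed on the original plan, but would share an entry keyed on the relation plan.

```scala
import scala.collection.mutable

object TagCacheSketch {
  // Hypothetical cache keyed by (plan, tag); plans and tags are strings here.
  final class TagCache[V] {
    private val cache = mutable.Map.empty[(String, String), V]
    // Counts how often the value actually had to be computed.
    var computations = 0

    def withCachedTag(plan: String, tag: String)(compute: => V): V =
      cache.getOrElseUpdate((plan, tag), { computations += 1; compute })
  }
}
```

Keying on `"Filter(Scan(t))"` and `"Project(Scan(t))"` computes the value twice, whereas keying both lookups on `"Scan(t)"` computes it once.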

*/
def refreshRelation(relation: Relation): Relation = {
run(p => p.refreshRelation(relation))
def refreshRelationMetadata(relation: Relation): Relation = {
Contributor Author

We have so many "Relation"s, so I am explicitly calling it out as relation metadata here now.

/**
* Implementation for file-based relation used by [[DefaultFileBasedSource]]
*/
class DefaultFileBasedRelation(spark: SparkSession, override val plan: LogicalRelation)
Contributor Author

Most of the implementation is from DefaultFileBasedSource, with minor changes to the return types: we no longer need to wrap results in Option.
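
A hedged sketch of why the Option wrapper can go away, assuming the old provider methods returned None to signal "not my relation": once support is decided when the relation object is created, its methods can return plain values. `Leaf`, `FileScan`, `Unknown`, `allFilesOld`, and `Relation` are hypothetical names used only for this illustration.

```scala
object OptionSketch {
  sealed trait Leaf
  final case class FileScan(files: Seq[String]) extends Leaf
  case object Unknown extends Leaf

  // Old shape (assumed): every provider call answers "not mine" with None.
  def allFilesOld(leaf: Leaf): Option[Seq[String]] = leaf match {
    case FileScan(files) => Some(files)
    case _ => None
  }

  // New shape: support is decided once, when the relation is created,
  // so the relation's own methods return plain values.
  final class Relation(scan: FileScan) {
    def allFiles: Seq[String] = scan.files
  }

  def getRelation(leaf: Leaf): Option[Relation] = leaf match {
    case s: FileScan => Some(new Relation(s))
    case _ => None
  }
}
```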

formats.toLowerCase(Locale.ROOT).split(",").map(_.trim).toSet
})

/**
Contributor Author

Removed code has moved to DefaultFileBasedRelation.

@imback82
Contributor Author

imback82 commented Feb 12, 2021

@andrei-ionescu @sezruby @apoorvedave1 This is ready for review now (I will update the PR description soon, but I believe you already know the context. 😄 ). Basically, I removed the direct dependency on LogicalRelation from actions/rules code.

@andrei-ionescu Now you can just implement IcebergProvider and IcebergRelation without modifying anything else, hopefully 🤞🏼

imback82 and others added 3 commits February 11, 2021 18:00
This reverts commit 22f017a.
we worked on this together

Co-Authored-By: Andrei Ionescu <webdev.andrei@gmail.com>
@imback82
Contributor Author

@andrei-ionescu I also added you as a co-author in commit 50ed5cd, so when this PR is merged, you will get the co-authorship. Thanks!

@andrei-ionescu
Contributor

@imback82 Thanks for co-authoring!

@sezruby
Collaborator

sezruby commented Feb 12, 2021

The approach looks great. Thanks :)

Contributor

@apoorvedave1 apoorvedave1 left a comment

LGTM, 👍 thanks @imback82

Contributor Author

@imback82 imback82 left a comment

Thanks @andrei-ionescu @sezruby @apoorvedave1 for the review!

I will create a follow-up PR to address a couple of comments.

@imback82 imback82 merged commit 0fef997 into microsoft:master Feb 12, 2021


4 participants