Skip to content

Conversation

@cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

Create a new node ResolvedV2Table, and resolve UnresolvedV2Relation to this new node instead of DataSourceV2Relation.

Why are the changes needed?

DataSourceV2Relation is a scan node. It should be used when we want to scan a v2 table. However, the DDL commands do not need to scan a v2 table, they just need a node to hold the v2 table. It's possible that there are rules trying to match DataSourceV2Relation and convert it to something else (e.g. a new DataSourceV2ScanRelation), for better data scan. Unfortunately doing this will break these DDL commands. It's better to have a separated node to hold the v2 table for DDL commands.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

@brkyvz @rdblue

@rdblue
Copy link
Contributor

rdblue commented Oct 11, 2019

DataSourceV2Relation is a scan node.

I disagree. It is a relation. It is converted to a scan node when we convert to a physical plan.

Early pushdown does introduce some risk. I don't think there is a case where early pushdown is introducing errors with DELETE FROM, is there? If so, we should consider how to address that, but I doubt it is by creating another nearly identical relation node.

This also creates two different ways that tables are resolved. One for DDL statements and one for other statements. That's adding unnecessary complexity that will hurt later when we need to decide which resolution path to use. If you want to fix possible problems with early pushdown, then let's find out what those problems are and address them directly.

I don't think there is much utility to making a change like this. It just introduces needless churn. I'm -1.

@rdblue
Copy link
Contributor

rdblue commented Oct 11, 2019

I think the solution to the problem is to change the way AlterTable and DeleteFrom wrap a relation.

The same problem exists for write plans: the table that will be written to is resolved to a relation. That relation should not be replaced or modified by a rule like early push-down, so the relation is not added as a child of the plan. If it is not a child, then rules are not automatically run on it.

We can fix this problem the same way. AlterTable should not list the NamedRelation as a child so that rules do not automatically run on it. In addition, ResolveTables should be updated to resolve the relation when it is UnresolvedV2Relation and replace it with DataSourceV2Relation. Last, AlterTable should only return resolved if the relation is resolved.

Since these problems would be introduced by the DataSourceV2ScanRelation work, I'll add this solution to that PR.

@cloud-fan, does that sound like a good solution?

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/16913/
Test PASSed.

@brkyvz
Copy link
Contributor

brkyvz commented Oct 11, 2019 via email

@cloud-fan
Copy link
Contributor Author

@rdblue yea that works too. I'm closing it and will wait for your early pushdown PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants