
Conversation

@cloud-fan (Contributor) commented Nov 8, 2016

What changes were proposed in this pull request?

The current semantics of the warehouse config:

  1. It's a static config, which means you can't change it once your Spark application is launched.
  2. Once a database is created, its location won't change even if the warehouse path config is changed.
  3. The default database is a special case: although its location is fixed, the locations of tables created in it are not. If a Spark app starts with warehouse path B (while the location of the default database is A) and users create a table tbl in the default database, its location will be B/tbl instead of A/tbl. If users then change the warehouse path config to C and create another table tbl2, its location will still be B/tbl2 instead of C/tbl2.

Rule 3 doesn't make sense, and I think we introduced it by mistake rather than intentionally. Data source tables don't follow rule 3 and treat the default database like a normal one.

This PR fixes Hive serde tables to make them consistent with data source tables.
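The consistent rule described above can be sketched as follows. This is a minimal, self-contained model, not Spark's actual implementation; the names `Database` and `defaultTablePath` are illustrative only. The point is that a table's default path derives from the recorded location of its database, so later changes to the warehouse config cannot move new tables in the default database.

```scala
// Illustrative model only (not Spark's real code): a database records its
// location once, at creation time (rule 2), and a table's default path is
// resolved against that recorded location, never against the current
// warehouse config.
object DefaultTablePath {
  final case class Database(name: String, locationUri: String)

  // Depends only on the database's recorded location, so changing the
  // warehouse path config later cannot affect new tables in `default`.
  def defaultTablePath(db: Database, table: String): String =
    s"${db.locationUri}/$table"
}
```

For example, if the default database was created under warehouse path A, `defaultTablePath(Database("default", "/warehouse/A"), "tbl")` yields `/warehouse/A/tbl` regardless of what the warehouse config is later set to.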

How was this patch tested?

HiveSparkSubmitSuite

@cloud-fan (Contributor, Author)

Note that this is not a complete fix; CTAS and InMemoryCatalog are still broken. I'm sending this PR to get some discussion.

Do we really need to make the warehouse path a session-scoped runtime config? Does this config make sense when users connect to a remote Hive metastore?

A Spark app is (most of the time) not a long-running service but a one-shot program. It's hard to define and reason about the semantics of the warehouse path config; shall we only enable it for a local metastore?

cc @yhuai @srowen @rxin

@rxin (Contributor) commented Nov 8, 2016

I don't think it makes sense for it to be session specific, at least for now ...

A Member commented on the diff:
Don't you just want to assert this?

(Disregard my earlier comment; this doesn't relate to recent changes I made to the warehouse path.)

@yhuai (Contributor) commented Nov 8, 2016

Yeah, the warehouse location should not be session-specific. Since we propagate it to Hive, it is shared by all sessions.
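For context, `spark.sql.warehouse.dir` being a static config means it can only be supplied when the application starts. A sketch of the assumed usage (the path is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: the warehouse path is a static config, so it must be supplied
// when the SparkSession is built, not changed afterwards.
val spark = SparkSession.builder()
  .appName("warehouse-demo")
  .config("spark.sql.warehouse.dir", "/path/to/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Attempting to modify a static config at runtime is rejected:
// spark.conf.set("spark.sql.warehouse.dir", "/other/path")  // fails: static config
```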

@SparkQA commented Nov 8, 2016

Test build #68351 has finished for PR 15812 at commit fa61496.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-18360][SQL] warehouse path config should work for data source tables [SPARK-18360][SQL] default table path of tables in default database should depend on the location of default database Nov 16, 2016
@cloud-fan (Contributor, Author) commented on the diff:
Use TestHiveContext so that the Hive metadata is cleaned at the beginning.

@SparkQA commented Nov 16, 2016

Test build #68706 has finished for PR 15812 at commit bcd9d4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 16, 2016

Test build #68716 has finished for PR 15812 at commit bd8fc95.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 17, 2016

Test build #68749 has finished for PR 15812 at commit 27be481.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 17, 2016

Test build #68784 has finished for PR 15812 at commit 3e2073c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai (Contributor) commented Nov 18, 2016

LGTM. Merging to master and branch 2.1.

asfgit pushed a commit that referenced this pull request Nov 18, 2016
…hould depend on the location of default database


Author: Wenchen Fan <wenchen@databricks.com>

Closes #15812 from cloud-fan/default-db.

(cherry picked from commit ce13c26)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit closed this in ce13c26 Nov 18, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…hould depend on the location of default database


Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15812 from cloud-fan/default-db.