Update iceberg-spark to use HiveTables #239
Conversation
      HadoopTables tables = new HadoopTables(conf);
      return tables.load(path.get());
    } else {
      HiveTables tables = new HiveTables(conf);
HiveTables creates a connection pool that we will need to close. Maybe we don't need to worry about this for now because Spark 3.0 will use a catalog with a more reasonable life-cycle, but it seems like a bad idea to leak connections. It is also difficult to close these instances because a new one is created every time.
What about creating a static cache and getting a HiveTables instance based on the value of hive.metastore.uris?
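One way the suggested cache could look (a hypothetical sketch, not code from this PR; TABLES_CACHE and cachedHiveTables are illustrative names):

    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.iceberg.hive.HiveTables;

    // Sketch of the suggestion: reuse one HiveTables per metastore URI so that
    // connection pools are shared instead of leaked on every instantiation.
    private static final ConcurrentHashMap<String, HiveTables> TABLES_CACHE =
        new ConcurrentHashMap<>();

    static HiveTables cachedHiveTables(Configuration conf) {
      // an empty key stands in for local mode, where no metastore URI is set
      String uris = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
      return TABLES_CACHE.computeIfAbsent(uris, ignored -> new HiveTables(conf));
    }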
@aokolnychyi, nice work! It is ready other than the connection problem. I think we need to add a static cache to avoid creating too many HMS connections.
    import org.junit.Test;

    import static org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS;
    import static org.apache.iceberg.types.Types.NestedField.optional;
Baseline may frown at this
We disable AvoidStaticImport for tests in checkstyle-suppressions.xml. I still try to avoid static imports, but the statement would be too long in this particular case.
Makes sense!
@aokolnychyi, I merged #240, so could you rebase and update this to use HiveCatalog?
force-pushed from 03d1cc4 to a7e68c2
force-pushed from a7e68c2 to 6a8716f
force-pushed from 6a8716f to 9b8b3a6
    public static HiveCatalog loadCatalog(Configuration conf) {
      // metastore URI can be null in local mode
      String metastoreUri = conf.get(HiveConf.ConfVars.METASTOREURIS.varname, "");
      return CATALOG_CACHE.get(metastoreUri, uri -> new HiveCatalog(conf));
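For context, the CATALOG_CACHE referenced above is not shown in this hunk; a minimal declaration, assuming the Caffeine cache library, might look like:

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    // hypothetical declaration: one HiveCatalog per metastore URI, matching the
    // CATALOG_CACHE.get(uri, loader) call above
    private static final Cache<String, HiveCatalog> CATALOG_CACHE =
        Caffeine.newBuilder().build();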
This might be a bit problematic as we cache the Hadoop conf. Let me think about possible implications.
Okay, one case where this won't work is with hadoop.-prefixed data source options.
Building a Configuration -> HiveCatalog cache doesn't seem like an option either.
We're okay with that for 2.4 support. This will be improved when using a real catalog for Spark 3.0. Also, the hadoop.-prefixed options are mostly for configuring the read or write, not for controlling the metastore.
@rdblue, you are right. However, hadoop.-prefixed data source options can be used to set iceberg.compress.metadata if we want to compress metadata only for particular tables.
Ideally, iceberg.compress.metadata should be a table property; it is the only entry in ConfigProperties. The problem is that HadoopTables requires this config upfront to find tables. Maybe we can circumvent this by trying files with and without the gz suffix, as sketched below.
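A rough sketch of that suffix-probing idea (the file names and the resolveMetadataFile helper are assumptions for illustration, not code from this PR):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Probe for the compressed metadata file first, then fall back to the
    // uncompressed name, instead of requiring iceberg.compress.metadata upfront.
    static Path resolveMetadataFile(FileSystem fs, Path metadataDir, int version)
        throws IOException {
      Path compressed = new Path(metadataDir, "v" + version + ".metadata.json.gz");
      if (fs.exists(compressed)) {
        return compressed;
      }
      return new Path(metadataDir, "v" + version + ".metadata.json");
    }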
This PR enables support for HiveTables in IcebergSource and resolves #8.