[SUPPORT] Hudi spark datasource error after migrate from 0.8 to 0.11 #5861

@kk17

Description

Describe the problem you faced

After upgrading Hudi from 0.8 to 0.11, reading a Hudi table with spark.table(fullTableName) no longer works. The table has been synced to the Hive metastore and Spark is connected to that metastore. The error is:

org.sparkproject.guava.util.concurrent.UncheckedExecutionException: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
at org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.

...

Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 'hoodie.datasource.read.paths' , default: null description: Comma separated list of file paths to read within a Hudi table. since version: version is not defined deprecated after: version is not defined)' or both must be specified.
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
	at org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
	at org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
	at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
	at org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
	at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
	at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
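
For reference, a minimal read that triggers the error. This is a sketch, not the exact job; the table name is taken from the SHOW CREATE TABLE output later in this issue.

// Read a Hive-synced Hudi table by name through the session catalog.
// "ods.track_signup" is the table shown below; adjust as needed.
val df = spark.table("ods.track_signup")
df.show(5)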

To Reproduce

Steps to reproduce the behavior:

  1. Using Hudi 0.8, create a Hudi table and sync it to the Hive metastore using Hive JDBC sync mode (a sketch follows this list).
  2. Upgrade Hudi to 0.11.
  3. Add a new column to the table and sync it to the Hive metastore using Hive JDBC sync mode.
  4. Read the table using spark.table.
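
A hedged sketch of steps 1 and 3, assuming typical Hudi datasource options; the table name, record key, partition field, JDBC URL, and S3 path are placeholders modeled on the outputs below. Note that hoodie.datasource.hive_sync.mode = jdbc is the 0.11-era option; on 0.8 the equivalent was hoodie.datasource.hive_sync.use_jdbc = true.

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Toy batch with the record key ("id") and partition column ("dt") used below.
// Step 3 is the same write after adding a new column to this DataFrame.
val df = Seq(("1", "signup", "2022-06-13")).toDF("id", "act", "dt")

// Write the table and sync it to the Hive metastore over JDBC.
df.write.format("hudi").
  option("hoodie.table.name", "track_signup").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.hive_sync.enable", "true").
  option("hoodie.datasource.hive_sync.mode", "jdbc").
  option("hoodie.datasource.hive_sync.database", "ods").
  option("hoodie.datasource.hive_sync.table", "track_signup").
  option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://<hive-server>:10000").
  option("hoodie.datasource.hive_sync.partition_fields", "dt").
  mode(SaveMode.Append).
  save("s3://xxxx/track_signup")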

Expected behavior

Reading the table should succeed.

Environment Description

  • Hudi version : 0.11

  • Spark version : 3.1.2

  • Hive version : 3.1.2

  • Hadoop version : 3.1.2

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

Additional context

We are using Hive JDBC sync mode to sync the Hudi table to the Hive metastore. Before we upgraded Hudi to 0.11, we would get an error for the SHOW CREATE TABLE command. After we upgraded Hudi to 0.11, we added one new column to the table, and the error appeared after the column was added. Running SHOW CREATE TABLE in spark-sql after the error succeeds, but the returned CREATE TABLE statement has no LOCATION. In Hive itself, both SHOW CREATE TABLE and SELECT statements work fine.
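
The missing LOCATION is consistent with the HoodieException above: Spark resolves the table from the catalog without any path, so Hudi's DefaultSource has neither 'path' nor hoodie.datasource.read.paths. A sketch of two ways to work around or inspect the catalog while it is in this state (path and table name as in this issue):

// Workaround: read by base path, bypassing the catalog entry.
val byPath = spark.read.format("hudi").load("s3://xxxx/track_signup")

// Diagnostic: check what Spark actually resolves for the table;
// the Location row should be empty or missing in the broken state.
spark.sql("DESCRIBE FORMATTED ods.track_signup").show(100, false)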

After I dropped the Hive table and reran the Hive sync, reading worked again.
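
A minimal sketch of that recovery, assuming the drop is issued through Spark SQL and the resync is triggered by the next Hudi write with Hive sync enabled (as in the write sketch above):

// Drop the stale metastore entry; the Hudi data on S3 is untouched
// because the synced table is external.
spark.sql("DROP TABLE IF EXISTS ods.track_signup")
// The next write with hoodie.datasource.hive_sync.enable=true recreates the
// Hive table, this time with LOCATION and the Spark-facing table properties.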

Before the Hive sync rerun (no LOCATION or OPTIONS clause in the output):

spark-sql> show create table ods.track_signup;
CREATE TABLE `ods`.`track_signup` (
  `_hoodie_commit_time` STRING,
  `_hoodie_commit_seqno` STRING,
  `_hoodie_record_key` STRING,
  `_hoodie_partition_path` STRING,
  `_hoodie_file_name` STRING,
  `act` STRING,
  `time` BIGINT,
  `env` STRING,
  `id` STRING,
  `seer_time` STRING,
  `hh` STRING,
  `app_id` INT,
  `ip` STRING,
  `g` STRING,
  `u` STRING,
  `ga_id` STRING,
  `app_version` STRING,
  `platform` STRING,
  `url` STRING,
  `referer` STRING,
  `medium` STRING,
  `source` STRING,
  `campaign` STRING,
  `stage` STRING,
  `content` STRING,
  `term` STRING,
  `lang` STRING,
  `su` STRING,
  `campaign_track_id` STRING,
  `last_component_id` STRING,
  `regSourceId` STRING,
  `dt` STRING)
USING hudi
PARTITIONED BY (dt)
TBLPROPERTIES (
  'bucketing_version' = '2',
  'last_modified_time' = '1655107146',
  'last_modified_by' = 'hive',
  'last_commit_time_sync' = '20220613152622014')

After the Hive sync rerun (the LOCATION and OPTIONS clauses are now present):

spark-sql> show create table ods.track_signup;
CREATE TABLE `ods`.`track_signup` (
  `_hoodie_commit_time` STRING,
  `_hoodie_commit_seqno` STRING,
  `_hoodie_record_key` STRING,
  `_hoodie_partition_path` STRING,
  `_hoodie_file_name` STRING,
  `act` STRING COMMENT 'xxx',
  `time` BIGINT COMMENT 'xxx',
  `env` STRING COMMENT 'xxx',
  `id` STRING COMMENT 'xxx',
  `seer_time` STRING COMMENT 'xxx',
  `hh` STRING,
  `app_id` INT COMMENT 'xxx',
  `ip` STRING COMMENT 'xxx',
  `g` STRING COMMENT 'xxx',
  `u` STRING COMMENT 'xxx',
  `ga_id` STRING COMMENT 'xxx',
  `app_version` STRING COMMENT 'xxx',
  `platform` STRING COMMENT 'xxx',
  `url` STRING COMMENT 'xxx',
  `referer` STRING COMMENT 'xxx',
  `medium` STRING COMMENT 'xxx',
  `source` STRING COMMENT 'xxx',
  `campaign` STRING COMMENT 'xxx',
  `stage` STRING COMMENT 'xxx',
  `content` STRING COMMENT 'xxx',
  `term` STRING COMMENT 'xxx',
  `lang` STRING COMMENT 'xxx',
  `su` STRING COMMENT 'xxx',
  `campaign_track_id` STRING COMMENT 'xxx',
  `last_component_id` STRING COMMENT 'xxx',
  `regSourceId` STRING,
  `dt` STRING)
USING hudi
OPTIONS (
  `hoodie.query.as.ro.table` 'false')
PARTITIONED BY (dt)
LOCATION 's3://xxxx/track_signup'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'last_modified_time' = '1655134599',
  'last_modified_by' = 'hive',
  'last_commit_time_sync' = '20220613153932664')

Labels

area:sql (SQL interfaces), priority:high (Significant impact; potential bugs), status:triaged (Issue has been reviewed and categorized)
