Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Aug 22, 2024

While reviewing #10288, I realized we don't have a reliable way to load Iceberg table state as Dataset in Spark. We shouldn't use load(table.name()) as it is not clear if the name already includes the catalog name. This PR extends what we currently do for metadata tables to regular tables.


public static Dataset<Row> loadTable(SparkSession spark, Table table, long snapshotId) {
SparkTable sparkTable = new SparkTable(table, snapshotId, false);
DataSourceV2Relation relation = createRelation(sparkTable, ImmutableMap.of());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The snapshotId(and timestamp) could also be supplied as an option in the future Spark versions.
Should we have an method to take options as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually bypass the resolution completely and manually create Dataset in this case.

Copy link
Contributor Author

@aokolnychyi aokolnychyi Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to pass options in the future but let's add that once there is a use case (we will simply overload this method).

required(5, "stringCol", Types.StringType.get()));

@TestTemplate
public void testLoadingTableDirectly() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test would previously fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Should we move this test to org.apache.iceberg.spark.TestSparkTableUtil

Copy link
Contributor Author

@aokolnychyi aokolnychyi Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it belongs here as it is important to check the action can be invoked without loading tables via the Spark catalog (as that one will set the catalog name correctly).

This is the only test that goes via validationCatalog.

Copy link
Contributor

@karuppayya karuppayya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm,
left a nitpick, thanks @aokolnychyi for the change and @nastra for reviewing.

@aokolnychyi aokolnychyi merged commit 5864850 into apache:main Aug 24, 2024
@aokolnychyi
Copy link
Contributor Author

Thanks, @karuppayya @nastra!

dramaticlly added a commit to dramaticlly/iceberg that referenced this pull request Sep 11, 2024
backport of apache#10984, tests can be backport in together with apache#11106
dramaticlly added a commit to dramaticlly/iceberg that referenced this pull request Sep 13, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants