Spark: Add location overlap validation for SnapshotTableAction #14933
…ng destination table location from overlapping source table location. Resolves TODO comment in SnapshotTableSparkAction.java for Spark v4.1
Definitely needs a test case. I'm not sure why Windows would make testing impossible, since our locations are stored as strings.

Iceberg errors follow this format: "Cannot X because Y (possibly fix Z)". So in this case: Cannot create a snapshot at location ... because it would overlap with source table ... Overlapping snapshot and source would ....
Thanks for the feedback. All CI checks are passing now.
@RussellSpitzer

@RussellSpitzer PTAL

PTAL @pvary

// TODO: Check the dest table location does not overlap with the source table location
String sourceTableLocation = sourceTableLocation();
String actualDestTableLocation = (icebergTable.location());
maybe stagedTableLocation?
Also, please remove the extra parentheses.
That definitely looks more appropriate. I will commit it with the other changes.
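
For reference, here is a minimal sketch of what such an overlap check could look like, using the stagedTableLocation name suggested above and the "Cannot X because Y" message format. It is not the code from this PR: the class LocationOverlap, its method names, and the prefix-based definition of "overlap" (equal locations, or one nested inside the other) are all assumptions made for illustration, and it compares plain strings without resolving schemes or symlinks.

```java
// Hypothetical helper for illustration only; not part of Iceberg.
final class LocationOverlap {
  private LocationOverlap() {}

  // Drop any trailing slashes, then append exactly one, so that
  // "/a/b" and "/a/b/" compare equal and "/a/b" is not treated as a prefix of "/a/bc".
  static String withTrailingSlash(String location) {
    return location.replaceAll("/+$", "") + "/";
  }

  // Assumed definition: two locations overlap when they are equal
  // or when one is nested inside the other.
  static boolean overlaps(String first, String second) {
    String a = withTrailingSlash(first);
    String b = withTrailingSlash(second);
    return a.startsWith(b) || b.startsWith(a);
  }

  // Throws with a message in the "Cannot X because Y" style discussed in the review.
  static void validateNoOverlap(String stagedTableLocation, String sourceTableLocation) {
    if (overlaps(stagedTableLocation, sourceTableLocation)) {
      throw new IllegalArgumentException(
          String.format(
              "Cannot create a snapshot at location %s because it would overlap with source table location %s",
              stagedTableLocation, sourceTableLocation));
    }
  }
}
```

Normalizing the trailing slash before the prefix comparison is what keeps sibling directories such as .../source and .../source2 from being reported as overlapping.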
@TestTemplate
public void testSnapshotWithOverlappingLocation() throws IOException {
  String catalogType = catalogConfig.get(ICEBERG_CATALOG_TYPE);
  assumeThat(catalogType).isNotEqualTo(ICEBERG_CATALOG_TYPE_HADOOP);
Why don't we test with Hadoop?
"Why don't we test with Hadoop?"
The Hadoop catalog doesn't allow a custom destination location via .tableLocation(). When one is set, the action fails before it even reaches the overlap check, at StagedSparkTable stagedTable = stageDestTable();
Error message:
TestSnapshotTableAction > testSnapshotWithOverlappingLocation() > catalogName = testhadoop, implementation = org.apache.iceberg.spark.SparkCatalog, config = {type=hadoop, cache-enabled=false} FAILED java.lang.IllegalArgumentException: Cannot set a custom location for a path-based table. Expected file:/tmp/warehouse62334905503302693.tmp/default/table but got /tmp/junit-17415081344650343329/newJunit10343828823716601592
This comes from the HadoopCatalog withLocation() check:
public TableBuilder withLocation(String location) {
  Preconditions.checkArgument(
      location == null || location.equals(defaultLocation),
      "Cannot set a custom location for a path-based table. Expected "
          + defaultLocation
          + " but got "
          + location);
  return this;
}
Since we can't pass a custom location, and TestBase predefines the warehouse from which Hadoop always derives a non-overlapping location, the test seems useless for Hadoop unless we change the warehouse.
For the same reason, I skipped the Hadoop catalog in the tests that use .tableLocation().
Do let me know if I should approach this differently.
Best to just add a note here, "//Hadoop Catalogs do not Support Custom Table Locations"
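
To make the Hadoop catalog behavior discussed in this thread concrete, here is a small stand-alone sketch that reproduces the quoted withLocation() check in plain Java, without Guava or Iceberg on the classpath. The class name PathBasedTableBuilderSketch is hypothetical and only mirrors the logic of the snippet quoted above; it is not the real HadoopCatalog builder.

```java
// Hypothetical stand-in that mirrors the withLocation() check quoted in this thread.
final class PathBasedTableBuilderSketch {
  private final String defaultLocation;

  PathBasedTableBuilderSketch(String defaultLocation) {
    this.defaultLocation = defaultLocation;
  }

  // Accepts only null or the catalog-derived default location, like the quoted check.
  PathBasedTableBuilderSketch withLocation(String location) {
    if (location != null && !location.equals(defaultLocation)) {
      throw new IllegalArgumentException(
          "Cannot set a custom location for a path-based table. Expected "
              + defaultLocation
              + " but got "
              + location);
    }
    return this;
  }
}
```

Any location other than the derived default is rejected up front, so a test that passes a custom .tableLocation() fails before the overlap validation can run, which is why the skip plus a short comment is reasonable here.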
String validDestLocation = new Path(parentLocation, "newDestination").toUri().toString();
SparkActions.get()
    .snapshotTable(SOURCE_NAME)
    .as(tableName)
    .tableLocation(validDestLocation)
    .execute();
assertThat(sql("SELECT * FROM %s", tableName)).hasSize(2);
Do we need this test here?
Should this be a different test case for the happy path?
Along with the overlapping-location tests, a non-overlapping location should also be tested.
Yes, this assertion should not be in TestSnapshotWithOverlappingLocation, so I have split out TestSnapshotWithNonOverlappingLocation and moved the assertion into it.

@pvary I have addressed the review comments. Please take another look and let me know if you have any concerns.
import java.nio.file.Files;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.hadoop.fs.Path;
Not sure we need Hadoop's Path class here; let's stick to Java built-ins if we can.
My only remaining concern here is that we are using the Hadoop Path class, and I'm pretty sure we don't need that dependency here.
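
One possible way to build the destination location with Java built-ins instead of Hadoop's Path is sketched below. The helper class DestLocations and its method are hypothetical, and the sketch assumes parentLocation is a local filesystem path held as a String. Note that java.nio renders the result as a file: URI, so the string may differ in form from what the Hadoop Path version produces.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical test helper for illustration only; not part of the PR.
final class DestLocations {
  private DestLocations() {}

  // Resolves childName under parentLocation using java.nio only and
  // returns the result as a URI string (with a "file:" scheme).
  static String childLocation(String parentLocation, String childName) {
    Path child = Paths.get(parentLocation).resolve(childName);
    return child.toUri().toString();
  }
}
```

In the test this could replace the Hadoop-based line with something like DestLocations.childLocation(parentLocation, "newDestination"), which would let the org.apache.hadoop.fs.Path import be dropped.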
Description
Implements location overlap validation for SnapshotTableAction to prevent the destination table location from overlapping with the source table location.
Resolves the TODO comment in SnapshotTableSparkAction.java:127 in Spark v4.1.
Type of change
How Has This Been Tested?
Checklist