@joyhaldar
Contributor

@joyhaldar joyhaldar commented Dec 28, 2025

The current commit path loads the BigQuery table twice:

  1. During table refresh to get metadata location
  2. During commit to get ETag for the update call

This change stores the table from the refresh step and reuses it during commit, eliminating the redundant load. Concurrent modification detection remains intact via ETag-based optimistic locking in the BigQuery API.

BigQuery API calls per commit:

| Before | After |
| --- | --- |
| doRefresh → loads table | doRefresh → loads table |
| updateTable → loads table again | reuses table from refresh |

This improves commit latency and reduces tables.get quota consumption.

Changes:

  • Store table loaded during refresh for reuse during commit
  • Remove metadata location comparison which is redundant with ETag check
  • Update test to verify ETag-based conflict detection
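
The caching pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchTable`, `CountingClient`, and `CachingOps` are hypothetical stand-ins for the BigQuery table, the client, and the table operations class, and clearing the cached table after commit is an assumption of the sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the BigQuery Table; only the fields this sketch needs.
final class SketchTable {
  final String etag;
  final String metadataLocation;

  SketchTable(String etag, String metadataLocation) {
    this.etag = etag;
    this.metadataLocation = metadataLocation;
  }
}

// Hypothetical client that counts load() calls, standing in for tables.get.
final class CountingClient {
  final AtomicInteger loads = new AtomicInteger();
  String lastEtagSent; // ETag passed with the most recent update
  private SketchTable stored = new SketchTable("etag-1", "v1");

  SketchTable load() {
    loads.incrementAndGet();
    return stored;
  }

  // Conflict checking is intentionally left out of this sketch.
  void update(String etag, String newMetadataLocation) {
    lastEtagSent = etag;
    stored = new SketchTable("etag-2", newMetadataLocation);
  }
}

// Hypothetical operations class: refresh() caches the table, commit() reuses it.
final class CachingOps {
  private final CountingClient client;
  private SketchTable cached;

  CachingOps(CountingClient client) {
    this.client = client;
  }

  String refresh() {
    cached = client.load(); // the only load in a refresh/commit cycle
    return cached.metadataLocation;
  }

  void commit(String newMetadataLocation) {
    if (cached == null) {
      throw new IllegalStateException("refresh() must run before commit()");
    }
    // The ETag comes from the cached table, so no second load is needed.
    client.update(cached.etag, newMetadataLocation);
    cached = null; // assumption of this sketch: force a fresh load before the next commit
  }
}
```

Calling `refresh()` and then `commit("v2")` on a `CachingOps` leaves the client's load counter at 1, which is the property the updated test is meant to check.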

… calls

Cache the Table object loaded in doRefresh() for reuse in updateTable(),
eliminating a redundant tables.get call per commit. Concurrent modification
detection is preserved via ETag-based optimistic locking in tables.patch.
@github-actions github-actions bot added the GCP label Dec 28, 2025
  try {
-   metadataLocation =
-       loadMetadataLocationOrThrow(client.load(tableReference).getExternalCatalogTableOptions());
+   Table table = client.load(tableReference);
Member

why do we need this local variable?

Contributor Author

@joyhaldar joyhaldar Dec 31, 2025

> why do we need this local variable?

Thank you for your review, Manu. I used the local variable for readability, but I'm happy to inline it if you think that's better.

ExternalCatalogTableOptions options = table.getExternalCatalogTableOptions();
addConnectionIfProvided(table, metadata.properties());

// If `metadataLocationFromMetastore` is different from metadata location of base, it means
Member

why is this check removed?

Contributor Author

> why is this check removed?

Thank you for your review, Manu.

This check becomes redundant with caching.

Before:

  1. doRefresh() loads table -> metadata location = "v1"
  2. Someone else commits -> metadata location = "v2"
  3. updateTable() loads table again -> sees "v2"
  4. Check catches: "v1" != "v2" -> fail

With caching:

  1. doRefresh() loads table -> metadata location = "v1", cached
  2. Someone else commits -> metadata location = "v2"
  3. updateTable() uses cached table -> still sees "v1"
  4. Check passes: "v1" == "v1" (compares against itself)
  5. tables.patch fails with HTTP 412 (ETag mismatch) -> Iceberg retries

The ETag check in tables.patch catches the same conflict, so this check no longer adds value.
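
To make the "With caching" sequence concrete, here is a small in-memory sketch of ETag-based optimistic locking. It does not call the BigQuery API: `EtagStore`, `EtagTable`, and `EtagConflictDemo` are hypothetical names, and throwing `ConcurrentModificationException` stands in for the HTTP 412 response that tables.patch returns on an ETag mismatch.

```java
import java.util.ConcurrentModificationException;

// Hypothetical stand-in for a table snapshot with its ETag.
final class EtagTable {
  final String etag;
  final String metadataLocation;

  EtagTable(String etag, String metadataLocation) {
    this.etag = etag;
    this.metadataLocation = metadataLocation;
  }
}

// Hypothetical in-memory store that mimics a conditional (If-Match style) update.
final class EtagStore {
  private EtagTable current = new EtagTable("etag-1", "v1");
  private int version = 1;

  synchronized EtagTable load() {
    return current;
  }

  // Rejects the write when the caller's ETag is stale (the 412 case in the steps above).
  synchronized EtagTable patch(String expectedEtag, String newMetadataLocation) {
    if (!current.etag.equals(expectedEtag)) {
      throw new ConcurrentModificationException(
          "ETag mismatch: expected " + expectedEtag + " but found " + current.etag);
    }
    version++;
    current = new EtagTable("etag-" + version, newMetadataLocation);
    return current;
  }
}

final class EtagConflictDemo {
  // Returns true when a commit made with a stale cached table is rejected.
  static boolean staleCommitIsRejected() {
    EtagStore store = new EtagStore();
    EtagTable cached = store.load();        // step 1: refresh caches "v1"
    store.patch(store.load().etag, "v2");   // step 2: someone else commits "v2"
    try {
      store.patch(cached.etag, "v3");       // step 5: patch with the stale ETag
      return false;
    } catch (ConcurrentModificationException e) {
      return true;                          // the caller would refresh and retry here
    }
  }

  public static void main(String[] args) {
    System.out.println("stale commit rejected: " + staleCommitIsRejected());
  }
}
```

In this sketch the conflict surfaces at write time rather than at load time, which is why a separate metadata location comparison against the cached table would add nothing.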


  @Test
- public void failWhenMetadataLocationDiff() throws Exception {
+ public void failWhenConcurrentModificationDetected() throws Exception {
Member

do you verify table is only loaded once?

Contributor Author

Thank you for the review, Manu. Sorry about that. I have added verification in this commit to confirm that the table is loaded only once.

Verify table is loaded only once in test