BigQuery: Reuse table from refresh during commit to reduce API calls #14940
base: main
Conversation
Cache the Table object loaded in doRefresh() for reuse in updateTable(), eliminating a redundant tables.get call per commit. Concurrent modification detection is preserved via ETag-based optimistic locking in tables.patch.
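The caching idea in this description can be sketched with a minimal stand-in. Note the names here (`FakeClient`, `CachedRefreshSketch`, `refresh`, `commit`) are hypothetical stand-ins for the real BigQuery metastore client and TableOperations, used only to make the call-count saving visible:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CachedRefreshSketch {
    // Stand-in for the BigQuery Table resource (real one carries an ETag).
    static class Table {
        final String metadataLocation;
        Table(String metadataLocation) {
            this.metadataLocation = metadataLocation;
        }
    }

    // Counts tables.get calls so the saving is observable.
    static class FakeClient {
        final AtomicInteger loads = new AtomicInteger();
        Table load() {
            loads.incrementAndGet();
            return new Table("gs://bucket/v1.metadata.json");
        }
    }

    final FakeClient client = new FakeClient();
    Table cachedTable; // populated by refresh, reused by commit

    // Before the change: refresh and commit each called client.load() (2 calls).
    // After the change: commit reuses the table cached during refresh (1 call).
    void refresh() {
        cachedTable = client.load();
    }

    void commit() {
        Table table = cachedTable != null ? cachedTable : client.load();
        // ... build the patched table from `table` and call tables.patch,
        // which enforces the ETag captured at refresh time ...
    }

    public static void main(String[] args) {
        CachedRefreshSketch ops = new CachedRefreshSketch();
        ops.refresh();
        ops.commit();
        System.out.println("tables.get calls: " + ops.client.loads.get());
    }
}
```

Under these assumptions, a refresh-then-commit cycle performs one `tables.get` instead of two.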
```diff
 try {
-  metadataLocation =
-      loadMetadataLocationOrThrow(client.load(tableReference).getExternalCatalogTableOptions());
+  Table table = client.load(tableReference);
```
why do we need this local variable?
Thank you for your review, Manu. I used the local variable for readability, but I'm happy to inline it if you think that's a good idea.
```diff
 ExternalCatalogTableOptions options = table.getExternalCatalogTableOptions();
 addConnectionIfProvided(table, metadata.properties());

-// If `metadataLocationFromMetastore` is different from metadata location of base, it means
```
why is this check removed?
Thank you for your review, Manu. This check becomes redundant with caching:
Before:
- doRefresh() loads table -> metadata location = "v1"
- Someone else commits -> metadata location = "v2"
- updateTable() loads table again -> sees "v2"
- Check catches: "v1" != "v2" -> fail
With caching:
- doRefresh() loads table -> metadata location = "v1", cached
- Someone else commits -> metadata location = "v2"
- updateTable() uses cached table -> still sees "v1"
- Check passes: "v1" == "v1" (compares against itself)
- tables.patch fails with HTTP 412 (ETag mismatch) -> Iceberg retries
The ETag check in tables.patch catches the same conflict, so this check no longer adds value.
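The conflict-detection path described above can be sketched as follows. This is a minimal model, not the real BigQuery client: `Server`, `PreconditionFailed`, and the etag strings are hypothetical stand-ins for `tables.get`/`tables.patch` and HTTP 412:

```java
import java.util.concurrent.atomic.AtomicReference;

public class EtagPatchSketch {
    // Stand-in for the HTTP 412 Precondition Failed response.
    static class PreconditionFailed extends RuntimeException {}

    static class Server {
        // (etag, metadataLocation) pair, swapped atomically on each patch.
        final AtomicReference<String[]> state =
            new AtomicReference<>(new String[] {"etag-v1", "v1"});

        String[] get() { return state.get(); }

        // Patch succeeds only if the caller's ETag still matches the
        // server's copy; otherwise another writer committed first.
        void patch(String expectedEtag, String newLocation) {
            String[] current = state.get();
            if (!current[0].equals(expectedEtag)) {
                throw new PreconditionFailed();
            }
            state.set(new String[] {"etag-" + newLocation, newLocation});
        }
    }

    public static void main(String[] args) {
        Server server = new Server();

        String[] snapshot = server.get();      // doRefresh: etag-v1 / v1
        server.patch("etag-v1", "v2");         // a concurrent writer wins

        try {
            server.patch(snapshot[0], "v3");   // our commit uses stale etag-v1
        } catch (PreconditionFailed e) {
            System.out.println("412: retrying with fresh state");
            String[] fresh = server.get();     // retry re-reads, as Iceberg does
            server.patch(fresh[0], "v3");
        }
        System.out.println("final location: " + server.get()[1]);
    }
}
```

The stale commit fails at the server rather than at a client-side location comparison, which is why the removed check added no extra protection.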
```diff
 @Test
-public void failWhenMetadataLocationDiff() throws Exception {
+public void failWhenConcurrentModificationDetected() throws Exception {
```
do you verify table is only loaded once?
Thank you for the review, Manu. Sorry about that; I have added verification to confirm the table is loaded only once in this commit.
Verify table is loaded only once in test
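One way to verify a single load is with a counting fake, sketched below; the actual test may instead use a mocking framework's call verification (e.g. Mockito's `verify(client, times(1))`). `CountingClient` and the table-reference string are hypothetical:

```java
public class LoadOnceCheck {
    // Hand-rolled fake that records how many times load() is called.
    static class CountingClient {
        int loadCalls = 0;
        String load(String ref) {
            loadCalls++;
            return "table:" + ref;
        }
    }

    public static void main(String[] args) {
        CountingClient client = new CountingClient();

        String table = client.load("db.tbl"); // doRefresh caches the table
        String reused = table;                // updateTable reuses the cache

        // Fails if the commit path were to load the table a second time.
        if (client.loadCalls != 1) {
            throw new AssertionError("expected exactly one load, saw " + client.loadCalls);
        }
        System.out.println("loads = " + client.loadCalls);
    }
}
```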
The current commit path loads the BigQuery table twice:

- doRefresh → loads table
- updateTable → loads table again

This change stores the table from the refresh step and reuses it during commit, eliminating the redundant load. Concurrent modification detection remains intact via ETag-based optimistic locking in the BigQuery API.

BigQuery API calls per commit: one fewer tables.get. This improves commit latency and reduces tables.get quota consumption.

Changes: