[BEAM-11277] Respect schemaUpdateOptions during BigQuery load with temporary tables #14113
Conversation
When using temporary tables to append data to an existing table, first update the schema of the destination table if schema field addition or relaxation is allowed in schemaUpdateOptions. This needs to be done as a separate step because BQ copy jobs do not support schema updates when appending to an existing table. WIP because submitting a load job with an empty files list does not work. Pointing to an empty file in GCS does work, but that empty file then needs to be created.
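As a sketch of that workaround: an empty in-memory file can stand in for the zero-row source, since the resulting load job appends nothing and only applies the allowed schema changes. The configuration keys below follow the BigQuery jobs.insert REST API; the helper function itself is hypothetical, not Beam code.

```python
import io

def zero_row_load_config(destination, schema_fields, schema_update_options):
    # Load-job configuration in the BigQuery jobs.insert REST API shape.
    # A zero-row CSV load with these options appends no data but applies
    # the allowed schema changes to the destination table.
    return {
        "load": {
            "destinationTable": destination,
            "schema": {"fields": schema_fields},
            "sourceFormat": "CSV",
            "writeDisposition": "WRITE_APPEND",
            "schemaUpdateOptions": schema_update_options,
        }
    }

# Empty in-memory "file" standing in for the zero-row CSV; uploading an
# empty object to GCS and pointing the job at it has the same effect.
empty_csv = io.BytesIO(b"")

config = zero_row_load_config(
    # hypothetical table reference for illustration
    destination={"projectId": "my-project", "datasetId": "d", "tableId": "t"},
    schema_fields=[{"name": "bytes", "type": "BYTES", "mode": "NULLABLE"}],
    schema_update_options=["ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"],
)
```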
Run Python 3.8 PostCommit

location=temp_table_load_job_reference.location)
temp_table_schema = temp_table_load_job.configuration.load.schema
Does it make sense to compare the schema of the destination table with the schema of the temp table job? We'd save one load, right?
Added a simple comparison: destination_table.schema == temp_table_schema. It works for trivial cases but doesn't catch cases where the order of fields in a record differs. E.g., the following schemas are different according to == even though the temp table can be appended directly to the destination table without error.
<TableSchema
fields: [
<TableFieldSchema fields: [], name: 'bytes', type: 'BYTES'>,
<TableFieldSchema fields: [], name: 'date', type: 'DATE'>,
<TableFieldSchema fields: [], name: 'time', type: 'TIME'>
]>
<TableSchema
fields: [
<TableFieldSchema fields: [], name: 'date', type: 'DATE'>,
<TableFieldSchema fields: [], name: 'time', type: 'TIME'>,
<TableFieldSchema fields: [], name: 'bytes', type: 'BYTES'>
]>
I can probably write a function to check the schema recursively but do you know if one already exists?
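For reference, a recursive, order-insensitive comparison along those lines could look like the sketch below. It operates on schemas in the BigQuery REST API dict shape (lists of field dicts with name/type/mode/fields); `schemas_equivalent` is a hypothetical helper, not an existing Beam or BQ function.

```python
def schemas_equivalent(a_fields, b_fields):
    """Return True if two BigQuery schemas contain the same fields,
    ignoring field order and recursing into nested RECORD fields.

    Each argument is a list of field dicts in the BigQuery REST API
    shape: {"name": ..., "type": ..., "mode": ..., "fields": [...]}.
    """
    if len(a_fields) != len(b_fields):
        return False
    b_by_name = {f["name"]: f for f in b_fields}
    for a in a_fields:
        b = b_by_name.get(a["name"])
        if b is None:
            return False
        if a.get("type") != b.get("type"):
            return False
        # BigQuery treats a missing mode as NULLABLE.
        if a.get("mode", "NULLABLE") != b.get("mode", "NULLABLE"):
            return False
        if not schemas_equivalent(a.get("fields") or [], b.get("fields") or []):
            return False
    return True

# The two schemas from the comment above differ only in field order, so
# they compare equivalent here even though == would call them different.
dest = [
    {"name": "bytes", "type": "BYTES"},
    {"name": "date", "type": "DATE"},
    {"name": "time", "type": "TIME"},
]
temp = [
    {"name": "date", "type": "DATE"},
    {"name": "time", "type": "TIME"},
    {"name": "bytes", "type": "BYTES"},
]
assert schemas_equivalent(dest, temp)
```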
apache#14113 (comment): Reusing a single PCollection concentrates the different paths into a single stage, which complicates the firing of triggers for that stage.
..when destination table schema matches temp table schema.
Run Python 3.8 PostCommit
@kmjung @vachan-shetty are you aware of a function in Beam or in the BQ APIs that can help us compare two schemas?

I'm not aware of anything like this in open source, no, although I will admit to not being very familiar with the Python SDK.

Merging this for now. @chunyang feel free to add the schema matching functionality, or please create a JIRA issue if you can so we won't forget about it. Thanks a lot for implementing this. It's great to have it.
The problem: Schema update options are not respected when using temporary tables to load data into BigQuery (e.g., when the number of files to load necessitates multiple load jobs). Unlike query and load jobs, BigQuery copy jobs do not allow field addition or relaxation when appending data to an existing table.
The solution: Before starting the copy jobs that move data from the temporary tables to the final destination table(s), run zero-row load jobs against the destination table(s) using the temporary table schemas and the user-provided schema update options. These zero-row load jobs update (if needed) the schema of the destination table(s) so they can accept the data from the forthcoming copy jobs.
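Whether such a zero-row job is needed at all can be sketched as a small decision helper (hypothetical names; a plain shortcut, not the actual Beam transform logic):

```python
def needs_schema_mod_job(schema_update_options, dest_schema, temp_schema):
    """Decide whether a zero-row schema-update load job must run before
    copying a temp table into the destination table (hypothetical helper)."""
    if not schema_update_options:
        # No updates allowed: the copy job must work against the
        # destination schema as-is, so there is nothing to do here.
        return False
    if dest_schema == temp_schema:
        # Schemas already match; the copy job will succeed directly.
        return False
    return True
```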
If no schema update options are configured, then no zero-row load jobs will be run; UpdateDestinationSchema and WaitForSchemaModJobs become no-ops.

Summary of changes:
- BigQueryWrapper.perform_load_jobs: allow triggering load jobs using either a list of source URIs or a single file-like object (for this PR, the file-like object is an empty, zero-row CSV file).

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make the review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.