@satybald
Contributor

@satybald satybald commented Jul 28, 2021

When multiple load jobs are needed to write data to a destination table, e.g., when the data is spread over more than 10,000 URIs, WriteToBigQuery in FILE_LOADS mode writes the data into temporary tables and then updates the schema of the temporary tables if schema additions are allowed.

However, the update of the temporary table schema does not respect the specified source format of the files being loaded (e.g., JSON, AVRO). The default source format for a BQ load job is CSV, which causes jobs with a nested schema to fail with the error:

"message": "Cannot load CSV data with a nested schema. Field: nested_field",

In theory, it doesn't matter which source format is specified (other than CSV), as the load job request doesn't have any source URIs.
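
To make the problem concrete, here is a minimal sketch that builds the body of a schema-update load job as a plain dictionary. The field names (sourceFormat, sourceUris, schemaUpdateOptions, writeDisposition, destinationTable) follow the BigQuery REST API's load configuration; the helper names, table reference, and schema are hypothetical and are not the code changed in this PR. It only illustrates that a request with no source URIs still falls back to CSV unless a source format is set explicitly.

```python
def build_schema_update_load_config(table_ref, schema_fields, source_format=None):
    """Return a load-job body that only applies a schema (hypothetical helper).

    Field names mirror the BigQuery REST API load configuration.
    """
    load_config = {
        "destinationTable": table_ref,
        "schema": {"fields": schema_fields},
        # Per the description above, the request carries no source URIs.
        "sourceUris": [],
        "writeDisposition": "WRITE_APPEND",
        "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"],
    }
    if source_format is not None:
        load_config["sourceFormat"] = source_format
    return {"load": load_config}


def effective_source_format(job_config):
    """A load job with no sourceFormat is treated as CSV."""
    return job_config["load"].get("sourceFormat", "CSV")


# Illustrative values only (assumptions, not taken from the PR).
table_ref = {
    "projectId": "my-project",
    "datasetId": "my_dataset",
    "tableId": "temp_table",
}
nested_schema = [
    {
        "name": "nested_field",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [{"name": "fruit", "type": "STRING", "mode": "NULLABLE"}],
    }
]

without_format = build_schema_update_load_config(table_ref, nested_schema)
with_format = build_schema_update_load_config(
    table_ref, nested_schema, source_format="NEWLINE_DELIMITED_JSON"
)
```

In this sketch, without_format resolves to CSV, which is the case BigQuery rejects for a nested schema, while with_format carries the format that was specified.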

cc: @pabloem @aaltay @tvalentyn

@codecov

codecov bot commented Jul 28, 2021

Codecov Report

Merging #15237 (3b994d6) into master (205fbb1) will decrease coverage by 0.00%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master   #15237      +/-   ##
==========================================
- Coverage   83.83%   83.83%   -0.01%     
==========================================
  Files         441      441              
  Lines       59706    59709       +3     
==========================================
+ Hits        50057    50059       +2     
- Misses       9649     9650       +1     
Impacted Files Coverage Δ
...s/python/apache_beam/io/gcp/bigquery_file_loads.py 87.58% <80.00%> (+0.08%) ⬆️
sdks/python/apache_beam/utils/interactive_utils.py 87.80% <0.00%> (-7.32%) ⬇️
sdks/python/apache_beam/io/source_test_utils.py 88.47% <0.00%> (-1.39%) ⬇️
...hon/apache_beam/runners/worker/bundle_processor.py 93.26% <0.00%> (-0.38%) ⬇️
...eam/runners/interactive/interactive_environment.py 90.33% <0.00%> (-0.38%) ⬇️
...ks/python/apache_beam/runners/worker/data_plane.py 92.42% <0.00%> (+1.81%) ⬆️
...hon/apache_beam/runners/direct/test_stream_impl.py 96.26% <0.00%> (+2.23%) ⬆️


@aaltay aaltay requested a review from pabloem July 29, 2021 01:39
  job_name = '%s_%s_%s' % (schema_mod_job_name_prefix, destination_hash, uid)

- _LOGGER.debug(
+ _LOGGER.info(
Contributor Author

Schema modification jobs don't happen often (1-2 per job), so logging them at info level will help with troubleshooting.

Member

SGTM
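
For context on the debug-to-info change discussed above, here is a small sketch using Python's standard logging module. It assumes a logger whose effective level is INFO; the logger name and job name are placeholders, not Beam's own. Under that assumption, a debug call is filtered out and an info call is emitted.

```python
import io
import logging

# Hypothetical logger name; Beam modules define their own _LOGGER.
logger = logging.getLogger("schema_mod_example")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output isolated

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
logger.addHandler(handler)

job_name = "schema_mod_job_example"  # placeholder value

# Below the INFO threshold, so this record is dropped.
logger.debug("Triggering schema modification job %s", job_name)
# At the INFO threshold, so this record reaches the handler.
logger.info("Triggering schema modification job %s", job_name)

captured = stream.getvalue()
```

Only the info-level record ends up in captured.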

@pabloem
Member

pabloem commented Aug 3, 2021

Run PythonDocker PreCommit

@pabloem
Member

pabloem commented Aug 3, 2021

change LGTM

@pabloem pabloem merged commit 9ce826e into apache:master Aug 3, 2021
@satybald satybald deleted the satybald/update-schema-source-format branch August 3, 2021 17:27
@lukecwik
Member

I believe this change broke the Python postcommit; filed: https://issues.apache.org/jira/browse/BEAM-12765

The postcommit is failing with error messages like "Cannot access field fruit on a value with type ARRAY<STRUCT<fruit STRING>>".

@pabloem
Member

pabloem commented Aug 16, 2021

sorry about that. I'll revert this.

pabloem added a commit to pabloem/beam that referenced this pull request Aug 16, 2021
…ith update schema source format "

This reverts commit 9ce826e, reversing
changes made to b20a42e.
@chunyang
Contributor

Haven't tested, but looks like changing nested_field's mode from REPEATED to REQUIRED or NULLABLE should fix the test? Or change the query in the failing test to select nested_field[OFFSET(0)].fruit?
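
The two suggestions above can be illustrated with a plain-Python analogy, offered as a sketch and not as the failing test itself. A REPEATED RECORD column is modeled as a list of dicts (ARRAY<STRUCT<fruit STRING>>), and a NULLABLE or REQUIRED RECORD column as a single dict (STRUCT<fruit STRING>). The row values and function names are illustrative assumptions.

```python
# REPEATED RECORD modeled as a list of structs.
repeated_row = {"nested_field": [{"fruit": "Apple"}]}
# NULLABLE or REQUIRED RECORD modeled as a single struct.
single_row = {"nested_field": {"fruit": "Apple"}}


def read_fruit_directly(row):
    """Mimics `SELECT nested_field.fruit`: valid only on a single struct."""
    return row["nested_field"]["fruit"]


def read_first_fruit(row):
    """Mimics `SELECT nested_field[OFFSET(0)].fruit`: index first, then read."""
    return row["nested_field"][0]["fruit"]


try:
    read_fruit_directly(repeated_row)
    direct_access_failed = False
except TypeError:
    # A list cannot be indexed by field name, analogous to BigQuery's
    # "Cannot access field fruit on a value with type ARRAY<...>".
    direct_access_failed = True

first_fruit = read_first_fruit(repeated_row)
single_fruit = read_fruit_directly(single_row)
```

Either change removes the mismatch: indexing into the array before reading the field, or making the column a single struct so that direct field access is valid.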

@satybald satybald restored the satybald/update-schema-source-format branch August 17, 2021 22:02
@satybald
Contributor Author

@pabloem @lukecwik sorry folks, I apparently ran the integration tests with DirectRunner instead of TestDirectRunner.

Could I get a review for PR #15352?
