diff --git a/website/www/site/content/en/documentation/io/built-in/google-bigquery.md b/website/www/site/content/en/documentation/io/built-in/google-bigquery.md
index d49e9bac9492..f53fc5eb72f4 100644
--- a/website/www/site/content/en/documentation/io/built-in/google-bigquery.md
+++ b/website/www/site/content/en/documentation/io/built-in/google-bigquery.md
@@ -904,6 +904,115 @@
 When using `STORAGE_API_AT_LEAST_ONCE`, the `PCollection` returned by [`WriteResult.getFailedStorageApiInserts`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.html#getFailedStorageApiInserts--) contains the rows that failed to be written to the Storage Write API sink.
 
+#### Tune the Storage Write API
+
+By default, the BigQueryIO Write transform uses Storage Write API settings that
+are reasonable for most pipelines.
+
+If you see performance issues, such as stuck pipelines, quota limit errors, or
+monotonically increasing backlog, consider tuning the following pipeline
+options when you run the job:
+
+<table>
+  <tr>
+    <th>Option (Java/Python)</th>
+    <th>Description</th>
+  </tr>
+  <tr>
+    <td>
+      <p><code>maxConnectionPoolConnections</code></p>
+      <p><code>max_connection_pool_connections</code></p>
+    </td>
+    <td>
+      If the write mode is <code>STORAGE_API_AT_LEAST_ONCE</code> and the
+      <code>useStorageApiConnectionPool</code> option is <code>true</code>, this
+      option sets the maximum number of connections that each pool creates, per
+      worker and region. If your pipeline writes to many dynamic destinations
+      (more than 20), and you see performance issues or append operations
+      competing for streams, then consider increasing this value.
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <p><code>minConnectionPoolConnections</code></p>
+      <p><code>min_connection_pool_connections</code></p>
+    </td>
+    <td>
+      <p>If the write mode is <code>STORAGE_API_AT_LEAST_ONCE</code> and the
+      <code>useStorageApiConnectionPool</code> option is <code>true</code>, this
+      option sets the minimum number of connections that each pool creates
+      before any connections are shared, per worker and region.</p>
+      <p>In practice, the number of connections created up front is the smaller
+      of this option and <code>numStorageWriteApiStreamAppendClients</code> x
+      <em>destination count</em>. BigQuery initially creates that many
+      connections, and creates more only if the existing ones are overwhelmed.
+      If you have performance issues, then consider increasing this value.</p>
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <p><code>numStorageWriteApiStreamAppendClients</code></p>
+      <p><code>num_storage_write_api_stream_append_clients</code></p>
+    </td>
+    <td>
+      If the write mode is <code>STORAGE_API_AT_LEAST_ONCE</code>, this option
+      sets the number of stream append clients allocated per worker and
+      destination. For high-volume pipelines with a large number of workers,
+      a high value can cause the job to exceed the BigQuery connection quota.
+      For most low- to mid-volume pipelines, the default value is sufficient.
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <p><code>storageApiAppendThresholdBytes</code></p>
+      <p><code>storage_api_append_threshold_bytes</code></p>
+    </td>
+    <td>
+      Maximum size, in bytes, of a single append to the Storage Write API (best
+      effort).
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <p><code>storageApiAppendThresholdRecordCount</code></p>
+      <p><code>storage_api_append_threshold_record_count</code></p>
+    </td>
+    <td>
+      Maximum record count of a single append to the Storage Write API (best
+      effort).
+    </td>
+  </tr>
+  <tr>
+    <td>
+      <p><code>storageWriteMaxInflightRequests</code></p>
+      <p><code>storage_write_max_inflight_requests</code></p>
+    </td>
+    <td>Expected maximum number of in-flight messages per connection.</td>
+  </tr>
+  <tr>
+    <td>
+      <p><code>useStorageApiConnectionPool</code></p>
+      <p><code>use_storage_api_connection_pool</code></p>
+    </td>
+    <td>
+      <p>If <code>true</code>, enables multiplexing mode, where multiple tables
+      can share the same connection. This mode is only available when the write
+      mode is <code>STORAGE_API_AT_LEAST_ONCE</code>. Consider enabling
+      multiplexing if your write operation creates 20 or more connections.</p>
+      <p>If you enable multiplexing, consider setting the following options to
+      tune the number of connections created by the connection pool:</p>
+      <ul>
+        <li><code>minConnectionPoolConnections</code></li>
+        <li><code>maxConnectionPoolConnections</code></li>
+      </ul>
+      <p>For more information, see
+      Connection pool management in the BigQuery documentation.</p>
+    </td>
+  </tr>
+</table>
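
The sizing rule described for `minConnectionPoolConnections` can be illustrated with a little arithmetic. The following is a minimal sketch only: the function name `initial_pool_connections` and the sample values are hypothetical and are not part of the Beam or BigQuery APIs. It assumes the initial connection count is the smaller of the configured minimum and the number of append clients multiplied by the number of destinations, as stated in the table.

```python
def initial_pool_connections(min_pool_connections, append_clients, destination_count):
    """Estimate the connections a pool opens before it starts sharing them.

    Hypothetical helper (not a Beam API). It applies the rule from the table:
    the smaller of minConnectionPoolConnections and
    numStorageWriteApiStreamAppendClients x destination count,
    per worker and region.
    """
    return min(min_pool_connections, append_clients * destination_count)


# Illustrative values only: 30 dynamic destinations, 1 append client each.
small_minimum = initial_pool_connections(2, 1, 30)
large_minimum = initial_pool_connections(50, 1, 30)
print(small_minimum, large_minimum)
```

With the small minimum, the option itself is the limiting factor. With the large minimum, the product of append clients and destinations caps the initial count, so raising the minimum further would have no effect under this rule.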
+
 #### Quotas
 
 Before using the Storage Write API, be aware of the