From 7e5b76ae4e016abbe699df77ccfea34ac681b2c4 Mon Sep 17 00:00:00 2001
From: beliefer
Date: Thu, 18 Jan 2024 20:58:33 +0800
Subject: [PATCH 1/3] [SPARK-46760][SQL][DOCS] Make the document of spark.sql.adaptive.coalescePartitions.parallelismFirst clearer

---
 docs/sql-performance-tuning.md                             | 2 +-
 .../main/scala/org/apache/spark/sql/internal/SQLConf.scala | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 4ede18d1938bf..a3dcc44f6e63c 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -267,7 +267,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
  spark.sql.adaptive.coalescePartitions.parallelismFirst
  true
- When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regression when enabling adaptive query execution. It's recommended to set this config to false and respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes.
+ When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regressions when enabling adaptive query execution. To respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes, please set this config to false.
  3.2.0
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index eb5233bfb1231..6fc6154bbed53 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -721,8 +721,8 @@ object SQLConf {
  "shuffle partitions, but adaptively calculate the target size according to the default " +
  "parallelism of the Spark cluster. The calculated size is usually smaller than the " +
  "configured target size. This is to maximize the parallelism and avoid performance " +
- "regression when enabling adaptive query execution. It's recommended to set this config " +
- "to false and respect the configured target size.")
+ "regressions when enabling adaptive query execution. To respect the configured " +
+ "target size, please set this config to false.")
  .version("3.2.0")
  .booleanConf
  .createWithDefault(true)

From a6b49cd09a0bef90fd3422b0b20383a251bda792 Mon Sep 17 00:00:00 2001
From: beliefer
Date: Wed, 31 Jan 2024 09:20:55 +0800
Subject: [PATCH 2/3] Update code

---
 docs/sql-performance-tuning.md                             | 2 +-
 .../main/scala/org/apache/spark/sql/internal/SQLConf.scala | 7 +++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index a3dcc44f6e63c..30d6cb2920d26 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -267,7 +267,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
  spark.sql.adaptive.coalescePartitions.parallelismFirst
  true
- When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regressions when enabling adaptive query execution. To respect the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes, please set this config to false.
+ When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regressions when enabling adaptive query execution. This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false.
  3.2.0
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 6fc6154bbed53..7f204a8aacf8b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -721,8 +721,11 @@ object SQLConf {
  "shuffle partitions, but adaptively calculate the target size according to the default " +
  "parallelism of the Spark cluster. The calculated size is usually smaller than the " +
  "configured target size. This is to maximize the parallelism and avoid performance " +
- "regressions when enabling adaptive query execution. To respect the configured " +
- "target size, please set this config to false.")
+ "regressions when enabling adaptive query execution. This is helpful where even small " +
+ "partitions with small data size require a large amount of computation, and so " +
+ "coalescing the small partitions reduces parallelism and harms performance. In more " +
+ "typical cases where this is not true, coalescing partitions can avoid many tiny tasks " +
+ "and improve performance, and so this config can be set to false.")
  .version("3.2.0")
  .booleanConf
  .createWithDefault(true)

From 09f1b1d49b5d5772f9c987fccab80145318d3424 Mon Sep 17 00:00:00 2001
From: beliefer
Date: Thu, 1 Feb 2024 16:15:45 +0800
Subject: [PATCH 3/3] Update code

---
 docs/sql-performance-tuning.md                        | 2 +-
 .../scala/org/apache/spark/sql/internal/SQLConf.scala | 8 +++-----
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 30d6cb2920d26..e3e5444d2a9c8 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -267,7 +267,7 @@ This feature coalesces the post shuffle partitions based on the map output stati
  spark.sql.adaptive.coalescePartitions.parallelismFirst
  true
- When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regressions when enabling adaptive query execution. This is helpful where even small partitions with small data size require a large amount of computation, and so coalescing the small partitions reduces parallelism and harms performance. In more typical cases where this is not true, coalescing partitions can avoid many tiny tasks and improve performance, and so this config can be set to false.
+ When true, Spark ignores the target size specified by spark.sql.adaptive.advisoryPartitionSizeInBytes (default 64MB) when coalescing contiguous shuffle partitions, and only respect the minimum partition size specified by spark.sql.adaptive.coalescePartitions.minPartitionSize (default 1MB), to maximize the parallelism. This is to avoid performance regressions when enabling adaptive query execution. It's recommended to set this config to false on a busy cluster to make resource utilization more efficient (not many small tasks).
  3.2.0
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index 7f204a8aacf8b..1d7b86cba9175 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -721,11 +721,9 @@ object SQLConf {
  "shuffle partitions, but adaptively calculate the target size according to the default " +
  "parallelism of the Spark cluster. The calculated size is usually smaller than the " +
  "configured target size. This is to maximize the parallelism and avoid performance " +
- "regressions when enabling adaptive query execution. This is helpful where even small " +
- "partitions with small data size require a large amount of computation, and so " +
- "coalescing the small partitions reduces parallelism and harms performance. In more " +
- "typical cases where this is not true, coalescing partitions can avoid many tiny tasks " +
- "and improve performance, and so this config can be set to false.")
+ "regressions when enabling adaptive query execution. It's recommended to set this " +
+ "config to false on a busy cluster to make resource utilization more efficient (not many " +
+ "small tasks).")
  .version("3.2.0")
  .booleanConf
  .createWithDefault(true)
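
To make the trade-off described in the config text concrete, here is a rough, self-contained Scala sketch. It is a simplified model and not Spark's implementation: the object `ParallelismFirstSketch`, its functions, and the greedy merge are hypothetical names and logic introduced for illustration. The only parts taken from the patch are the two defaults (64MB advisory size, 1MB minimum partition size) and the idea that, when the flag is true, the target size is derived from the cluster's default parallelism and is usually smaller than the advisory size.

```scala
// Simplified model of the parallelismFirst trade-off (NOT Spark's actual code).
object ParallelismFirstSketch {
  val MiB: Long = 1024L * 1024L

  /** Target partition size under this model.
    * parallelismFirst = true: spread the data over `defaultParallelism`
    * slots, capped by the advisory size and floored by the minimum size.
    * parallelismFirst = false: use the advisory size directly.
    */
  def targetSize(
      totalShuffleBytes: Long,
      defaultParallelism: Int,
      parallelismFirst: Boolean,
      advisorySize: Long = 64 * MiB,      // default quoted in the doc text
      minPartitionSize: Long = 1 * MiB    // default quoted in the doc text
  ): Long = {
    val ceiling =
      if (parallelismFirst) {
        val perSlot = math.ceil(totalShuffleBytes.toDouble / defaultParallelism).toLong
        math.min(perSlot, advisorySize)
      } else advisorySize
    math.max(ceiling, minPartitionSize)
  }

  /** Greedily merge contiguous partitions while the merged size stays
    * within `target` (an assumed merge rule, chosen for simplicity).
    */
  def coalesce(sizes: Seq[Long], target: Long): Vector[Long] =
    sizes.foldLeft(Vector.empty[Long]) { (acc, size) =>
      if (acc.nonEmpty && acc.last + size <= target) acc.init :+ (acc.last + size)
      else acc :+ size
    }
}
```

Under this model, 200 shuffle partitions of 2 MiB each on a cluster with a default parallelism of 100 coalesce to 100 partitions of 4 MiB when the flag is true, and to 7 partitions (six of 64 MiB and one of 16 MiB) when it is false. That is the "many small tasks" versus "fewer, larger tasks" choice the config description refers to.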