From 23db15eb8567fcb79f2615efa5e022629199221b Mon Sep 17 00:00:00 2001
From: Sandeep Kamath
Date: Thu, 23 Mar 2023 19:49:57 +0530
Subject: [PATCH 1/2] Remove suggestion to use cuDNN benchmark in docs

---
 docs/source/en/optimization/fp16.mdx | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/docs/source/en/optimization/fp16.mdx b/docs/source/en/optimization/fp16.mdx
index c18cefbde6a9..94dda03e15e5 100644
--- a/docs/source/en/optimization/fp16.mdx
+++ b/docs/source/en/optimization/fp16.mdx
@@ -20,7 +20,6 @@ We'll discuss how the following settings impact performance and memory.
 | ---------------- | ------- | ------- |
 | original | 9.50s | x1 |
 | cuDNN auto-tuner | 9.37s | x1.01 |
-| fp16 | 3.61s | x2.63 |
 | channels last | 3.30s | x2.88 |
 | traced UNet | 3.21s | x2.96 |
 | memory efficient attention | 2.63s | x3.61 |
@@ -31,18 +30,6 @@ We'll discuss how the following settings impact performance and memory.
 steps.
 
 
-## Enable cuDNN auto-tuner
-
-[NVIDIA cuDNN](https://developer.nvidia.com/cudnn) supports many algorithms to compute a convolution. Autotuner runs a short benchmark and selects the kernel with the best performance on a given hardware for a given input size.
-
-Since we’re using **convolutional networks** (other types currently not supported), we can enable cuDNN autotuner before launching the inference by setting:
-
-```python
-import torch
-
-torch.backends.cudnn.benchmark = True
-```
-
 ### Use tf32 instead of fp32 (on Ampere and later CUDA devices)
 
 On Ampere and later CUDA devices matrix multiplications and convolutions can use the TensorFloat32 (TF32) mode for faster but slightly less accurate computations. By default PyTorch enables TF32 mode for convolutions but not matrix multiplications, and unless a network requires full float32 precision we recommend enabling this setting for matrix multiplications, too. It can significantly speed up computations with typically negligible loss of numerical accuracy. You can read more about it [here](https://huggingface.co/docs/transformers/v4.18.0/en/performance#tf32). All you need to do is to add this before your inference:

From 75f47e224b6c142f114ad16f369931e4c9d69b8f Mon Sep 17 00:00:00 2001
From: Sandeep Kamath
Date: Thu, 23 Mar 2023 20:01:06 +0530
Subject: [PATCH 2/2] removing the wrong line

---
 docs/source/en/optimization/fp16.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/fp16.mdx b/docs/source/en/optimization/fp16.mdx
index 94dda03e15e5..9d7c0234cdda 100644
--- a/docs/source/en/optimization/fp16.mdx
+++ b/docs/source/en/optimization/fp16.mdx
@@ -19,7 +19,7 @@ We'll discuss how the following settings impact performance and memory.
 | | Latency | Speedup |
 | ---------------- | ------- | ------- |
 | original | 9.50s | x1 |
-| cuDNN auto-tuner | 9.37s | x1.01 |
+| fp16 | 3.61s | x2.63 |
 | channels last | 3.30s | x2.88 |
 | traced UNet | 3.21s | x2.96 |
 | memory efficient attention | 2.63s | x3.61 |
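For readers skimming the diff, a minimal sketch of the two settings the touched section discusses: the TF32 switch that the surviving tf32 paragraph refers to, and the cuDNN benchmark flag whose recommendation these patches remove. It assumes a recent PyTorch build on an NVIDIA GPU; both are standard `torch.backends` flags.

```python
import torch

# TF32 on Ampere and later GPUs: faster matrix multiplications at slightly
# reduced precision. PyTorch enables TF32 for convolutions by default, but
# not for matrix multiplications, hence the docs suggest flipping this flag.
torch.backends.cuda.matmul.allow_tf32 = True

# The flag whose recommendation these patches remove, kept here only for
# contrast: cuDNN times several convolution algorithms on first use and
# caches the fastest, which mainly helps when input shapes stay constant.
# torch.backends.cudnn.benchmark = True
```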