
Update README.md #1946

Closed

howard0su wants to merge 1 commit into ggml-org:master from howard0su:patch-1

Conversation

@howard0su
Contributor

Note on cmake version when building with CUDA.
@JohannesGaessler
Contributor

If it works that's great; I think the decision on which cmake versions are officially supported is up to @ggerganov, though. There is also the question of whether CUDA < 11.6 is officially supported.

@howard0su
Contributor Author

howard0su commented Jun 20, 2023

Also, I tested with native and OFF on Windows and I cannot tell the difference; the variance is very minor.

If you have more info regarding perf on Windows, I can help debug. My main dev box is Windows with a P100 card.

@ggerganov
Member

ggerganov left a comment

In general, we want to keep the CMake version as low as possible.

It is OK to do things like this for example:

https://github.com/ggerganov/llama.cpp/blob/2322ec223a21625dfe9bd73ee677444a98a24ac9/CMakeLists.txt#L160-L162

https://github.com/ggerganov/llama.cpp/blob/2322ec223a21625dfe9bd73ee677444a98a24ac9/CMakeLists.txt#L228-L230

But we should not bump the required CMake version for standard CPU-only builds for no good reason.

What are the pros and cons of not supporting CUDA <11.6?

@howard0su
Contributor Author

For cmake, I would vote to go back to OFF instead of native; I don't see a perf gain on Windows. @JohannesGaessler, what was your intention for the change?

@JohannesGaessler
Contributor

> What are the pros and cons of not supporting CUDA <11.6?

CUDA 11.6 seems to be when the -arch=native compilation option was added. So e.g. the Makefile, which already used that option, won't work out-of-the-box with lower CUDA versions.

> For cmake, I would vote to go back to OFF instead of native; I don't see a perf gain on Windows. @JohannesGaessler, what was your intention for the change?

I implemented an option that enables the use of CUDA f16 intrinsics. With make you can build it without issue, but with cmake the compilation failed; setting CUDA_ARCHITECTURES to native fixed this. Maybe we should make the minimum required cmake version and arch conditional on that compilation option? In any case, I think we should try to be consistent in regards to arch for make and cmake.
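Making the requirement conditional could look roughly like this in CMakeLists.txt (a sketch under assumptions: the LLAMA_CUDA_DMMV_F16 option name comes from the discussion, everything else is illustrative, and "native" for CMAKE_CUDA_ARCHITECTURES requires CMake >= 3.24):

```cmake
# Sketch: only demand the newer toolchain when the f16 option is enabled.
if (LLAMA_CUDA_DMMV_F16)
    if (CMAKE_VERSION VERSION_LESS 3.24)
        message(FATAL_ERROR "LLAMA_CUDA_DMMV_F16 needs CMake >= 3.24 for CUDA arch 'native'")
    endif()
    set(CMAKE_CUDA_ARCHITECTURES "native")
else()
    # Keep the lower floor for standard builds.
    set(CMAKE_CUDA_ARCHITECTURES "52")
endif()
```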

@slaren
Member

slaren commented Jun 20, 2023

You could set CUDA_ARCHITECTURES to the minimum arch required to use these intrinsics and remove CUDA_SELECT_NVCC_ARCH_FLAGS.
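In CMake terms, this suggestion might amount to something like the following fragment (a sketch; whether 61 is the right floor depends on the intrinsics actually used):

```cmake
# Sketch: instead of CUDA_SELECT_NVCC_ARCH_FLAGS, pin the lowest arch
# that supports the f16 intrinsics (Pascal, compute_61). CMake then
# passes the matching -gencode flags to nvcc itself.
set(CMAKE_CUDA_ARCHITECTURES 61)
```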

@slaren
Member

slaren commented Jun 20, 2023

> What are the pros and cons of not supporting CUDA <11.6?

I am not sure that there are many pros. At some point, this was required to support some devices like the NVIDIA Jetson, but that no longer seems to be the case. I guess it may save people who already have an older version installed the trouble of upgrading.

@jllllll

jllllll commented Jun 20, 2023

> What are the pros and cons of not supporting CUDA <11.6?

> CUDA 11.6 seems to be when the -arch=native compilation option was added. So e.g. the Makefile, which already used that option, won't work out-of-the-box with lower CUDA versions.

> For cmake, I would vote to go back to OFF instead of native; I don't see a perf gain on Windows. @JohannesGaessler, what was your intention for the change?

> I implemented an option that enables the use of CUDA f16 intrinsics. With make you can build it without issue, but with cmake the compilation failed; setting CUDA_ARCHITECTURES to native fixed this. Maybe we should make the minimum required cmake version and arch conditional on that compilation option? In any case, I think we should try to be consistent in regards to arch for make and cmake.

The CUDA 11.6 Release Notes mention this:

> NVCC introduced -arch=native in CUDA 11.5 update 1

Maybe I am misunderstanding the NVCC docs, but I was under the impression that the default behavior of NVCC was to compile for the currently installed GPU architectures when no architecture is specified. This is what happens on Windows when CUDA_ARCHITECTURES is set to OFF and CUDA_SELECT_NVCC_ARCH_FLAGS is removed. Is this not the case for other systems?

@JohannesGaessler
Contributor

JohannesGaessler commented Jun 21, 2023

I did some quick testing on my RTX 3090:

| arch | t/s 33b q4_0 | t/s 33b q4_0 LLAMA_CUDA_DMMV_F16=1 |
| --- | --- | --- |
| native == compute_86 | 18.78 | 20.73 |
| compute_75 | 18.56 | 20.65 |
| compute_61 | 18.55 | 20.71 |
| default == compute_52 | 18.48 | Does not compile |

The minimum arch that can run all code is compute_61 (Pascal). If there is a performance difference, it is very small. Looking at the nvcc documentation, however, it seems to me that arch should not affect performance. If I understand it correctly, the compilation is done in two stages: first, intermediate PTX code is generated during conventional compilation; then, at runtime, the actual binary code for the GPU is generated if it does not already exist. With native or the --gpu-code option the binary code is generated ahead of time. However, for llama.cpp this does not seem to make a measurable difference for the runtime:

| arch | runtime 7b q4_0 1 token [s] |
| --- | --- |
| native | 2.432 |
| compute_86 | 2.433 |
| compute_61 | 2.429 |

So my impression is that setting arch to compute_61 should work as long as we don't want to support Maxwell.

@slaren
Member

slaren commented Jun 21, 2023

My understanding is that using a lower architecture may prevent the compiler from using the newer features or instructions in the intermediate code, which in turn may cause the JIT to produce worse code for the newer architectures. But if the tests don't show any meaningful difference, it may be the best solution.

@slaren
Member

slaren commented Jun 21, 2023

If I understand correctly, the use of the Pascal intrinsics can already be disabled by undefining GGML_CUDA_DMMV_F16. So you could just check whether we are building for compute_61+ using the __CUDA_ARCH__ macro, and disable GGML_CUDA_DMMV_F16 accordingly. Then, we add an option to the cmake to allow the user to select the architecture, and set it to 61 by default.

The possible values of __CUDA_ARCH__ are described here, but essentially:

> The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.

@JohannesGaessler
Contributor

By default the f16 intrinsics are disabled; they are only used if the user compiles with LLAMA_CUDA_DMMV_F16. So if we want to support Maxwell, I think we should instead keep the default arch, which is compute_52 for CUDA 12, and switch to a higher arch only if necessary.

@howard0su
Contributor Author

Thank you everyone for the input and the good discussion. We don't need this PR anymore; it is replaced by #1959.

@howard0su howard0su closed this Jun 21, 2023
@howard0su howard0su deleted the patch-1 branch June 28, 2023 08:43