RVV1.0 Supported tests for RISC-V #16682
Conversation
@alitariq4589 Any progress on this? Also, it looks like the runner has gone offline...
@CISC There was a minor internet outage. It is fixed now. Can you check and let me know if the board still shows as offline? If so, I will restart the container.
I ordered and integrated 9 RISC-V boards with RVV1.0 so all tests can run smoothly, and I am currently verifying that ccache works by running multiple test builds. Here is the build. As soon as I am done testing, I will open the PR for review. I will also need a single runner token to add all those boards to the llama.cpp repository once I open the PR for review. It would be great if we had some faster means of communication than issues and emails. Do you use another messaging platform (Discord, Mastodon, etc.)? If it is okay with you, you can also join this Discord server.
I cancelled all the old jobs, but there are currently 2 new ones queued and not picked up yet.
Great, ping Georgi when you do.
Sorry, email only.
@CISC I have restarted the runner and it is picking up jobs again.
Corrections included:
1. Renamed the tests from debian to ubuntu, as Ubuntu is more stable than Debian Trixie.
2. Added an explicit compiler to the cmake command, as GCC versions below 14 have been observed to throw errors with rvv1.0 and some other extensions.
3. Added dependencies that are not installed by default in RISC-V Ubuntu 24.04.
4. Gave each job a separate ccache directory, since the jobs produce different cache contents and sharing one cache can stop ccache from working.
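A minimal sketch of the per-job ccache setup described in correction 4. The job name and base path here are hypothetical placeholders, not the exact values used in the workflow:

```shell
# Hypothetical job id; in the real workflow this would come from the CI matrix.
JOB_NAME="debian-cpu-cmake-rv64-native"
# Each job writes to its own cache directory so differently-configured
# builds don't overwrite each other's cached objects.
export CCACHE_DIR="${HOME}/.ccache-ci/${JOB_NAME}"
mkdir -p "${CCACHE_DIR}"
echo "${CCACHE_DIR}"
```

With `CCACHE_DIR` exported before the cmake/make step, ccache statistics (`ccache -s`) can then be inspected per job to confirm cache hits on reruns.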
@CISC This PR is ready for review. The results of the added builds can be seen here in my fork. There are multiple attempts, so you can check each of them. As the number of boards integrated is greater than the number of tests ported for RISC-V (since there may be some builds in the queue, these boards will offload some of that time), the ccache effect is not immediately visible in the 4 attempts, but I have tested individually and so far, according to the stats, ccache seems to be working (also check this and this job's builds for ccache results, where I executed each job multiple times). The following RISC-V boards will be integrated once I get the token.
For checking resource utilization, I have set up Grafana to track usage. This site tracks the usage of the host machine, not the containers in which the builds run. Use the following links to view resource usage.
NOTE: Since our network engineer is out of the office for the next couple of days, @ggerganov, please share a GitHub runner token when you can, and I will use it to register all these boards for builds. The token is valid for one hour after generation, so let me know as soon as you generate it. You can send it to me at my email. Let me know if anyone has any questions 🙂
ggerganov
left a comment
@alitariq4589 Sending you the token in a min
<<<<<<< HEAD
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=${LLAMA_FATAL_WARNINGS:-ON} -DLLAMA_CURL=ON"
=======
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON -DGGML_SCHED_NO_REALLOC=ON"
>>>>>>> master
Let me check.
BTW, the background of this change is that, because of a warning (see the very first comment), tests were failing, so I had to turn "warnings as errors" off for CI to pass. I think we also need to create an issue and ping the contributor about this. Maybe the RVV1.0 intrinsics changed somewhere, which is causing this error.
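For context, the change makes the flag user-overridable via the shell default-expansion idiom. A minimal sketch of how that default behaves (the `echo` is for illustration only):

```shell
# ${LLAMA_FATAL_WARNINGS:-ON} expands to ON unless the caller already set
# the variable, so CI keeps warnings-as-errors by default while a user can
# opt out, e.g.: LLAMA_FATAL_WARNINGS=OFF bash ci/run.sh
CMAKE_EXTRA="-DLLAMA_FATAL_WARNINGS=${LLAMA_FATAL_WARNINGS:-ON} -DLLAMA_CURL=ON"
echo "${CMAKE_EXTRA}"
```

With the variable unset, this prints `-DLLAMA_FATAL_WARNINGS=ON -DLLAMA_CURL=ON`; exporting `LLAMA_FATAL_WARNINGS=OFF` beforehand flips only that flag.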
@xctan Could you take a look at this compile warning?
```
In file included from ../../../ggml/src/ggml-cpu/arch/riscv/quants.c:6:
../../../ggml/src/ggml-cpu/simd-mappings.h: In function 'riscv_compute_fp16_to_fp32':
../../../ggml/src/ggml-cpu/simd-mappings.h:101:9: error: ISO C does not support the '_Float16' type before C23 [-Werror=pedantic]
  101 |         _Float16 hf;
      |         ^~~~~~~~
../../../ggml/src/ggml-cpu/simd-mappings.h: In function 'riscv_compute_fp32_to_fp16':
../../../ggml/src/ggml-cpu/simd-mappings.h:108:9: error: ISO C does not support the '_Float16' type before C23 [-Werror=pedantic]
  108 |         _Float16 hf = (_Float16)f;
      |         ^~~~~~~~
../../../ggml/src/ggml-cpu/simd-mappings.h:108:24: error: ISO C does not support the '_Float16' type before C23 [-Werror=pedantic]
  108 |         _Float16 hf = (_Float16)f;
      |                       ^~~~~~~~
cc1: all warnings being treated as errors
```
@ggerganov Strangely, I don't see conflicts after merging. I don't see conflicts here on the GitHub UI either. Are you testing this on an older commit? I just merged the master branch into my branch and it seems okay.
You have committed the merge conflict text, it just needs cleaning up.
The _Float16 type seems to be the only way to get the compiler's built-in code generation to work for the Zfh and Zvfh extensions. The catch is, this type is only available starting with the ISO C23 standard. Otherwise, we'd have to resort to inline assembly, which isn't ideal for register usage. Plus, the vector intrinsic functions also need this type, and float16_t is exclusively defined in C++ headers (since C++23). So, given all that, maybe we can just disable this diagnostic when _Float16 types are being used?
> So, given all that, maybe we can just disable this diagnostic when _Float16 types are being used?
Sounds good
@CISC @ggerganov Thanks for sharing the token. Can you please confirm that the following added runners show as online in the repository settings? jupiter-16G-1
You tell me. :) If it is of no further value to you it can be removed.
@CISC I am sorry for the inconvenience. I will now upgrade the runner to the latest version, and (hopefully) all these issues will be resolved. Based on what I assessed over the past couple of days, they are happening because of the following discrepancies.
I will start the upgrade process. Hopefully there won't be any interruptions, as I will upgrade the runners one by one. I should be done in 5-6 hours if no problem is encountered. I will add a comment once I am done.
I would also like to mention that PyTorch is now released for RISC-V. I have set up a workflow which fetches the source code from the official upstream, then builds and releases it without any changes. That means RISC-V workflows which require PyTorch can now be added. PyTorch release link: https://github.com/alitariq4589/pytorch-riscv/releases
Nice, thanks for following up. :)
@CISC, I noticed that there are several jobs in the queue. I was waiting for the runner to complete all the jobs so I could patch it, but looking at the logs, I don't think that is going to happen 😅. If I patch the runner in this state, the running job will be cancelled. Is it okay if I cancel the jobs running on the boards in order to patch them? Alternatively, you can tell me a specific time to upgrade the packages when you think it will not affect the CI and PRs considerably.
Yeah, quite unlikely at this point. :D
Just go ahead, if it's crucial for a PR we can rerun them.
@CISC All the runners are now upgraded to version 2.331.0. Hopefully, this will resolve all the disconnection/cancellation issues. The only remaining concern is job cancellation when a new version of the GitHub Runner launches. I will keep watch, but let me know if you observe any cancellations. Additionally, I will be taking a backup of the images tomorrow, so in case of corruption it may be possible to restore them. Since backing up with ccache consumes a lot of space, I will clear ccache before taking the backup, so the runners will need a few CI builds to fill the cache again. Let me know if you see any anomalies.
@alitariq4589 Got a few in a row now:
@CISC Thanks for informing me and keeping an eye on these errors. I am sorry again for the inconvenience. I have found the cause of the issue: the runner checks for a newer version of GitHub Actions and cannot find one because of the unknown (RISC-V) ISA.
I will add a patch for this and let you know once the issue is solved. Since it still tries to check for a new version even after the latest one is added, I will disable the update check entirely within the GitHub Actions runner source code.
@CISC Thanks a lot for informing me about the cancellations. I have just updated all the runners with a new patch of the GitHub runner source code which disables the auto-updater (here is the workflow if you would like to have a look yourself). I also added pending-restart functionality so the runner containers restart only when no jobs are running, but because GitHub Actions picks up jobs quickly from the queue, you may have seen some cancellations. I hope this will solve all the cancellation issues, but if they do appear, please let me know.
@alitariq4589 https://github.com/ggml-org/llama.cpp/actions/runs/22139370809/job/63999404175?pr=19660
This is a strange kind of error that I hadn't encountered before. I think it is more related to the GitHub server side than the runner side.
I also noticed that two runners were not properly upgraded. I have updated their files, and they will automatically restart once no jobs are running.
Yep, probably; we had some other strange failures as well, I think GitHub had a little hiccup.
@CISC Did you notice any issues after that (disconnections, failures, cancellations, etc. due to the GitHub Runner package)?
Just a few infrequent weird failures, hard to tell why; otherwise all good:
@alitariq4589 Ok, it's been happening a lot today: |
Thank you for pointing that out. I am checking the logs. Can you also provide me with other jobs that had this kind of failure? I am trying to find similarities between these failures in the debug logs.
Sure, here's another:
Also this one, at Post Clone:
This is a segmentation fault (error code 139), as I have seen in the diagnostic logs. As far as I understand this behavior, it is again coming from .NET. I have filed an issue against the .NET release for RISC-V; let's see what they say. One other thing: the Ubuntu image I am using is not an official one from Canonical (when I created the image for the GitHub runner, Canonical had not yet released any RISC-V image). I will change the image inside every container to the official Ubuntu LTS image, but it is going to take some time (around a week) because it is a manual effort for every container. I will get back to you as soon as I finish.
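As a side note on the exit code: POSIX shells report a signal-killed process as 128 plus the signal number, so 139 = 128 + 11 (SIGSEGV). A quick way to confirm the convention:

```shell
# Spawn a child shell that sends SIGSEGV to itself; the parent shell
# then observes exit status 128 + 11 = 139.
sh -c 'kill -SEGV $$' 2>/dev/null
echo "exit status: $?"
```

This only confirms how the status is encoded; the actual crash cause still has to come from the diagnostic logs or a core dump.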
@alitariq4589 No space left on device:
@CISC I have freed up some space on the device. Also, @luhenry and I have set up RISC-V CI infrastructure as a GitHub App with ephemeral runners under the RISE CI enablement project. So instead of manually added runners, you will be able to install the app, and it will allocate a runner. To resolve all these issues, the runners added to llama.cpp will soon be moved to the RISE pool of GitHub runners. I will inform you about the exact downtime once we have figured out the best approach to migrate the CI runners to the RISE GitHub App.
@CISC You can find all the information about these RISE RISC-V Runners at https://riseproject-dev.github.io/riscv-runner/. The announcement is here.
@CISC We will start the migration process to move the RISC-V machines to RISE runners as discussed above. This will be a rolling migration (boards will be taken down and added to the GitHub App one after the other). It will not cause considerable downtime, but running jobs on the board under migration will be cancelled. Before the process starts, can you please install the RISE RISC-V Runners app so that the added boards automatically pick up new jobs and downtime is prevented? Note that this app is configured to be installed at the organization level, not the individual-repository level. Let me know once you have installed the GitHub App.
cc/ @ggerganov |
The app has been added to the
@CISC We (@luhenry and I) are migrating the boards from conventional runners to the GitHub App, which is now installed. You may see some cancellations, because waiting for the boards to become free would take too long. You can rerun the jobs, and they will automatically be scheduled on the newly added runners. I will ping here once the process is complete.
@CISC the migration is mostly complete. All the Jupiter boards have been migrated. We only have a Banana Pi left, but it doesn't seem critical. There is still the ccache issue; I hope to make progress on it this week.
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Added RISC-V supported tests
* Added default value for LLAMA_FATAL_WARNINGS and option to specify by user
* Removed apt prompt
* Added RISC-V specific tests with corrections. Corrections included: 1. Changed the test names from debian to ubuntu as it is more stable than Debian Trixie 2. Added explicit compiler in cmake command as GCC compiler below version 14 have been recorded to throw errors with rvv1.0 and some other extensions 3. Added dependencies which are not installed by default in the RISC-V Ubuntu 24.04 4. Separate ccache directory for all jobs as all the ccache results are not the same and may cause ccache to not work
* Resolved the merge conflict and cleaned up run.sh
* Update ci/run.sh (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* Removed previously added build ci for RISC-V
* Removed trailing whitespaces
* corrected build name (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* cleanup
* Enabled build tests (1) (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* Enabled build tests (2) (Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>)
* enable openssl

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>



This PR adds the tests supported for RISC-V.
Tests which are added for execution with RISC-V:

* debian-cpu-cmake-rv64-native
* debian-trixie-cmake-sanitizer-riscv64-native (× 3)
* debian-trixie-llguidance-riscv64-native
* debian-trixie-cmake-rpc-riscv64-native
* ggml-ci-riscv64-native-cpu-low-perf

Dependencies which are not yet supported for RISC-V:

* rocblas-dev
* hipblas-dev
* vulkan-sdk
* mthreads/musa:rc4.3.0-devel-ubuntu22.04-amd64
* intel-oneapi-compiler-dpcpp-cpp
* intel-oneapi-mkl-devel
* cuda
* macOS*
* windows*
* torch

Tests which are not added for RISC-V due to the above unmet dependencies:

* macOS-latest-cmake-arm64
* macOS-latest-cmake-x64
* macOS-latest-cmake-arm64-webgpu
* ubuntu-24-cmake-vulkan
* ubuntu-22-cmake-webgpu
* ubuntu-22-cmake-hip
* ubuntu-22-cmake-musa
* ubuntu-22-cmake-sycl
* ubuntu-22-cmake-sycl-fp16
* macOS-latest-cmake-ios
* macOS-latest-cmake-tvos
* macOS-latest-cmake-visionos
* macOS-latest-swift
* windows-msys2
* windows-latest-cmake
* ubuntu-latest-cmake-cuda
* windows-2022-cmake-cuda
* windows-latest-cmake-sycl
* windows-latest-cmake-hip
* ios-xcode-build
* android-build
* openEuler-latest-cmake-cann
* ggml-ci-x64-nvidia-cuda
* ggml-ci-x64-nvidia-vulkan-cm
* ggml-ci-x64-nvidia-vulkan-cm2
* ggml-ci-x64-cpu-amx
* ggml-ci-mac-metal
* ggml-ci-mac-vulkan
* ggml-ci-arm64-cpu-high-perf-sve
* ggml-ci-riscv64-native-cpu-high-perf

Additional Notes
Note 1
Due to a warning (treated as an error) related to the RISC-V simd mappings, -DLLAMA_FATAL_WARNINGS=ON has to be turned off for all the tests, otherwise CI fails. (This can be made a separate issue; I can track down the contributor and ping him if you want.)
Note 2
One RISC-V board may not be optimal for running all these tests, so more boards are expected to arrive by the end of the RISC-V Summit North America (around mid-November). Until then, this PR can be treated as a draft for review.