[Misc] CUDA graph 5: Add fatbin #464
Conversation
Add condition_kernel_fatbin.h, condition_kernel.cu, build script, and fatbin docs. cuda_graph_manager tries the fatbin before falling back to JIT linking with libcudadevrt.a. Made-with: Cursor
…fatbin Remove duplicate condition_kernel.cu (which had a single-indirection signature bug) and point the fatbin build script at the existing graph_do_while_cond.cu which has the correct double-indirection matching the calling code in graph_manager.cpp.
The pre-built fatbin covers all SM architectures that support conditional graph nodes (sm_90, sm_100, sm_120). The JIT fallback (embedded PTX + runtime linking with libcudadevrt.a) added complexity for a marginal benefit and required the CUDA toolkit on the user's system.
```cpp
  driver.link_add_file(link_state, /*CU_JIT_INPUT_LIBRARY=*/4,
                       cudadevrt_path.c_str(), 0, nullptr, nullptr);
  uint32_t ret = driver.module_load_data.call(&cond_kernel_module_,
```
To Opus: what does `driver.module_load_data.call(&cond_kernel_module_, kConditionKernelFatbin);` do?
Opus response:
It calls cuModuleLoadData(module_ptr, fatbin_data), which loads the fatbin into a CUDA module. The key difference is .call() vs operator():
• .call() (line 63) — calls the CUDA driver function and returns the error code as uint32. The caller checks the result.
• operator() (line 91) — calls the function and fatally errors (QD_ERROR_IF) if it fails.
The fatbin code uses .call() so it can check the return value and give a specific error message mentioning the SM version and the regeneration script, rather than getting a
generic CUDA driver error.
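The split between the two conventions can be sketched in Python; this is a hypothetical mirror of the C++ wrapper described above (the class and function names are illustrative, not the real quadrants API):

```python
# Hypothetical Python mirror of the C++ driver-function wrapper: .call()
# returns the raw error code, while __call__ treats failure as fatal.
class DriverFn:
    def __init__(self, fn):
        self._fn = fn

    def call(self, *args):
        # Like .call(): hand the error code back to the caller.
        return self._fn(*args)

    def __call__(self, *args):
        # Like operator(): any nonzero code is a fatal error (the QD_ERROR_IF path).
        ret = self._fn(*args)
        if ret != 0:
            raise RuntimeError(f"CUDA driver call failed with code {ret}")

# A fake driver entry point that always fails with code 300
# (CUDA_ERROR_INVALID_SOURCE, what a corrupt fatbin would produce).
module_load_data = DriverFn(lambda *args: 300)

ret = module_load_data.call(b"fatbin-bytes")
if ret != 0:
    # The caller can now compose a specific message (SM version, regen script).
    print(f"fatbin load failed with code {ret}; regenerate with the build script")
```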
```bash
REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
SRC="${REPO_ROOT}/quadrants/runtime/cuda/graph_do_while_cond.cu"
OUT_DIR="${REPO_ROOT}/quadrants/runtime/cuda"
OBJ=$(mktemp /tmp/condition_kernel.XXXXXX.o)
```
Question to Opus: will the SASS also work on Windows, if we built it on Linux?
Opus response:
Yes. SASS is GPU machine code — it runs on the GPU, not the CPU. The GPU doesn't care what OS the host is running. The same sm_90 SASS runs identically on an H100 whether the
host is Linux or Windows. cuModuleLoadData loads the fatbin into the GPU the same way on both platforms.
This is actually another advantage of the fatbin approach over the old JIT path — the JIT path needed platform-specific logic to find libcudadevrt.a (Linux) vs cudadevrt.lib
(Windows), which is exactly the code we deleted in the merge conflict. The fatbin just works everywhere.
```bash
FATBIN=$(mktemp /tmp/condition_kernel.XXXXXX.fatbin)

cleanup() { rm -f "${OBJ}" "${FATBIN}"; }
trap cleanup EXIT
```
To Opus: under what conditions does this run?
from Opus:
trap cleanup EXIT runs the cleanup function whenever the shell exits, regardless of the reason:
• Script finishes normally
• set -e triggers an early exit on error (e.g. nvcc fails)
• Explicit exit call
• Signal-based termination (SIGINT/Ctrl-C, SIGTERM, etc. — because these cause the shell to exit, which triggers the EXIT trap)
So the temp files (${OBJ}, ${FATBIN}) are cleaned up in all cases. It's the standard pattern for temp file cleanup in bash scripts.
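For comparison, the same guarantee in Python comes from a context manager rather than a trap; here is a minimal sketch with a simulated nvcc failure:

```python
import tempfile
from pathlib import Path

created = []  # record the temp path so we can inspect it after the failure

def build():
    # TemporaryDirectory plays the role of `trap cleanup EXIT`: the directory
    # is removed on normal return, on an exception, and on sys.exit().
    with tempfile.TemporaryDirectory() as tmpdir:
        obj = Path(tmpdir) / "condition_kernel.o"
        obj.write_bytes(b"")  # stand-in for the nvcc output
        created.append(obj)
        raise RuntimeError("simulated nvcc failure")

try:
    build()
except RuntimeError:
    pass

print(created[0].exists())  # False: the temp file is gone despite the error
```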
I'd prefer this script in Python. How would it look?
from Opus:
```python
#!/usr/bin/env python3
"""Builds the pre-compiled fatbin for the graph_do_while condition kernel.

Requires the CUDA toolkit (nvcc) to be installed.

The output is a C header (graph_do_while_cond_fatbin.h) containing the fatbin
as a byte array, which is compiled into the quadrants binary. At runtime
this lets us load the condition kernel without needing libcudadevrt.a on
the user's system.

Usage:
    python scripts/build_condition_kernel_fatbin.py

See docs/source/user_guide/building_cudagraph_conditional_fatbin.md
"""
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent
SRC = REPO_ROOT / "quadrants" / "runtime" / "cuda" / "graph_do_while_cond.cu"
OUT_HEADER = REPO_ROOT / "quadrants" / "runtime" / "cuda" / "graph_do_while_cond_fatbin.h"

SM_TARGETS = [
    "-gencode=arch=compute_90,code=sm_90",
    "-gencode=arch=compute_100,code=sm_100",
    "-gencode=arch=compute_120,code=sm_120",
]


def find_nvcc() -> str:
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    for candidate in ["/usr/local/cuda/bin/nvcc"]:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.")


def run(cmd: list[str]) -> None:
    print(f" {' '.join(cmd)}")
    subprocess.check_call(cmd)


def generate_header(fatbin_path: Path, out_path: Path) -> None:
    data = fatbin_path.read_bytes()
    lines = [
        "// Auto-generated by scripts/build_condition_kernel_fatbin.py",
        "// Do not edit manually. Regenerate with that script.",
        "//",
        "// Source: quadrants/runtime/cuda/graph_do_while_cond.cu",
        f"// Size: {len(data)} bytes",
        "",
        "#pragma once",
        "",
        "#include <cstddef>",
        "",
        "static const unsigned char kConditionKernelFatbin[] = {",
    ]
    for i in range(0, len(data), 16):
        chunk = data[i : i + 16]
        lines.append(" " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append("")
    lines.append(f"static const std::size_t kConditionKernelFatbinSize = {len(data)};")
    lines.append("")
    out_path.write_text("\n".join(lines))
    print(f"Wrote {out_path} ({len(data)} bytes)")


def main() -> None:
    nvcc = find_nvcc()
    with tempfile.TemporaryDirectory() as tmpdir:
        obj = Path(tmpdir) / "condition_kernel.o"
        fatbin = Path(tmpdir) / "condition_kernel.fatbin"
        print("Compiling ...")
        run([nvcc, "-dc", "-rdc=true", *SM_TARGETS, str(SRC), "-o", str(obj)])
        print("Device-linking with libcudadevrt ...")
        run([nvcc, "-dlink", *SM_TARGETS, str(obj), "-lcudadevrt", "-fatbin", "-o", str(fatbin)])
        print("Generating C header ...")
        generate_header(fatbin, OUT_HEADER)
        print("Done.")


if __name__ == "__main__":
    main()
```
```python
        fatbin = Path(tmpdir) / "condition_kernel.fatbin"

        print("Compiling ...")
        run([nvcc, "-dc", "-rdc=true", *SM_TARGETS, str(SRC), "-o", str(obj)])
```
question to Opus: what does this do?
Opus response:
This is the device linking step. It resolves the cudaGraphSetConditional extern in the compiled object.

The flow is two stages:

- `nvcc -dc -rdc=true` — compiles graph_do_while_cond.cu into a relocatable device object. The call to cudaGraphSetConditional is left as an unresolved extern (because `-rdc=true` means "I'll link later").
- `nvcc -dlink` — takes that object and links it against `-lcudadevrt` (which provides the cudaGraphSetConditional implementation), producing a self-contained fatbin with the extern fully resolved. The `-fatbin` flag says to output a fat binary containing SASS for all the SM targets.

After this step, the fatbin has no remaining external dependencies — cudaGraphSetConditional is baked in. That's why we don't need libcudadevrt.a at runtime.
what is the meaning of each command-line option?
Opus response:
For the -dlink command:
• -dlink — perform device linking (resolve cross-module device function references)
• *SM_TARGETS — the -gencode flags specifying which SM architectures to produce SASS for (sm_90, sm_100, sm_120)
• str(obj_path) — input: the relocatable device object from the compile step
• -lcudadevrt — link against NVIDIA's device runtime library (provides cudaGraphSetConditional)
• -fatbin — output a fatbin (fat binary containing SASS for multiple architectures) rather than a single cubin
• -o str(fatbin_path) — output file path
And for the -dc compile command:
• -dc — compile to device-only relocatable object (no host code, externs left unresolved)
• -rdc=true — enable relocatable device code (allows cross-module device function calls, required for calling cudaGraphSetConditional which lives in a separate library)
• *SM_TARGETS — same gencode flags
• str(SRC) — the .cu source file
• -o str(obj_path) — output file path
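As a side note, the three `-gencode` strings do not have to be written by hand; they can be derived from a plain list of SM versions (a small sketch, producing the same flags this script uses):

```python
# Derive the nvcc -gencode flags from a plain list of SM versions.
SM_VERSIONS = [90, 100, 120]  # architectures with conditional-node support
SM_TARGETS = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS]

for flag in SM_TARGETS:
    print(flag)
# -gencode=arch=compute_90,code=sm_90
# -gencode=arch=compute_100,code=sm_100
# -gencode=arch=compute_120,code=sm_120
```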
```python
def run(cmd: list[str]) -> None:
    print(f" {' '.join(cmd)}")
    subprocess.check_call(cmd)
```
what happens if this fails? Do we see the full error messages?
Opus response:
Yes — check_call inherits the parent's stdout/stderr, so nvcc's error messages print in real time as the command runs. When it fails, Python then raises CalledProcessError
which adds the exit code. So you see both the full nvcc error output and the Python traceback confirming the non-zero exit.
If we had used subprocess.run(..., capture_output=True), the error output would be hidden unless explicitly printed from the exception. The current approach is the right one for a build script.
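The difference can be shown with a child process standing in for nvcc (the failing command here is a stand-in, not a real nvcc invocation):

```python
import subprocess
import sys

# A child that writes an error to stderr and exits nonzero, like a failing nvcc.
cmd = [sys.executable, "-c",
       "import sys; print('fatal: bad arch', file=sys.stderr); sys.exit(2)"]

# check_call: the child's stderr goes straight to ours, so the error text is
# shown live; only the exit code is wrapped in the exception afterwards.
try:
    subprocess.check_call(cmd)
except subprocess.CalledProcessError as e:
    print(f"build step failed with exit code {e.returncode}")

# capture_output=True: the text is swallowed unless printed from the result.
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.returncode)      # 2
print(result.stderr.strip())  # the captured text, never shown live
```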
With the exception of the generated fatbin header file, I have read and reviewed every line added in this PR. I take responsibility for the lines added and removed in this PR, and won't blame any issues on Opus.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2b72b298b9
```c
0x33, 0x00, 0x42, 0x75, 0x69, 0x6c, 0x64, 0x20, 0x63, 0x75, 0x64, 0x61, 0x5f, 0x31, 0x32, 0x2e,
0x38, 0x2e, 0x72, 0x31, 0x32, 0x2e, 0x38, 0x2f, 0x63, 0x6f, 0x6d, 0x70, 0x69, 0x6c, 0x65, 0x72,
0x2e, 0x33, 0x35, 0x35, 0x38, 0x33, 0x38, 0x37, 0x30, 0x5f, 0x30, 0x00, 0x2d, 0x61, 0x72, 0x63,
0x68, 0x20, 0x73, 0x6d, 0x5f, 0x31, 0x30, 0x30, 0x20, 0x2d, 0x6d, 0x20, 0x36, 0x34, 0x20, 0x2d,
```
Regenerate fatbin with SM90 code image
This generated blob advertises an -arch sm_100 image (and sm_120 later in the same file) but contains no sm_90 image, while the runtime still treats all cc >= 90 devices as natively supported. On Hopper-class GPUs (SM90), cuModuleLoadData will fail and graph-do-while raises instead of running, so this change regresses a previously supported architecture. Please regenerate the checked-in fatbin with an SM90 target (or narrow the runtime support gate).
Curious. It parsed this into binary, and then interpreted that? 😲
The author agent says:
see, another agent says it is missing 90:
"[Pasted text #1 +31 lines]
"
The other agent's methodology is wrong. Using strings to look for ASCII text like "sm_90" in a binary is unreliable — the SM version is encoded in binary ELF headers, not
necessarily as an ASCII string. The strings output they found (like -arch sm_100 -m 64) are nvcc command-line fragments embedded in per-section metadata, not an exhaustive
list of architectures.
cuobjdump -lelf is the authoritative tool for inspecting fatbin contents — it actually parses the fatbin format. And it clearly shows all three:

```
ELF file    1: check_fatbin.1.sm_90.cubin
ELF file    2: check_fatbin.2.sm_100.cubin
ELF file    3: check_fatbin.3.sm_120.cubin
```

The fatbin is correct. sm_90 is present.
how can I verify the header file myself? what command?
Extract the bytes from the header and run cuobjdump:

```bash
cd /path/to/quadrants
python3 -c "
import re
from pathlib import Path
text = Path('quadrants/runtime/cuda/graph_do_while_cond_fatbin.h').read_text()
m = re.search(r'kConditionKernelFatbin\[\] = \{([^}]+)\}', text, re.DOTALL)
data = bytes(int(x.strip(), 16) for x in m.group(1).split(',') if x.strip())
Path('/tmp/check.fatbin').write_bytes(data)
" && cuobjdump -lelf /tmp/check.fatbin
```
My output:

```
[Quadrants Build] hugh@hugh5090:~/git/quadrants$ export PATH=$PATH:/usr/local/cuda/bin
[Quadrants Build] hugh@hugh5090:~/git/quadrants$ python3 -c "
import re
from pathlib import Path
text = Path('quadrants/runtime/cuda/graph_do_while_cond_fatbin.h').read_text()
m = re.search(r'kConditionKernelFatbin\[\] = \{([^}]+)\}', text, re.DOTALL)
data = bytes(int(x.strip(), 16) for x in m.group(1).split(',') if x.strip())
Path('/tmp/check.fatbin').write_bytes(data)
" && cuobjdump -lelf /tmp/check.fatbin
ELF file    1: check.1.sm_90.cubin
ELF file    2: check.2.sm_100.cubin
ELF file    3: check.3.sm_120.cubin
```
```python
    for candidate in ["/usr/local/cuda/bin/nvcc"]:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.")
```
Resolve nvcc from CUDA_HOME before failing
find_nvcc() checks only PATH and /usr/local/cuda/bin/nvcc, then exits; it never consults CUDA_HOME/CUDA_PATH. That breaks fatbin regeneration on common non-default toolkit installs where users rely on CUDA_HOME, making the new workflow unexpectedly fail even though CUDA is installed.
```python
def find_nvcc() -> str:
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    for candidate in ["/usr/local/cuda/bin/nvcc"]:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.")
```
🔴 The docs promise that setting CUDA_HOME is sufficient to find nvcc, but find_nvcc() never reads that variable — users who follow the documented instructions will hit "Error: nvcc not found" even when ${CUDA_HOME}/bin/nvcc exists. Fix by adding os.environ.get("CUDA_HOME") and os.environ.get("CUDA_PATH") to the candidates list in find_nvcc().
Extended reasoning...
What the bug is
docs/source/user_guide/building_cudagraph_conditional_fatbin.md (line 34) states: "The nvcc binary must be on your PATH, or set CUDA_HOME". This creates a clear user expectation that setting CUDA_HOME is a supported alternative to having nvcc on PATH.
The specific code path that fails
find_nvcc() in scripts/build_condition_kernel_fatbin.py (lines 31–38) does exactly two things:

1. `shutil.which("nvcc")` — searches `PATH`
2. Checks the hardcoded path `/usr/local/cuda/bin/nvcc`

It never calls `os.environ.get("CUDA_HOME")` or `os.environ.get("CUDA_PATH")`. The environment variables are simply ignored.
Why existing code doesn't prevent it
There is no fallback or warning. If nvcc isn't on PATH and /usr/local/cuda/bin/nvcc doesn't exist, the script calls sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.") — and the error message itself only mentions PATH, contradicting the docs that also mention CUDA_HOME.
Impact
This is a common setup in HPC clusters and Docker/container environments where the CUDA toolkit is installed under a non-standard prefix (e.g., /opt/cuda, /usr/local/cuda-12.4) and CUDA_HOME is set by the module system or container entrypoint, but the bin directory is not on PATH. Any developer following the documented workflow will get a misleading error.
How to fix
Add CUDA_HOME and CUDA_PATH lookups to the candidates list in find_nvcc():

```python
import os

def find_nvcc() -> str:
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    candidates = [
        os.path.join(os.environ[v], "bin", "nvcc")
        for v in ("CUDA_HOME", "CUDA_PATH")
        if v in os.environ
    ]
    candidates.append("/usr/local/cuda/bin/nvcc")
    for candidate in candidates:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH, or set CUDA_HOME.")
```

Step-by-step proof

1. User installs CUDA 12.4 under `/opt/cuda-12.4`.
2. Their shell sets `CUDA_HOME=/opt/cuda-12.4` but does not add `/opt/cuda-12.4/bin` to `PATH`.
3. User reads the docs: "set CUDA_HOME" — they believe they are set up correctly.
4. User runs `python scripts/build_condition_kernel_fatbin.py`.
5. `shutil.which("nvcc")` returns `None` (not on PATH).
6. `Path("/usr/local/cuda/bin/nvcc").exists()` returns `False` (wrong prefix).
7. `sys.exit("Error: nvcc not found...")` terminates — even though `/opt/cuda-12.4/bin/nvcc` exists.
```markdown
## Prerequisites

- CUDA toolkit with `nvcc` (12.4 or later, for conditional node support).
```
🟡 The Prerequisites section states "CUDA toolkit with nvcc (12.4 or later)" but the script targets SM 100 (Blackwell, requires CUDA 12.6+) and SM 120 (RTX 5090, requires CUDA 12.8+), so any developer with CUDA 12.4–12.5 will hit nvcc "unsupported gpu architecture" errors. The minimum version should be raised to at least 12.8 to compile all three SM targets.
Extended reasoning...
What the bug is: The Prerequisites section of docs/source/user_guide/building_cudagraph_conditional_fatbin.md (line 33) states: CUDA toolkit with nvcc (12.4 or later, for conditional node support). This version floor is too low for two of the three SM architectures targeted by the regeneration script.
The specific code path that triggers it: scripts/build_condition_kernel_fatbin.py defines SM_VERSIONS = [90, 100, 120] and expands these into three -gencode=arch=compute_N,code=sm_N flags passed to nvcc. nvcc's support for these architectures was introduced progressively:
- SM 90 (Hopper / H100): introduced in CUDA 11.8
- SM 100 (Blackwell B100/B200): introduced in CUDA 12.6
- SM 120 (Blackwell Ultra / RTX 5090 GB202): introduced in CUDA 12.8
Why existing code doesn't prevent it: The documentation simply asserts 12.4 as sufficient, and the script does no version check before invoking nvcc. Nothing catches the mismatch until nvcc itself emits the error.
What the impact would be: A developer who installs CUDA 12.4 or 12.5 and runs python scripts/build_condition_kernel_fatbin.py will see nvcc errors such as unsupported gpu architecture 'compute_100' and unsupported gpu architecture 'compute_120'. The script will abort without producing the fatbin, and the developer will have wasted time on a misleading prerequisite.
How to fix it: Change line 33 from 12.4 or later to 12.8 or later (since SM 120 requires CUDA 12.8). Optionally, add a note listing the per-architecture minimums for clarity, or add a version check in find_nvcc().
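That optional version check could be sketched as follows; the parsing helper is hypothetical, not part of the script (only the `release X.Y` line format of `nvcc --version` output is assumed):

```python
import re

MIN_CUDA = (12, 8)  # sm_120 needs CUDA 12.8+; sm_100 needs 12.6+

def parse_release(version_text: str) -> tuple[int, int]:
    # `nvcc --version` output includes a line like:
    #   Cuda compilation tools, release 12.8, V12.8.93
    m = re.search(r"release (\d+)\.(\d+)", version_text)
    if not m:
        raise ValueError("could not parse nvcc version output")
    return (int(m.group(1)), int(m.group(2)))

sample = "Cuda compilation tools, release 12.5, V12.5.40"
version = parse_release(sample)
if version < MIN_CUDA:
    print(f"nvcc {version[0]}.{version[1]} is too old for sm_120 (need >= 12.8)")
```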
Step-by-step proof:

1. Developer reads the Prerequisites section and installs CUDA 12.5 (satisfies the stated ≥12.4 requirement).
2. Developer runs `python scripts/build_condition_kernel_fatbin.py`.
3. `find_nvcc()` finds nvcc from CUDA 12.5.
4. Script calls: `nvcc -dc -rdc=true -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_120,code=sm_120 ...`
5. nvcc 12.5 knows sm_90 but does not know sm_100 or sm_120 — it was released before Blackwell support.
6. nvcc exits with error: `nvcc fatal: Unsupported gpu architecture 'compute_100'`.
7. `subprocess.check_call` raises `CalledProcessError`; no fatbin is produced.
8. Developer is confused because they followed the documented prerequisite exactly.
```python
    header_lines.append("")
    header_lines.append(f"static const std::size_t kConditionKernelFatbinSize = {len(fatbin_data)};")
    header_lines.append("")
    out_path.write_text("\n".join(header_lines))
```
🟡 The write_text call at line 68 of generate_header omits newline='\n', so on Windows Python's text mode translates every LF to CRLF, producing a generated header with Windows line endings. A Windows developer who regenerates and commits the fatbin header will introduce CRLF endings into the repo, causing git to show the entire file as changed on Linux pulls. Fix: out_path.write_text('\n'.join(header_lines), encoding='utf-8', newline='\n').
Extended reasoning...
What the bug is and how it manifests
generate_header builds the graph_do_while_cond_fatbin.h header using out_path.write_text("\n".join(header_lines)). Python's Path.write_text opens the file in text mode with the platform-default encoding and newline translation. On Windows, text-mode writes translate every LF (\n) to CRLF (\r\n). This means the generated header will have CRLF line endings whenever the script is run on Windows.
The specific code path
scripts/build_condition_kernel_fatbin.py line 68: out_path.write_text("\n".join(header_lines)). The header_lines list is joined with \n separators, and a trailing \n is appended as the last element. On Linux/macOS, write_text passes these through unmodified. On Windows, each \n becomes \r\n.
Why existing code doesn't prevent it
There is no .gitattributes file in the repo (confirmed by verifiers) to enforce LF normalization on commit. If a Windows developer has core.autocrlf=false or core.autocrlf=input, git will not normalize the line endings, and the committed file will contain CRLF endings.
Impact
Any Linux developer (or CI) that pulls after a Windows regeneration will see the entire graph_do_while_cond_fatbin.h as modified in git diff, even though no functional content changed. This creates noisy diffs and confusion about whether the fatbin was intentionally updated.
How to fix
Change the call to: out_path.write_text("\n".join(header_lines), encoding="utf-8", newline="\n"). Alternatively, use out_path.write_bytes("\n".join(header_lines).encode("utf-8")) which bypasses text-mode translation entirely.
Step-by-step proof

1. Developer on Windows runs `python scripts/build_condition_kernel_fatbin.py`.
2. `generate_header` assembles `header_lines = ["// Auto-generated ...", ..., ""]` and calls `out_path.write_text("\n".join(header_lines))`.
3. Python opens the file in text mode. Windows text mode translates each `\n` write to a `\r\n` byte sequence.
4. The written file contains `// Auto-generated ...\r\n// Do not edit ...\r\n` etc.
5. Developer runs `git diff` — the entire file shows as changed compared to the LF-only version checked in from Linux.
6. Developer commits; CI on Linux now also shows the file as modified.
```markdown
## Adding a new SM architecture

Edit the `SM_TARGETS` list in `scripts/build_condition_kernel_fatbin.py` to
add the new `-gencode` entry, then regenerate.
```
🟡 The "Adding a new SM architecture" section tells users to "Edit the SM_TARGETS list" but SM_TARGETS is a derived list auto-computed via a list comprehension — the user-editable variable is SM_VERSIONS = [90, 100, 120]. Update the docs to say "Edit the SM_VERSIONS list ... to add the new SM version number (e.g., 130)".
Extended reasoning...
What the bug is: The documentation section "Adding a new SM architecture" (line 58 of the new doc file) instructs users to "Edit the SM_TARGETS list in scripts/build_condition_kernel_fatbin.py to add the new -gencode entry, then regenerate." However, this is incorrect — SM_TARGETS is not the user-editable variable.
The specific code path: In scripts/build_condition_kernel_fatbin.py, the script defines two variables:

- Line 27: `SM_VERSIONS = [90, 100, 120]` — a simple list of integer SM version numbers, intended to be edited by the user
- Line 28: `SM_TARGETS = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS]` — a derived list of full gencode strings, auto-computed from `SM_VERSIONS` via a list comprehension
Why existing code doesn't prevent it: The doc was written referencing the original Opus-generated script (visible in the PR timeline at comment 3059546304), which had SM_TARGETS as a hand-edited list of full gencode strings. Later in the review, hughperkins asked "cant we just specify versions, rather than the entire string?" (comment 3059569723, marked resolved), which led to the refactoring that introduced SM_VERSIONS. The docs were not updated to reflect this change.
What the impact would be: A developer following these instructions to add SM 130 support would search build_condition_kernel_fatbin.py for SM_TARGETS expecting to find a list of integers. Instead they find a list comprehension expression. While they could technically append to SM_TARGETS directly (it's just a Python list at runtime), this would bypass the SM_VERSIONS design intent, and the doc's phrasing "add the new -gencode entry" matches the old style, not the current one. The correct action is to add an integer to SM_VERSIONS.
How to fix it: Update line 58 of docs/source/user_guide/building_cudagraph_conditional_fatbin.md from:

> Edit the `SM_TARGETS` list in `scripts/build_condition_kernel_fatbin.py` to add the new `-gencode` entry, then regenerate.

To:

> Edit the `SM_VERSIONS` list in `scripts/build_condition_kernel_fatbin.py` to add the new SM version number (e.g., `130`), then regenerate.
Step-by-step proof: (1) User reads doc, sees "Edit the SM_TARGETS list". (2) User opens scripts/build_condition_kernel_fatbin.py and searches for SM_TARGETS. (3) They find SM_TARGETS = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS] — a list comprehension, not a hand-edited list. (4) They are confused about how to add a new entry. (5) The correct action — adding 130 to SM_VERSIONS = [90, 100, 120] — is not mentioned anywhere in the doc.
…ne endings
- find_nvcc() now checks CUDA_HOME/CUDA_PATH env vars before fallback
- Fix CUDA toolkit prerequisite: 12.4 -> 12.8 (needed for SM 120)
- Use newline="\n" in write_text to avoid CRLF on Windows
- Docs: reference SM_VERSIONS (not SM_TARGETS) for adding architectures
Thanks!
…dd-fatbin # Conflicts: # quadrants/runtime/cuda/graph_manager.cpp
Issue: #
Brief Summary
copilot:summary
Ran unit tests on 5090 (SM120):
Note: I also ran all genesis tests:
Walkthrough
copilot:walkthrough