[Misc] CUDA graph 5 Add fatbin#464

Merged
hughperkins merged 24 commits into main from hp/cuda-graph-mvp-5-add-fatbin
Apr 10, 2026

Conversation


@hughperkins hughperkins commented Apr 9, 2026

Issue: #

Brief Summary

copilot:summary

Ran unit tests on 5090 (SM120):

Screenshot 2026-04-09 at 13 25 23 Screenshot 2026-04-09 at 13 25 26 Screenshot 2026-04-09 at 13 25 29 Screenshot 2026-04-09 at 13 25 34

Note: I also ran all genesis tests:

Screenshot 2026-04-09 at 14 03 57

Walkthrough

copilot:walkthrough

Add condition_kernel_fatbin.h, condition_kernel.cu, build script,
and fatbin docs. cuda_graph_manager tries the fatbin before falling
back to JIT linking with libcudadevrt.a.

Made-with: Cursor
…fatbin

Remove duplicate condition_kernel.cu (which had a single-indirection
signature bug) and point the fatbin build script at the existing
graph_do_while_cond.cu which has the correct double-indirection
matching the calling code in graph_manager.cpp.
The pre-built fatbin covers all SM architectures that support conditional
graph nodes (sm_90, sm_100, sm_120). The JIT fallback (embedded PTX +
runtime linking with libcudadevrt.a) added complexity for a marginal
benefit and required the CUDA toolkit on the user's system.
@hughperkins hughperkins changed the title Hp/cuda graph mvp 5 add fatbin [Misc] CUDA graph 5 Add fatbin Apr 9, 2026
Comment thread docs/source/user_guide/building_cudagraph_conditional_fatbin.md Outdated
Comment thread quadrants/runtime/cuda/graph_manager.cpp Outdated

driver.link_add_file(link_state, /*CU_JIT_INPUT_LIBRARY=*/4,
cudadevrt_path.c_str(), 0, nullptr, nullptr);
uint32_t ret = driver.module_load_data.call(&cond_kernel_module_,

To Opus: what does 'driver.module_load_data.call(&cond_kernel_module_, kConditionKernelFatbin);' do?


Opus response:

It calls cuModuleLoadData(module_ptr, fatbin_data), which loads the fatbin into a CUDA module. The key difference is .call() vs operator():
.call() (line 63) — calls the CUDA driver function and returns the error code as uint32. The caller checks the result.
operator() (line 91) — calls the function and fatally errors (QD_ERROR_IF) if it fails.

The fatbin code uses .call() so it can check the return value and give a specific error message mentioning the SM version and the regeneration script, rather than getting a
generic CUDA driver error.
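
The distinction between the two calling conventions can be sketched in Python (a toy analogue; the real wrapper is C++, and all names here are illustrative):

```python
class DriverFn:
    """Toy Python analogue of the driver-function wrapper described above."""

    def __init__(self, fn):
        self._fn = fn

    def call(self, *args) -> int:
        # .call(): return the error code; the caller decides how to handle it
        return self._fn(*args)

    def __call__(self, *args) -> None:
        # operator(): fail hard on any non-zero code (the QD_ERROR_IF-style path)
        ret = self._fn(*args)
        if ret != 0:
            raise RuntimeError(f"driver call failed with code {ret}")

# A fake driver entry point that succeeds or fails on demand.
module_load_data = DriverFn(lambda ok: 0 if ok else 999)

code = module_load_data.call(False)   # caller inspects the code...
try:
    module_load_data(False)           # ...while this path raises instead
    raised = False
except RuntimeError:
    raised = True
```

The fatbin path wants the first convention, so it can turn a load failure into a message that names the SM version and the regeneration script.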

Comment thread scripts/build_condition_kernel_fatbin.sh Outdated
REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
SRC="${REPO_ROOT}/quadrants/runtime/cuda/graph_do_while_cond.cu"
OUT_DIR="${REPO_ROOT}/quadrants/runtime/cuda"
OBJ=$(mktemp /tmp/condition_kernel.XXXXXX.o)

Question to Opus: will the SASS also work on Windows, if we built it on Linux?


Opus response:

Yes. SASS is GPU machine code — it runs on the GPU, not the CPU. The GPU doesn't care what OS the host is running. The same sm_90 SASS runs identically on an H100 whether the
host is Linux or Windows. cuModuleLoadData loads the fatbin into the GPU the same way on both platforms.
This is actually another advantage of the fatbin approach over the old JIT path — the JIT path needed platform-specific logic to find libcudadevrt.a (Linux) vs cudadevrt.lib
(Windows), which is exactly the code we deleted in the merge conflict. The fatbin just works everywhere.

FATBIN=$(mktemp /tmp/condition_kernel.XXXXXX.fatbin)

cleanup() { rm -f "${OBJ}" "${FATBIN}"; }
trap cleanup EXIT

to Opus: under what conditions does this run?


from Opus:

trap cleanup EXIT runs the cleanup function whenever the shell exits, regardless of the reason:
• Script finishes normally
• set -e triggers an early exit on error (e.g. nvcc fails)
• Explicit exit call
• Signal-based termination (SIGINT/Ctrl-C, SIGTERM, etc. — because these cause the shell to exit, which triggers the EXIT trap)

So the temp files (${OBJ}, ${FATBIN}) are cleaned up in all cases. It's the standard pattern for temp file cleanup in bash scripts.
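
The trap behavior can be checked from Python by running a small bash script that mirrors the pattern (the script body is illustrative, not the repo's build script):

```python
import os
import subprocess

# set -e + trap cleanup EXIT: "false" aborts the script early,
# but the EXIT trap still runs and removes the temp file.
script = r'''
set -e
tmp=$(mktemp /tmp/trap_demo.XXXXXX)
cleanup() { rm -f "$tmp"; echo "cleaned:$tmp"; }
trap cleanup EXIT
false            # set -e aborts the script here
echo "never reached"
'''

result = subprocess.run(["bash", "-c", script], capture_output=True, text=True)
path = result.stdout.strip().split(":", 1)[1]
# result.returncode is non-zero (the script failed at "false"),
# stdout shows the trap fired, and the temp file is gone.
```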


I'd prefer this script in Python. How would it look?


from Opus:

#!/usr/bin/env python3
"""Builds the pre-compiled fatbin for the graph_do_while condition kernel.

Requires the CUDA toolkit (nvcc) to be installed.

The output is a C header (graph_do_while_cond_fatbin.h) containing the fatbin
as a byte array, which is compiled into the quadrants binary. At runtime
this lets us load the condition kernel without needing libcudadevrt.a on
the user's system.

Usage:
    python scripts/build_condition_kernel_fatbin.py

See docs/source/user_guide/building_cudagraph_conditional_fatbin.md
"""
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent
SRC = REPO_ROOT / "quadrants" / "runtime" / "cuda" / "graph_do_while_cond.cu"
OUT_HEADER = REPO_ROOT / "quadrants" / "runtime" / "cuda" / "graph_do_while_cond_fatbin.h"

SM_TARGETS = [
    "-gencode=arch=compute_90,code=sm_90",
    "-gencode=arch=compute_100,code=sm_100",
    "-gencode=arch=compute_120,code=sm_120",
]


def find_nvcc() -> str:
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    for candidate in ["/usr/local/cuda/bin/nvcc"]:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.")


def run(cmd: list[str]) -> None:
    print(f"  {' '.join(cmd)}")
    subprocess.check_call(cmd)


def generate_header(fatbin_path: Path, out_path: Path) -> None:
    data = fatbin_path.read_bytes()
    lines = [
        "// Auto-generated by scripts/build_condition_kernel_fatbin.py",
        "// Do not edit manually. Regenerate with that script.",
        "//",
        "// Source: quadrants/runtime/cuda/graph_do_while_cond.cu",
        f"// Size: {len(data)} bytes",
        "",
        "#pragma once",
        "",
        "#include <cstddef>",
        "",
        "static const unsigned char kConditionKernelFatbin[] = {",
    ]
    for i in range(0, len(data), 16):
        chunk = data[i : i + 16]
        lines.append("    " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append("")
    lines.append(f"static const std::size_t kConditionKernelFatbinSize = {len(data)};")
    lines.append("")
    out_path.write_text("\n".join(lines))
    print(f"Wrote {out_path} ({len(data)} bytes)")


def main() -> None:
    nvcc = find_nvcc()
    with tempfile.TemporaryDirectory() as tmpdir:
        obj = Path(tmpdir) / "condition_kernel.o"
        fatbin = Path(tmpdir) / "condition_kernel.fatbin"
        print("Compiling ...")
        run([nvcc, "-dc", "-rdc=true", *SM_TARGETS, str(SRC), "-o", str(obj)])
        print("Device-linking with libcudadevrt ...")
        run([nvcc, "-dlink", *SM_TARGETS, str(obj), "-lcudadevrt", "-fatbin", "-o", str(fatbin)])
        print("Generating C header ...")
        generate_header(fatbin, OUT_HEADER)
    print("Done.")


if __name__ == "__main__":
    main()

Comment thread scripts/build_condition_kernel_fatbin.py Outdated
fatbin = Path(tmpdir) / "condition_kernel.fatbin"

print("Compiling ...")
run([nvcc, "-dc", "-rdc=true", *SM_TARGETS, str(SRC), "-o", str(obj)])

Question to Opus: what does this do?

@hughperkins hughperkins Apr 9, 2026


Opus response:

This is the device linking step. It resolves the cudaGraphSetConditional extern in the compiled object.
The flow is two stages:

  1. nvcc -dc -rdc=true — compiles graph_do_while_cond.cu into a relocatable device object. The call to cudaGraphSetConditional is left as an unresolved extern (because -rdc=true
    means "I'll link later").
  2. nvcc -dlink — takes that object and links it against -lcudadevrt (which provides the cudaGraphSetConditional implementation), producing a self-contained fatbin with the
    extern fully resolved. The -fatbin flag says to output a fat binary containing SASS for all the SM targets.

After this step, the fatbin has no remaining external dependencies — cudaGraphSetConditional is baked in. That's why we don't need libcudadevrt.a at runtime.


What is the meaning of each command-line option?


Opus response:

For the -dlink command:
-dlink — perform device linking (resolve cross-module device function references)
*SM_TARGETS — the -gencode flags specifying which SM architectures to produce SASS for (sm_90, sm_100, sm_120)
str(obj_path) — input: the relocatable device object from the compile step
-lcudadevrt — link against NVIDIA's device runtime library (provides cudaGraphSetConditional)
-fatbin — output a fatbin (fat binary containing SASS for multiple architectures) rather than a single cubin
-o str(fatbin_path) — output file path

And for the -dc compile command:
-dc — compile to device-only relocatable object (no host code, externs left unresolved)
-rdc=true — enable relocatable device code (allows cross-module device function calls, required for calling cudaGraphSetConditional which lives in a separate library)
*SM_TARGETS — same gencode flags
str(SRC) — the .cu source file
-o str(obj_path) — output file path


def run(cmd: list[str]) -> None:
    print(f"  {' '.join(cmd)}")
    subprocess.check_call(cmd)

What happens if this fails? Do we see the full error messages?


Opus response:

Yes — check_call inherits the parent's stdout/stderr, so nvcc's error messages print in real time as the command runs. When it fails, Python then raises CalledProcessError
which adds the exit code. So you see both the full nvcc error output and the Python traceback confirming the non-zero exit.
If we had used capture_output=True or subprocess.run(..., capture_output=True), the error output would be hidden unless explicitly printed from the exception. The current
approach is the right one for a build script.
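
A minimal demonstration of the failure path, using a deliberately failing child process (illustrative, not the repo's code):

```python
import subprocess
import sys

# check_call inherits the parent's stdout/stderr, so the child's output
# streams through as it runs; on a non-zero exit it raises
# CalledProcessError carrying the exit code.
try:
    subprocess.check_call([sys.executable, "-c", "import sys; sys.exit(3)"])
    raised = False
except subprocess.CalledProcessError as exc:
    raised = True
    code = exc.returncode
```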

Comment thread scripts/build_condition_kernel_fatbin.py Outdated
Comment thread scripts/build_condition_kernel_fatbin.py Outdated
Comment thread scripts/build_condition_kernel_fatbin.py Outdated
Comment thread scripts/build_condition_kernel_fatbin.py Outdated
@hughperkins

With the exception of the generated fatbin header file, I have read every line added in this PR, and reviewed the lines. I take responsibility for the lines added and removed in this PR, and won't blame any issues on Opus.

@hughperkins hughperkins marked this pull request as ready for review April 9, 2026 17:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2b72b298b9


0x33, 0x00, 0x42, 0x75, 0x69, 0x6c, 0x64, 0x20, 0x63, 0x75, 0x64, 0x61, 0x5f, 0x31, 0x32, 0x2e,
0x38, 0x2e, 0x72, 0x31, 0x32, 0x2e, 0x38, 0x2f, 0x63, 0x6f, 0x6d, 0x70, 0x69, 0x6c, 0x65, 0x72,
0x2e, 0x33, 0x35, 0x35, 0x38, 0x33, 0x38, 0x37, 0x30, 0x5f, 0x30, 0x00, 0x2d, 0x61, 0x72, 0x63,
0x68, 0x20, 0x73, 0x6d, 0x5f, 0x31, 0x30, 0x30, 0x20, 0x2d, 0x6d, 0x20, 0x36, 0x34, 0x20, 0x2d,

P1 Badge Regenerate fatbin with SM90 code image

This generated blob advertises an -arch sm_100 image (and sm_120 later in the same file) but contains no sm_90 image, while the runtime still treats all cc >= 90 devices as natively supported. On Hopper-class GPUs (SM90), cuModuleLoadData will fail and graph-do-while raises instead of running, so this change regresses a previously supported architecture. Please regenerate the checked-in fatbin with an SM90 target (or narrow the runtime support gate).



Curious. It parsed this into binary, and then interpreted that? 😲


The author agent says:

see, another agent says it is missing 90:
"[Pasted text #1 +31 lines]
"

The other agent's methodology is wrong. Using strings to look for ASCII text like "sm_90" in a binary is unreliable — the SM version is encoded in binary ELF headers, not
necessarily as an ASCII string. The strings output they found (like -arch sm_100 -m 64) are nvcc command-line fragments embedded in per-section metadata, not an exhaustive
list of architectures.
cuobjdump -lelf is the authoritative tool for inspecting fatbin contents — it actually parses the fatbin format. And it clearly shows all three:

ELF file 1: check_fatbin.1.sm_90.cubin
ELF file 2: check_fatbin.2.sm_100.cubin
ELF file 3: check_fatbin.3.sm_120.cubin

The fatbin is correct. sm_90 is present.


how can I verify the header file myself? what command?


Extract the bytes from the header and run cuobjdump:

cd /path/to/quadrants
python3 -c "
import re
from pathlib import Path
text = Path('quadrants/runtime/cuda/graph_do_while_cond_fatbin.h').read_text()
m = re.search(r'kConditionKernelFatbin\[\] = \{([^}]+)\}', text, re.DOTALL)
data = bytes(int(x.strip(), 16) for x in m.group(1).split(',') if x.strip())
Path('/tmp/check.fatbin').write_bytes(data)
" && cuobjdump -lelf /tmp/check.fatbin


My output:

[Quadrants Build] hugh@hugh5090:~/git/quadrants$ export PATH=$PATH:/usr/local/cuda/bin
[Quadrants Build] hugh@hugh5090:~/git/quadrants$ python3 -c "
import re
from pathlib import Path
text = Path('quadrants/runtime/cuda/graph_do_while_cond_fatbin.h').read_text()
m = re.search(r'kConditionKernelFatbin\[\] = \{([^}]+)\}', text, re.DOTALL)
data = bytes(int(x.strip(), 16) for x in m.group(1).split(',') if x.strip())
Path('/tmp/check.fatbin').write_bytes(data)
" && cuobjdump -lelf /tmp/check.fatbin
ELF file    1: check.1.sm_90.cubin
ELF file    2: check.2.sm_100.cubin
ELF file    3: check.3.sm_120.cubin

Comment on lines +35 to +38
    for candidate in ["/usr/local/cuda/bin/nvcc"]:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.")

P2 Badge Resolve nvcc from CUDA_HOME before failing

find_nvcc() checks only PATH and /usr/local/cuda/bin/nvcc, then exits; it never consults CUDA_HOME/CUDA_PATH. That breaks fatbin regeneration on common non-default toolkit installs where users rely on CUDA_HOME, making the new workflow unexpectedly fail even though CUDA is installed.


Comment on lines +31 to +38
def find_nvcc() -> str:
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    for candidate in ["/usr/local/cuda/bin/nvcc"]:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.")

🔴 The docs promise that setting CUDA_HOME is sufficient to find nvcc, but find_nvcc() never reads that variable — users who follow the documented instructions will hit "Error: nvcc not found" even when ${CUDA_HOME}/bin/nvcc exists. Fix by adding os.environ.get("CUDA_HOME") and os.environ.get("CUDA_PATH") to the candidates list in find_nvcc().

Extended reasoning...

What the bug is

docs/source/user_guide/building_cudagraph_conditional_fatbin.md (line 34) states: "The nvcc binary must be on your PATH, or set CUDA_HOME". This creates a clear user expectation that setting CUDA_HOME is a supported alternative to having nvcc on PATH.

The specific code path that fails

find_nvcc() in scripts/build_condition_kernel_fatbin.py (lines 31–38) does exactly two things:

  1. shutil.which("nvcc") — searches PATH
  2. Checks the hardcoded path /usr/local/cuda/bin/nvcc

It never calls os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH"). The environment variables are simply ignored.

Why existing code doesn't prevent it

There is no fallback or warning. If nvcc isn't on PATH and /usr/local/cuda/bin/nvcc doesn't exist, the script calls sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH.") — and the error message itself only mentions PATH, contradicting the docs that also mention CUDA_HOME.

Impact

This is a common setup in HPC clusters and Docker/container environments where the CUDA toolkit is installed under a non-standard prefix (e.g., /opt/cuda, /usr/local/cuda-12.4) and CUDA_HOME is set by the module system or container entrypoint, but the bin directory is not on PATH. Any developer following the documented workflow will get a misleading error.

How to fix

Add CUDA_HOME and CUDA_PATH lookups to the candidates list in find_nvcc():

import os

def find_nvcc() -> str:
    nvcc = shutil.which("nvcc")
    if nvcc:
        return nvcc
    candidates = [
        os.path.join(os.environ[v], "bin", "nvcc")
        for v in ("CUDA_HOME", "CUDA_PATH")
        if v in os.environ
    ]
    candidates.append("/usr/local/cuda/bin/nvcc")
    for candidate in candidates:
        if Path(candidate).exists():
            return candidate
    sys.exit("Error: nvcc not found. Install the CUDA toolkit and add it to PATH, or set CUDA_HOME.")

Step-by-step proof

  1. User installs CUDA 12.4 under /opt/cuda-12.4.
  2. Their shell sets CUDA_HOME=/opt/cuda-12.4 but does not add /opt/cuda-12.4/bin to PATH.
  3. User reads the docs: "set CUDA_HOME" — they believe they are set up correctly.
  4. User runs python scripts/build_condition_kernel_fatbin.py.
  5. shutil.which("nvcc") returns None (not on PATH).
  6. Path("/usr/local/cuda/bin/nvcc").exists() returns False (wrong prefix).
  7. sys.exit("Error: nvcc not found...") terminates — even though /opt/cuda-12.4/bin/nvcc exists.


## Prerequisites

- CUDA toolkit with `nvcc` (12.4 or later, for conditional node support).

🟡 The Prerequisites section states "CUDA toolkit with nvcc (12.4 or later)" but the script targets SM 100 (Blackwell, requires CUDA 12.6+) and SM 120 (RTX 5090, requires CUDA 12.8+), so any developer with CUDA 12.4–12.5 will hit nvcc "unsupported gpu architecture" errors. The minimum version should be raised to at least 12.8 to compile all three SM targets.

Extended reasoning...

What the bug is: The Prerequisites section of docs/source/user_guide/building_cudagraph_conditional_fatbin.md (line 33) states: CUDA toolkit with nvcc (12.4 or later, for conditional node support). This version floor is too low for two of the three SM architectures targeted by the regeneration script.

The specific code path that triggers it: scripts/build_condition_kernel_fatbin.py defines SM_VERSIONS = [90, 100, 120] and expands these into three -gencode=arch=compute_N,code=sm_N flags passed to nvcc. nvcc's support for these architectures was introduced progressively:

  • SM 90 (Hopper / H100): introduced in CUDA 11.8
  • SM 100 (Blackwell B100/B200): introduced in CUDA 12.6
  • SM 120 (Blackwell Ultra / RTX 5090 GB202): introduced in CUDA 12.8

Why existing code doesn't prevent it: The documentation simply asserts 12.4 as sufficient, and the script does no version check before invoking nvcc. Nothing catches the mismatch until nvcc itself emits the error.

What the impact would be: A developer who installs CUDA 12.4 or 12.5 and runs python scripts/build_condition_kernel_fatbin.py will see nvcc errors such as unsupported gpu architecture 'compute_100' and unsupported gpu architecture 'compute_120'. The script will abort without producing the fatbin, and the developer will have wasted time on a misleading prerequisite.

How to fix it: Change line 33 from 12.4 or later to 12.8 or later (since SM 120 requires CUDA 12.8). Optionally, add a note listing the per-architecture minimums for clarity, or add a version check in find_nvcc().

Step-by-step proof:

  1. Developer reads the Prerequisites section and installs CUDA 12.5 (satisfies the stated ≥12.4 requirement).
  2. Developer runs python scripts/build_condition_kernel_fatbin.py.
  3. find_nvcc() finds nvcc from CUDA 12.5.
  4. Script calls: nvcc -dc -rdc=true -gencode=arch=compute_90,code=sm_90 -gencode=arch=compute_100,code=sm_100 -gencode=arch=compute_120,code=sm_120 ...
  5. nvcc 12.5 knows sm_90 but does not know sm_100 or sm_120 — it was released before Blackwell support.
  6. nvcc exits with error: nvcc fatal: Unsupported gpu architecture 'compute_100'.
  7. subprocess.check_call raises CalledProcessError; no fatbin is produced.
  8. Developer is confused because they followed the documented prerequisite exactly.
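
If a version check were added to the script, it might look like this sketch (the function name and sample banner text are hypothetical, not part of the PR):

```python
import re

MIN_CUDA = (12, 8)  # floor implied by the sm_120 target, per the review above

def parse_nvcc_release(banner: str) -> tuple[int, int]:
    """Extract (major, minor) from `nvcc --version` output."""
    m = re.search(r"release (\d+)\.(\d+)", banner)
    if m is None:
        raise ValueError("could not find a release number in nvcc output")
    return (int(m.group(1)), int(m.group(2)))

# Sample banner line (illustrative, not captured from a real run):
sample = "Cuda compilation tools, release 12.8, V12.8.61"
version = parse_nvcc_release(sample)
ok = version >= MIN_CUDA
```

Comparing tuples gives the right ordering for (major, minor) pairs, so a 12.5 toolkit would fail the check before nvcc ever emits "unsupported gpu architecture".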

header_lines.append("")
header_lines.append(f"static const std::size_t kConditionKernelFatbinSize = {len(fatbin_data)};")
header_lines.append("")
out_path.write_text("\n".join(header_lines))

🟡 The write_text call at line 68 of generate_header omits newline='\n', so on Windows Python's text mode translates every LF to CRLF, producing a generated header with Windows line endings. A Windows developer who regenerates and commits the fatbin header will introduce CRLF endings into the repo, causing git to show the entire file as changed on Linux pulls. Fix: out_path.write_text('\n'.join(header_lines), encoding='utf-8', newline='\n').

Extended reasoning...

What the bug is and how it manifests

generate_header builds the graph_do_while_cond_fatbin.h header using out_path.write_text("\n".join(header_lines)). Python's Path.write_text opens the file in text mode with the platform-default encoding and newline translation. On Windows, text-mode writes translate every LF (\n) to CRLF (\r\n). This means the generated header will have CRLF line endings whenever the script is run on Windows.

The specific code path

scripts/build_condition_kernel_fatbin.py line 68: out_path.write_text("\n".join(header_lines)). The header_lines list is joined with \n separators, and a trailing \n is appended as the last element. On Linux/macOS, write_text passes these through unmodified. On Windows, each \n becomes \r\n.

Why existing code doesn't prevent it

There is no .gitattributes file in the repo (confirmed by verifiers) to enforce LF normalization on commit. If a Windows developer has core.autocrlf=false or core.autocrlf=input, git will not normalize the line endings, and the committed file will contain CRLF endings.

Impact

Any Linux developer (or CI) that pulls after a Windows regeneration will see the entire graph_do_while_cond_fatbin.h as modified in git diff, even though no functional content changed. This creates noisy diffs and confusion about whether the fatbin was intentionally updated.

How to fix

Change the call to: out_path.write_text("\n".join(header_lines), encoding="utf-8", newline="\n"). Alternatively, use out_path.write_bytes("\n".join(header_lines).encode("utf-8")) which bypasses text-mode translation entirely.

Step-by-step proof

  1. Developer on Windows runs python scripts/build_condition_kernel_fatbin.py.
  2. generate_header assembles header_lines = ["// Auto-generated ...", ..., ""] and calls out_path.write_text("\n".join(header_lines)).
  3. Python opens the file in text mode. Windows text mode translates each \n write to a \r\n byte sequence.
  4. The written file contains // Auto-generated ...\r\n// Do not edit ...\r\n... etc.
  5. Developer runs git diff — entire file shows as changed compared to the LF-only version checked in from Linux.
  6. Developer commits; CI on Linux now also shows the file as modified.

Comment on lines +56 to +60
## Adding a new SM architecture

Edit the `SM_TARGETS` list in `scripts/build_condition_kernel_fatbin.py` to
add the new `-gencode` entry, then regenerate.


🟡 The "Adding a new SM architecture" section tells users to "Edit the SM_TARGETS list" but SM_TARGETS is a derived list auto-computed via a list comprehension — the user-editable variable is SM_VERSIONS = [90, 100, 120]. Update the docs to say "Edit the SM_VERSIONS list ... to add the new SM version number (e.g., 130)".

Extended reasoning...

What the bug is: The documentation section "Adding a new SM architecture" (line 58 of the new doc file) instructs users to "Edit the SM_TARGETS list in scripts/build_condition_kernel_fatbin.py to add the new -gencode entry, then regenerate." However, this is incorrect — SM_TARGETS is not the user-editable variable.

The specific code path: In scripts/build_condition_kernel_fatbin.py, the script defines two variables:

  • Line 27: SM_VERSIONS = [90, 100, 120] — a simple list of integer SM version numbers, intended to be edited by the user
  • Line 28: SM_TARGETS = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS] — a derived list of full gencode strings, auto-computed from SM_VERSIONS via a list comprehension

Why existing code doesn't prevent it: The doc was written referencing the original Opus-generated script (visible in the PR timeline at comment 3059546304), which had SM_TARGETS as a hand-edited list of full gencode strings. Later in the review, hughperkins asked "cant we just specify versions, rather than the entire string?" (comment 3059569723, marked resolved), which led to the refactoring that introduced SM_VERSIONS. The docs were not updated to reflect this change.

What the impact would be: A developer following these instructions to add SM 130 support would search build_condition_kernel_fatbin.py for SM_TARGETS expecting to find a list of integers. Instead they find a list comprehension expression. While they could technically append to SM_TARGETS directly (it's just a Python list at runtime), this would bypass the SM_VERSIONS design intent, and the doc's phrasing "add the new -gencode entry" matches the old style, not the current one. The correct action is to add an integer to SM_VERSIONS.

How to fix it: Update line 58 of docs/source/user_guide/building_cudagraph_conditional_fatbin.md from:

Edit the SM_TARGETS list in scripts/build_condition_kernel_fatbin.py to add the new -gencode entry, then regenerate.

To:

Edit the SM_VERSIONS list in scripts/build_condition_kernel_fatbin.py to add the new SM version number (e.g., 130), then regenerate.

Step-by-step proof: (1) User reads doc, sees "Edit the SM_TARGETS list". (2) User opens scripts/build_condition_kernel_fatbin.py and searches for SM_TARGETS. (3) They find SM_TARGETS = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS] — a list comprehension, not a hand-edited list. (4) They are confused about how to add a new entry. (5) The correct action — adding 130 to SM_VERSIONS = [90, 100, 120] — is not mentioned anywhere in the doc.
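
The refactored variables described above can be sketched as follows (names mirror the review's description; the appended version 130 is purely hypothetical):

```python
# User-editable list of SM version numbers.
SM_VERSIONS = [90, 100, 120]
# Derived gencode flags -- auto-computed, never hand-edited.
SM_TARGETS = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS]

# Adding a new architecture means appending one integer, not a flag string:
SM_VERSIONS_NEW = SM_VERSIONS + [130]
SM_TARGETS_NEW = [f"-gencode=arch=compute_{v},code=sm_{v}" for v in SM_VERSIONS_NEW]
```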

…ne endings

- find_nvcc() now checks CUDA_HOME/CUDA_PATH env vars before fallback
- Fix CUDA toolkit prerequisite: 12.4 -> 12.8 (needed for SM 120)
- Use newline="\n" in write_text to avoid CRLF on Windows
- Docs: reference SM_VERSIONS (not SM_TARGETS) for adding architectures
@hughperkins
Collaborator Author

Thanks!

…dd-fatbin

# Conflicts:
#	quadrants/runtime/cuda/graph_manager.cpp
@hughperkins hughperkins enabled auto-merge (squash) April 10, 2026 11:00
@hughperkins hughperkins merged commit 83e4b3b into main Apr 10, 2026
47 checks passed
@hughperkins hughperkins deleted the hp/cuda-graph-mvp-5-add-fatbin branch April 10, 2026 13:20