Using DisableLLVMLoopOpt can generate crashy Cuda code #6061

@steven-johnson

To see this, run correctness_gpu_dynamic_shared with HL_JIT_TARGET=host-cuda-disable_llvm_loop_opt; on (at least) x86-64-Linux systems, the test crashes with an illegal memory access. (Note that only the case in the test with per_thread=1, memory_type=GPUShared fails; editing the test to run only this case makes debugging a bit simpler.)
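For anyone who wants to A/B the two targets, here is a minimal sketch (not part of the Halide test suite) that runs the same command once with HL_JIT_TARGET=host-cuda and once with HL_JIT_TARGET=host-cuda-disable_llvm_loop_opt, and collects the exit code and output of each run. The helper names `run_with_target` and `compare_targets` are made up for illustration, and the location of the test binary is an assumption that depends on your build tree.

```python
import os
import subprocess

# Target strings: the first is the plain CUDA JIT target, the second is the
# one from this issue that triggers the crash.
BASELINE = "host-cuda"
SUSPECT = "host-cuda-disable_llvm_loop_opt"


def run_with_target(cmd, target):
    """Run cmd with HL_JIT_TARGET set to target.

    Returns (returncode, combined stdout+stderr). On POSIX systems a negative
    return code means the process was killed by a signal.
    """
    env = dict(os.environ, HL_JIT_TARGET=target)
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return proc.returncode, proc.stdout


def compare_targets(cmd):
    """Run cmd under both targets; return {target: (returncode, output)}."""
    return {t: run_with_target(cmd, t) for t in (BASELINE, SUSPECT)}


# Example call (hypothetical path; adjust to wherever your build puts the test):
#   results = compare_targets(["./correctness_gpu_dynamic_shared"])
#   for target, (code, _) in results.items():
#       print(target, "exit code", code)
```

Because the command is passed as a list, the same helper can wrap the test in cuda-memcheck or another tool by prepending it to the list, assuming that tool is on your PATH.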

It's not at all clear yet whether the culprit here is in LLVM or in the NVIDIA driver. (It's almost certainly not Halide per se, as our IR is identical whether you use disable_llvm_loop_opt or not.)

@abadams and I both suspect the driver, as

  • we've only seen this on x86-64-linux systems running recent "real" NVIDIA drivers (not the open-source variant)
  • looking at the PTX disassembly and hand-walking through it doesn't show anything obviously wrong to our eyes
  • the failure appears to be a write that is one-past-the-end of the shared memory block
  • running under cuda-memcheck and cuda-gdb hasn't enlightened us any further
  • same behavior is seen when building Halide with LLVM11/12/13

We'd like to run this to ground so that we can consider landing #5019, but are a bit at a loss as to how to do so -- next step might be to see if we have a contact inside NVIDIA (or perhaps a PTX Ninja who might know more than us) to help take a look.
