
Chm02 CUDA: mainline correctness fixes, tests, and timing surface#84

Open
tpn wants to merge 14 commits into main from issue-79-chm02-cuda-mainline

Conversation


@tpn tpn commented Mar 31, 2026

Closes #79

Summary

This PR mainlines the validated Chm02 CUDA bring-up work as a focused correctness-first integration slice.

It promotes the legacy Chm02 CUDA path from a CPU-assisted bring-up flow toward a first-class correctness path for single-graph runs by moving the major solve phases onto the GPU, fixing Linux compatibility issues, adding focused regression coverage, and exposing explicit CUDA timing fields.

Included

  • Graph.cu / GraphCu.c fixes for CUDA add-keys, peel/order capture, assignment, and verify.
  • Linux Chm02Compat / file-work fixes needed for no-file-io and file-io parity.
  • Focused regression coverage for CUDA Chm02:
    • Hologram known-seed no-file-io
    • Hologram known-seed file-io
    • generated non-Assigned16 case
    • timing-field presence
  • Explicit per-phase CUDA timing fields in CSV output.
  • docs/chm02-cuda-mainline.md describing scope, non-goals, compatibility notes, and staged acceptance.

Non-goals

  • The standalone batched GPU peeling POC.
  • Large-scale GPU constructor/performance work.
  • Throughput-optimized Chm02 CUDA kernels.

Validation

Ran:

  • cmake --build build-cuda -j2
  • ctest --test-dir build-cuda --output-on-failure -R 'perfecthash\.cuda\.chm02'

Current focused CUDA Chm02 test result:

  • 4/4 passed

Notes

This remains a correctness-first CUDA path. The serial CUDA kernels are intentionally bring-up oriented and should not be read as a throughput claim.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 302f13b036


Comment on lines +1281 to +1282
CUDA_CALL(cudaMalloc((void **)&DeviceFailures, sizeof(*DeviceFailures)));
CUDA_CALL(cudaMemsetAsync(DeviceFailures, 0, sizeof(*DeviceFailures), Stream));

P2: Free verify scratch buffer on CUDA API errors

After DeviceFailures is allocated, subsequent calls still use CUDA_CALL(...), which returns immediately on failure instead of jumping to End. In this function, returning early skips the cudaFree(DeviceFailures) cleanup path, so any failure in cudaMemsetAsync or the later memcpy/sync calls leaks device memory and can snowball across repeated verify attempts in long-running runs.




Development

Successfully merging this pull request may close these issues.

Mainline Chm02 CUDA bring-up fixes, tests, and timing surface
