151 changes: 151 additions & 0 deletions docs/chm02-cuda-mainline.md
@@ -0,0 +1,151 @@
# Chm02 CUDA Mainline Note

## Summary

This note captures the intent of the `issue-79-chm02-cuda-mainline` branch.

The branch promotes the legacy `Chm02` CUDA path from a CPU-assisted bring-up
state toward a first-class correctness path by moving the major solve phases
(`IsAcyclic`, `Assign`, `Verify`) onto the GPU while keeping CPU-oracle-style
validation and debugging support available during bring-up.

## Goals

- Fix correctness blockers in the existing `Chm02` CUDA path.
- Make known-seed CLI runs succeed on Linux in both no-file-io and file-io
configurations.
- Add regression coverage for:
  - known-seed `Chm02` CUDA runs
  - a generated non-`Assigned16` case
  - timing-field presence
- Expose explicit per-phase CUDA timing fields for measurement.

## Current Mainline Meaning

For this branch, “mainline” means:

- the major `Chm02` solve phases are GPU-backed
- the path is correctness-first, not throughput-first
- CPU graph state is still required as part of the current implementation for
bring-up compatibility and oracle-style validation support

## Non-Goals

- High-throughput GPU solving.
- Batched multi-attempt GPU construction.
- Replacing the standalone GPU peeling POC.
- Eliminating all CPU-oracle/debug-only code from the branch.

The current `Chm02` CUDA implementation remains correctness-first, not
throughput-first.

## Supported Scope

- Algorithm: `Chm02`
- Hash path: the branch is only accepted against the combinations covered by
the focused regression matrix below; broader hash-family support remains a
follow-on concern
- CUDA path: single-graph bring-up / validation
- Platform focus:
  - Linux with CUDA enabled
  - existing regression coverage on the configured CUDA host

The following supporting code changes are considered in-scope for this branch:

- Linux file-work compatibility fixes needed for the `Chm02Compat` path
- CSV/timing schema updates needed to surface CUDA phase timing
- the Linux `QueryPerformanceFrequency()` correction that makes those timings
sane on non-Windows builds

## Fallback / Debugging Policy

- Normal operation should use the GPU path for add-keys, acyclic detection,
assignment, and verify.
- CPU-oracle and order-validation logic is intended as bring-up/debug support.
- `PH_DEBUG_CUDA_CHM02` enables extra logging and validation details for
troubleshooting.

## Timing Contract

The following CSV fields are emitted:

- `CuAddKeysMicroseconds`
- `CuIsAcyclicMicroseconds`
- `CuAssignMicroseconds`
- `CuVerifyMicroseconds`

These are synchronized phase timings around the CUDA-backed phase wrappers, not
raw kernel-only device timings.
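
To illustrate what a synchronized phase timing measures, here is a minimal host-side sketch in C. It wraps an entire phase call with a monotonic clock, so the result includes launch overhead, synchronization, and host bookkeeping rather than kernel time alone. The names `phase_fn`, `time_phase_microseconds`, and `fake_phase` are hypothetical and do not come from the branch; in a CUDA build the wrapped phase would be expected to synchronize the device (for example with `cudaDeviceSynchronize()`) before returning.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical phase signature: returns 0 on success. */
typedef int (*phase_fn)(void *state);

/* Reads the monotonic clock and converts it to whole microseconds. */
static int64_t now_microseconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/*
 * Runs one phase and stores the wall-clock elapsed microseconds.
 * Assumption: the phase only returns after its device work has been
 * synchronized, so the elapsed time covers the whole phase wrapper.
 */
static int time_phase_microseconds(phase_fn fn, void *state, int64_t *elapsed)
{
    int64_t start = now_microseconds();
    int result = fn(state);
    *elapsed = now_microseconds() - start;
    return result;
}

/* Stand-in for a GPU-backed phase; it only does a little host work. */
static int fake_phase(void *state)
{
    volatile unsigned long *counter = state;
    for (unsigned long i = 0; i < 100000; i++) {
        (*counter)++;
    }
    return 0;
}
```

Calling `time_phase_microseconds(fake_phase, &counter, &elapsed)` yields a non-negative `elapsed` value, which is the property the timing-field regression relies on.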

Compatibility note:

- this branch preserves the historical
`GpuIsAcyclicButCpuIsCyclicFailures` column as a zero-valued compatibility
stub in order to keep downstream CSV column positions stable
- this branch intentionally adds the four `Cu*` timing fields above
- the existing non-CUDA timing fields should continue to use the same timing
base; the Linux `QueryPerformanceFrequency()` fix is included so that both
those fields and the new `Cu*` timing fields stay coherent on non-Windows
builds

## Failure-Path Expectations

- Cyclic graphs are expected to return normal non-success solve results; they
are not considered internal errors.
- CUDA-disabled builds are expected to continue using the non-CUDA code paths.
- GPU order-validation and extra CPU-oracle diagnostics are debug-only aids,
controlled by `PH_DEBUG_CUDA_CHM02`.
- Non-debug runs are expected to surface failure through the normal `HRESULT`
and verification paths, not through verbose stderr diagnostics.
- The current serial CUDA kernels are correctness-first and must not be treated
as throughput-optimized production behavior.

## Debug Surface

The following debug surface is intentionally supported for this bring-up phase:

- `PH_DEBUG_CUDA_CHM02`
- stderr logging from the CUDA `Chm02` path
- stable debug tokens used by the known-seed regression harnesses:
  - `PH_CHM02_CUDA_ORDER_OK`
  - `PH_CHM02_CUDA_ASSIGN_OK`
  - `PH_CHM02_CUDA_VERIFY_OK`

This surface is explicitly considered temporary bring-up instrumentation, not a
long-term stable user-facing API.

For this branch, however, the three `PH_CHM02_CUDA_*_OK` tokens are treated as
a supported test contract for the focused known-seed regression harness.
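
A harness-side check for this token contract can be as small as a substring scan over captured stderr. The sketch below is illustrative only: the function name and the idea that the harness holds stderr in a single NUL-terminated string are assumptions, and only the three token strings are taken from this note.

```c
#include <stddef.h>
#include <string.h>

/* The three tokens listed in this note. */
static const char *const kChm02CudaTokens[] = {
    "PH_CHM02_CUDA_ORDER_OK",
    "PH_CHM02_CUDA_ASSIGN_OK",
    "PH_CHM02_CUDA_VERIFY_OK",
};

/*
 * Hypothetical helper: returns the first token that does not appear in
 * the captured stderr text, or NULL when all three are present.
 */
static const char *first_missing_token(const char *captured_stderr)
{
    size_t count = sizeof(kChm02CudaTokens) / sizeof(kChm02CudaTokens[0]);
    for (size_t i = 0; i < count; i++) {
        if (strstr(captured_stderr, kChm02CudaTokens[i]) == NULL) {
            return kChm02CudaTokens[i];
        }
    }
    return NULL;
}
```

Returning the missing token rather than a plain boolean lets a failing test name the phase that did not report success.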

In addition to the debug-token path, this branch also requires one non-debug
known-seed regression to pass, in order to prove that the release-like path
succeeds without depending on `PH_DEBUG_CUDA_CHM02`.

## Staged Task List

1. Fix correctness blockers in the legacy CUDA `Chm02` path.
2. Establish known-seed Linux no-file-io parity.
3. Establish Linux file-io parity.
4. Move assignment and verify onto the GPU.
5. Expose explicit per-phase CUDA timing fields for measurement.
6. Add focused CUDA regression coverage:
   - known-seed path
   - non-debug known-seed path
   - non-`Assigned16` generated path
   - timing-field presence
7. Verify release-like behavior without relying on a silent CPU fallback:
   - no-file-io path
   - file-io path
   - non-debug failure propagation remains via normal `HRESULT` / verify paths

## Acceptance

- The focused CUDA `Chm02` regression tests pass when CUDA is enabled.
- Known-seed Linux coverage passes for:
  - HologramWorld known-seed, no-file-io
  - HologramWorld known-seed, file-io
  - HologramWorld known-seed, non-debug no-file-io
- Generated non-`Assigned16` coverage passes for:
  - generated `33000`-key case
- Timing fields are present and non-negative in CSV output.
- CUDA-disabled builds continue to use the non-CUDA path.
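
The timing-field acceptance item can be sketched as a small CSV probe. This is a simplified illustration, not the branch's test code: it assumes one header row and one data row, comma-separated, with no quoted fields, and every function name is hypothetical. Only the four `Cu*` column names come from this note.

```c
#include <stdlib.h>
#include <string.h>

/* Returns the zero-based column index of `name` in the header, or -1. */
static int csv_column_index(const char *header, const char *name)
{
    size_t name_len = strlen(name);
    const char *field = header;
    int index = 0;
    for (;;) {
        size_t len = strcspn(field, ",\r\n");
        if (len == name_len && strncmp(field, name, len) == 0) {
            return index;
        }
        if (field[len] != ',') {
            return -1;              /* End of header row reached. */
        }
        field += len + 1;
        index++;
    }
}

/* Parses the integer in column `index` of `row`; returns 0 on failure. */
static int csv_int_at(const char *row, int index, long long *value)
{
    const char *field = row;
    char *end;
    for (int i = 0; i < index; i++) {
        field = strchr(field, ',');
        if (field == NULL) {
            return 0;
        }
        field++;
    }
    *value = strtoll(field, &end, 10);
    return end != field;
}

/* Returns 1 when all four Cu* columns exist and hold non-negative values. */
static int cu_timing_fields_ok(const char *header, const char *row)
{
    static const char *const fields[] = {
        "CuAddKeysMicroseconds", "CuIsAcyclicMicroseconds",
        "CuAssignMicroseconds", "CuVerifyMicroseconds",
    };
    for (size_t i = 0; i < sizeof(fields) / sizeof(fields[0]); i++) {
        long long value;
        int index = csv_column_index(header, fields[i]);
        if (index < 0 || !csv_int_at(row, index, &value) || value < 0) {
            return 0;
        }
    }
    return 1;
}
```

Looking columns up by name rather than by position keeps the check valid if other columns are added or reordered later.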
4 changes: 2 additions & 2 deletions src/PerfectHash/BulkCreateBestCsv.h
@@ -281,15 +281,15 @@ Module Name:
OUTPUT_INT) \
\
ENTRY(GpuIsAcyclicButCpuIsCyclicFailures, \
-Context->GpuIsAcyclicButCpuIsCyclicFailures, \
+0, \
OUTPUT_INT) \
\
ENTRY(GpuAndCpuAddKeysSuccess, \
Context->GpuAndCpuAddKeysSuccess, \
OUTPUT_INT) \
\
ENTRY(GpuAndCpuIsAcyclicSuccess, \
-Context->GpuAndCpuAddKeysSuccess, \
+Context->GpuAndCpuIsAcyclicSuccess, \
OUTPUT_INT) \
\
ENTRY(BestCoverageAttempts, \
20 changes: 18 additions & 2 deletions src/PerfectHash/BulkCreateCsv.h
@@ -280,15 +280,15 @@ Module Name:
OUTPUT_INT) \
\
ENTRY(GpuIsAcyclicButCpuIsCyclicFailures, \
-Context->GpuIsAcyclicButCpuIsCyclicFailures, \
+0, \
OUTPUT_INT) \
\
ENTRY(GpuAndCpuAddKeysSuccess, \
Context->GpuAndCpuAddKeysSuccess, \
OUTPUT_INT) \
\
ENTRY(GpuAndCpuIsAcyclicSuccess, \
-Context->GpuAndCpuAddKeysSuccess, \
+Context->GpuAndCpuIsAcyclicSuccess, \
OUTPUT_INT) \
\
ENTRY(BestCoverageAttempts, \
Expand Down Expand Up @@ -424,6 +424,22 @@ Module Name:
Context->VerifyElapsedMicroseconds.QuadPart, \
OUTPUT_INT) \
\
+ENTRY(CuAddKeysMicroseconds, \
+Table->CuAddKeysElapsedMicroseconds.QuadPart, \
+OUTPUT_INT) \
+\
+ENTRY(CuIsAcyclicMicroseconds, \
+Table->CuIsAcyclicElapsedMicroseconds.QuadPart, \
+OUTPUT_INT) \
+\
+ENTRY(CuAssignMicroseconds, \
+Table->CuAssignElapsedMicroseconds.QuadPart, \
+OUTPUT_INT) \
+\
+ENTRY(CuVerifyMicroseconds, \
+Table->CuVerifyElapsedMicroseconds.QuadPart, \
+OUTPUT_INT) \
+\
ENTRY(BenchmarkWarmups, \
Table->BenchmarkWarmups, \
OUTPUT_INT) \
7 changes: 5 additions & 2 deletions src/PerfectHash/Chm01FileWork.c
@@ -54,7 +54,6 @@ PERFECT_HASH_FILE_WORK_ITEM_CALLBACK FileWorkItemCallbackChm01;
// Begin method implementations.
//

-#ifdef PH_WINDOWS
PERFECT_HASH_FILE_WORK_CALLBACK FileWorkCallbackChm01;

_Use_decl_annotations_
Expand Down Expand Up @@ -88,13 +87,17 @@ Return Value:
{
PFILE_WORK_ITEM Item;

+if (!ARGUMENT_PRESENT(ListEntry)) {
+return;
+}

Item = CONTAINING_RECORD(ListEntry, FILE_WORK_ITEM, ListEntry);

Item->Instance = Instance;
Item->Context = Context;

FileWorkItemCallbackChm01(Item);
}
-#endif

_Use_decl_annotations_
VOID
4 changes: 1 addition & 3 deletions src/PerfectHash/Chm01FileWorkStub.c
@@ -16,7 +16,6 @@ Module Name:

#ifdef PH_ONLINE_ONLY

-#ifdef PH_WINDOWS
PERFECT_HASH_FILE_WORK_ITEM_CALLBACK FileWorkItemCallbackChm01;

PERFECT_HASH_FILE_WORK_CALLBACK FileWorkCallbackChm01;
@@ -41,7 +40,6 @@ FileWorkCallbackChm01(

FileWorkItemCallbackChm01(Item);
}
-#endif

_Use_decl_annotations_
VOID
Expand All @@ -60,4 +58,4 @@ FileWorkItemCallbackChm01(

#endif // PH_ONLINE_ONLY

-// vim:set ts=8 sw=4 sts=4 tw=80 expandtab :
+// vim:set ts=8 sw=4 sts=4 tw=80 expandtab :