tpn · Mar 30, 2026
diff --git a/docs/chm02-cuda-mainline.md b/docs/chm02-cuda-mainline.md
@@ -0,0 +1,151 @@
+# Chm02 CUDA Mainline Note
+
+## Summary
+
+This note captures the intent of the `issue-79-chm02-cuda-mainline` branch.
+
+The branch promotes the legacy `Chm02` CUDA path from a CPU-assisted bring-up
+ state toward a first-class correctness path by moving the major solve phases
+ (`IsAcyclic`, `Assign`, `Verify`) onto the GPU while keeping CPU-oracle-style
+ validation and debugging support available during bring-up.
+
+## Goals
+
+- Fix correctness blockers in the existing `Chm02` CUDA path.
+- Make known-seed CLI runs succeed on Linux in both no-file-io and file-io
+  configurations.
+- Add regression coverage for:
+  - known-seed `Chm02` CUDA runs
+  - a generated non-`Assigned16` case
+  - timing-field presence
+- Expose explicit per-phase CUDA timing fields for measurement.
+
+## Current Mainline Meaning
+
+For this branch, “mainline” means:
+
+- the major `Chm02` solve phases are GPU-backed
+- the path is correctness-first, not throughput-first
+- CPU graph state is still required as part of the current implementation for
+  bring-up compatibility and oracle-style validation support
+
+## Non-Goals
+
+- High-throughput GPU solving.
+- Batched multi-attempt GPU construction.
+- Replacing the standalone GPU peeling POC.
+- Eliminating all CPU-oracle/debug-only code from the branch.
+
+The current `Chm02` CUDA implementation remains correctness-first, not
+ throughput-first.
+
+## Supported Scope
+
+- Algorithm: `Chm02`
+- Hash path: the branch is accepted only against the combinations covered by
+  the focused regression coverage listed under Acceptance; broader hash-family
+  support remains a follow-on concern
+- CUDA path: single-graph bring-up / validation
+- Platform focus:
+  - Linux with CUDA enabled
+  - existing regression coverage on the configured CUDA host
+
+The following supporting code changes are considered in-scope for this branch:
+
+- Linux file-work compatibility fixes needed for the `Chm02Compat` path
+- CSV/timing schema updates needed to surface CUDA phase timing
+- the Linux `QueryPerformanceFrequency()` correction that makes those timings
+  sane on non-Windows builds
+
+## Fallback / Debugging Policy
+
+- Normal operation should use the GPU path for add-keys, acyclic detection,
+  assignment, and verify.
+- CPU-oracle and order-validation logic is intended as bring-up/debug support.
+- `PH_DEBUG_CUDA_CHM02` enables extra logging and validation details for
+  troubleshooting.
+
+## Timing Contract
+
+The following CSV fields are emitted:
+
+- `CuAddKeysMicroseconds`
+- `CuIsAcyclicMicroseconds`
+- `CuAssignMicroseconds`
+- `CuVerifyMicroseconds`
+
+These are synchronized phase timings around the CUDA-backed phase wrappers, not
+ raw kernel-only device timings.
+
+Compatibility note:
+
+- this branch preserves the historical
+  `GpuIsAcyclicButCpuIsCyclicFailures` column as a zero-valued compatibility
+  stub in order to keep downstream CSV column positions stable
+- this branch intentionally adds the four `Cu*` timing fields above
+- the existing non-CUDA timing fields continue to use the same timing base;
+  the Linux `QueryPerformanceFrequency()` fix is included so that both those
+  fields and the new `Cu*` timing fields report coherent values on non-Windows
+  builds
+
+## Failure-Path Expectations
+
+- Cyclic graphs are expected to return normal non-success solve results; they
+  are not considered internal errors.
+- CUDA-disabled builds are expected to continue using the non-CUDA code paths.
+- GPU order-validation and extra CPU-oracle diagnostics are debug-only aids,
+  controlled by `PH_DEBUG_CUDA_CHM02`.
+- Non-debug runs are expected to surface failure through the normal `HRESULT`
+  and verification paths, not through verbose stderr diagnostics.
+- The current serial CUDA kernels are correctness-first and must not be treated
+  as throughput-optimized production behavior.
+
+## Debug Surface
+
+The following debug surface is intentionally supported for this bring-up phase:
+
+- `PH_DEBUG_CUDA_CHM02`
+- stderr logging from the CUDA `Chm02` path
+- stable debug tokens used by the known-seed regression harnesses:
+  - `PH_CHM02_CUDA_ORDER_OK`
+  - `PH_CHM02_CUDA_ASSIGN_OK`
+  - `PH_CHM02_CUDA_VERIFY_OK`
+
+This surface is explicitly considered temporary bring-up instrumentation, not a
+ long-term stable user-facing API.
+
+For this branch, however, the three `PH_CHM02_CUDA_*_OK` tokens are treated as
+ a supported test contract for the focused known-seed regression harness.
+
+In addition to the debug-token path, this branch also requires one non-debug
+ known-seed regression to pass, in order to prove that the release-like path
+ succeeds without depending on `PH_DEBUG_CUDA_CHM02`.
+
+## Staged Task List
+
+1. Fix correctness blockers in the legacy CUDA `Chm02` path.
+2. Establish known-seed Linux no-file-io parity.
+3. Establish Linux file-io parity.
+4. Move assignment and verify onto the GPU.
+5. Expose explicit per-phase CUDA timing fields for measurement.
+6. Add focused CUDA regression coverage:
+   - known-seed path
+   - non-debug known-seed path
+   - non-`Assigned16` generated path
+   - timing-field presence
+7. Verify release-like behavior without relying on a silent CPU fallback:
+   - no-file-io path
+   - file-io path
+   - non-debug failure propagation remains via normal `HRESULT` / verify paths
+
+## Acceptance
+
+- The focused CUDA `Chm02` regression tests pass when CUDA is enabled.
+- Known-seed Linux coverage passes for:
+  - HologramWorld known-seed, no-file-io
+  - HologramWorld known-seed, file-io
+  - HologramWorld known-seed, non-debug no-file-io
+- Generated non-`Assigned16` coverage passes for:
+  - generated `33000`-key case
+- Timing fields are present and non-negative in CSV output.
+- CUDA-disabled builds continue to use the non-CUDA path.
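The Timing Contract in the note above depends on the counter and the frequency agreeing on units. The note does not say what the Linux `QueryPerformanceFrequency()` correction actually was, so the following is only a minimal sketch of that invariant, assuming a `clock_gettime(CLOCK_MONOTONIC)` counter in nanoseconds. Every `sketch_*` / `SKETCH_*` name is hypothetical and not taken from the project's compatibility layer.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Assumption: the counter ticks once per nanosecond, so the reported
 * frequency must be 1e9 ticks per second for conversions to be sane. */
#define SKETCH_TICKS_PER_SECOND 1000000000LL

/* Hypothetical stand-in for a QueryPerformanceFrequency()-style call. */
static int64_t sketch_query_frequency(void)
{
    return SKETCH_TICKS_PER_SECOND;
}

/* Hypothetical stand-in for a QueryPerformanceCounter()-style call. */
static int64_t sketch_query_counter(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * SKETCH_TICKS_PER_SECOND + (int64_t)ts.tv_nsec;
}

/* Convert a tick delta to whole microseconds (truncating).  The
 * multiplication can overflow for deltas beyond roughly 2.5 hours at
 * 1e9 ticks per second; acceptable for a sketch, not for production. */
static int64_t sketch_elapsed_microseconds(int64_t start,
                                           int64_t end,
                                           int64_t frequency)
{
    return ((end - start) * 1000000LL) / frequency;
}
```

A phase wrapper would read the counter before and after the synchronized phase and store the converted delta in the corresponding `Cu*Microseconds` field. As an illustration of why the units matter, pairing a nanosecond counter with a reported frequency of 10,000,000 would turn a one-second interval into 100,000,000 microseconds.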
diff --git a/src/PerfectHash/BulkCreateBestCsv.h b/src/PerfectHash/BulkCreateBestCsv.h
@@ -281,15 +281,15 @@ Module Name:
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(GpuIsAcyclicButCpuIsCyclicFailures,                                                \
-          Context->GpuIsAcyclicButCpuIsCyclicFailures,                                       \
+          0,                                                                                 \
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(GpuAndCpuAddKeysSuccess,                                                           \
           Context->GpuAndCpuAddKeysSuccess,                                                  \
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(GpuAndCpuIsAcyclicSuccess,                                                         \
-          Context->GpuAndCpuAddKeysSuccess,                                                  \
+          Context->GpuAndCpuIsAcyclicSuccess,                                                \
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(BestCoverageAttempts,                                                              \

diff --git a/src/PerfectHash/BulkCreateCsv.h b/src/PerfectHash/BulkCreateCsv.h
@@ -280,15 +280,15 @@ Module Name:
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(GpuIsAcyclicButCpuIsCyclicFailures,                                                \
-          Context->GpuIsAcyclicButCpuIsCyclicFailures,                                       \
+          0,                                                                                 \
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(GpuAndCpuAddKeysSuccess,                                                           \
           Context->GpuAndCpuAddKeysSuccess,                                                  \
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(GpuAndCpuIsAcyclicSuccess,                                                         \
-          Context->GpuAndCpuAddKeysSuccess,                                                  \
+          Context->GpuAndCpuIsAcyclicSuccess,                                                \
           OUTPUT_INT)                                                                        \
                                                                                              \
     ENTRY(BestCoverageAttempts,                                                              \
@@ -424,6 +424,22 @@ Module Name:
           Context->VerifyElapsedMicroseconds.QuadPart,                                       \
           OUTPUT_INT)                                                                        \
                                                                                              \
+    ENTRY(CuAddKeysMicroseconds,                                                             \
+          Table->CuAddKeysElapsedMicroseconds.QuadPart,                                      \
+          OUTPUT_INT)                                                                        \
+                                                                                             \
+    ENTRY(CuIsAcyclicMicroseconds,                                                           \
+          Table->CuIsAcyclicElapsedMicroseconds.QuadPart,                                    \
+          OUTPUT_INT)                                                                        \
+                                                                                             \
+    ENTRY(CuAssignMicroseconds,                                                              \
+          Table->CuAssignElapsedMicroseconds.QuadPart,                                       \
+          OUTPUT_INT)                                                                        \
+                                                                                             \
+    ENTRY(CuVerifyMicroseconds,                                                              \
+          Table->CuVerifyElapsedMicroseconds.QuadPart,                                       \
+          OUTPUT_INT)                                                                        \
+                                                                                             \
     ENTRY(BenchmarkWarmups,                                                                  \
           Table->BenchmarkWarmups,                                                           \
           OUTPUT_INT)                                                                        \
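The `ENTRY(Name, Value, OUTPUT_INT)` lists in these two headers look like an X-macro table, where one list drives both the CSV header and each CSV row. The real expansion macros are not shown in this diff, so the following is a minimal sketch of the general pattern under that assumption. The `SKETCH_*` / `Sketch*` names and the two-argument `ENTRY` shape are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical row source standing in for the Context/Table fields. */
typedef struct {
    long long CuAddKeys;
    long long CuIsAcyclic;
    long long CuAssign;
    long long CuVerify;
} SKETCH_ROW;

/* One table lists every column once: name first, value expression second.
 * The first entry mirrors the zero-valued compatibility stub. */
#define SKETCH_CSV_TABLE(ENTRY)                              \
    ENTRY(GpuIsAcyclicButCpuIsCyclicFailures, 0LL)           \
    ENTRY(CuAddKeysMicroseconds,   Row->CuAddKeys)           \
    ENTRY(CuIsAcyclicMicroseconds, Row->CuIsAcyclic)         \
    ENTRY(CuAssignMicroseconds,    Row->CuAssign)            \
    ENTRY(CuVerifyMicroseconds,    Row->CuVerify)

/* Expand the table into a comma-separated header line. */
static void SketchWriteHeader(char *Buffer, size_t Size)
{
    size_t Used = 0;

    assert(Size > 0);
    Buffer[0] = '\0';
#define EMIT_NAME(Name, Value)                                         \
    if (Used < Size) {                                                 \
        Used += (size_t)snprintf(Buffer + Used, Size - Used, "%s%s",   \
                                 Used ? "," : "", #Name);              \
    }
    SKETCH_CSV_TABLE(EMIT_NAME)
#undef EMIT_NAME
}

/* Expand the same table into one comma-separated data row. */
static void SketchWriteRow(char *Buffer, size_t Size, const SKETCH_ROW *Row)
{
    size_t Used = 0;

    assert(Size > 0);
    Buffer[0] = '\0';
#define EMIT_VALUE(Name, Value)                                        \
    if (Used < Size) {                                                 \
        Used += (size_t)snprintf(Buffer + Used, Size - Used, "%s%lld", \
                                 Used ? "," : "", (long long)(Value)); \
    }
    SKETCH_CSV_TABLE(EMIT_VALUE)
#undef EMIT_VALUE
}
```

Because each column's name and value expression are written separately, the compiler cannot catch a pasted value under the wrong name. That is the kind of mismatch the `GpuAndCpuIsAcyclicSuccess` hunks above correct. Keeping a removed counter as a constant `0` entry preserves the column positions of everything after it.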

diff --git a/src/PerfectHash/Chm01FileWork.c b/src/PerfectHash/Chm01FileWork.c
@@ -54,7 +54,6 @@ PERFECT_HASH_FILE_WORK_ITEM_CALLBACK FileWorkItemCallbackChm01;
 // Begin method implementations.
 //
 
-#ifdef PH_WINDOWS
 PERFECT_HASH_FILE_WORK_CALLBACK FileWorkCallbackChm01;
 
 _Use_decl_annotations_
@@ -88,13 +87,17 @@ Return Value:
 {
     PFILE_WORK_ITEM Item;
 
+    if (!ARGUMENT_PRESENT(ListEntry)) {
+        return;
+    }
+
     Item = CONTAINING_RECORD(ListEntry, FILE_WORK_ITEM, ListEntry);
+
     Item->Instance = Instance;
     Item->Context = Context;
 
     FileWorkItemCallbackChm01(Item);
 }
-#endif
 
 _Use_decl_annotations_
 VOID
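The new `ARGUMENT_PRESENT(ListEntry)` guard sits before `CONTAINING_RECORD` for a reason worth spelling out: the macro only subtracts a field offset, so a null entry would yield a non-null, invalid item pointer whenever the field is not the first member. The following is a minimal sketch of that ordering, using hypothetical `SKETCH_*` types rather than the project's `FILE_WORK_ITEM` layout.

```c
#include <assert.h>
#include <stddef.h>

typedef struct SKETCH_LIST_ENTRY {
    struct SKETCH_LIST_ENTRY *Next;
} SKETCH_LIST_ENTRY;

/* Hypothetical work item; the list entry is deliberately not first. */
typedef struct {
    int Id;
    SKETCH_LIST_ENTRY ListEntry;
    void *Context;
} SKETCH_WORK_ITEM;

/* Recover the enclosing structure from a pointer to one of its fields. */
#define SKETCH_CONTAINING_RECORD(Address, Type, Field) \
    ((Type *)((char *)(Address) - offsetof(Type, Field)))

/* Returns NULL for a missing entry; otherwise fills in the context and
 * returns the enclosing item.  The null check must come first, because
 * the pointer arithmetic below cannot detect a null input. */
static SKETCH_WORK_ITEM *SketchItemFromEntry(SKETCH_LIST_ENTRY *Entry,
                                             void *Context)
{
    SKETCH_WORK_ITEM *Item;

    if (Entry == NULL) {
        return NULL;
    }

    assert(offsetof(SKETCH_WORK_ITEM, ListEntry) > 0);
    Item = SKETCH_CONTAINING_RECORD(Entry, SKETCH_WORK_ITEM, ListEntry);
    Item->Context = Context;
    return Item;
}
```

In the diff the callback returns early instead of returning a pointer; the sketch returns `NULL` only so the behavior can be observed from a caller.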

diff --git a/src/PerfectHash/Chm01FileWorkStub.c b/src/PerfectHash/Chm01FileWorkStub.c
@@ -16,7 +16,6 @@ Module Name:
 
 #ifdef PH_ONLINE_ONLY
 
-#ifdef PH_WINDOWS
 PERFECT_HASH_FILE_WORK_ITEM_CALLBACK FileWorkItemCallbackChm01;
 
 PERFECT_HASH_FILE_WORK_CALLBACK FileWorkCallbackChm01;
@@ -41,7 +40,6 @@ FileWorkCallbackChm01(
 
     FileWorkItemCallbackChm01(Item);
 }
-#endif
 
 _Use_decl_annotations_
 VOID
@@ -60,4 +58,4 @@ FileWorkItemCallbackChm01(
 
 #endif // PH_ONLINE_ONLY
 
-// vim:set ts=8 sw=4 sts=4 tw=80 expandtab                                     :
+// vim:set ts=8 sw=4 sts=4 tw=80 expandtab                                     :