27 changes: 27 additions & 0 deletions ADS.md
@@ -0,0 +1,27 @@
# ADS Notes

Context
- Table-size ADS (keys:Algorithm_Hash_Mask.TableSize) is only used to seed
RequestedNumberOfTableElements for the current run. If missing, creation
still works; the requested size stays 0 and the solver proceeds normally.
- Table-info ADS (table.pht1:Info) is required for Table->Vtbl->Load() and
for downstream generators/tests that read Table->TableInfoOnDisk.

Key locations
- Table-size load/save: `src/PerfectHash/PerfectHashKeysLoadTableSize.c`
- Table-info stream save: `src/PerfectHash/Chm01FileWorkTableInfoStream.c`
- Table-info load: `src/PerfectHash/PerfectHashTableLoad.c`

Implications
- Dropping table-size ADS is low risk if we accept losing the "reuse previous
table size" hint. It is already optional from a correctness standpoint.
- Dropping table-info ADS breaks load-from-disk and any tools that read
TableInfoOnDisk (CSV outputs, generators, tests). A replacement on-disk
metadata format would be required first.

Notes
- ReFS/ADS issues observed: zero-length TableSize streams caused
PH_E_INVALID_END_OF_FILE. Fixed by extending empty existing streams when
NoTruncate is set.
- If we want to avoid ADS entirely, table-size is a good first candidate to
move to a sidecar file (e.g., .TableSize) without affecting core table load.
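A minimal sketch of the sidecar idea, assuming the hint is stored as a single raw 64-bit value in host byte order. The helper names and the on-disk layout are hypothetical, not PerfectHash APIs. A missing, empty, or truncated sidecar yields 0, which matches the "requested size stays 0" behavior described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of PerfectHash): read a previously saved
 * table-size hint from a sidecar file. Returns 0 ("no hint") when the file
 * is missing, zero-length, or too short to hold a 64-bit value.
 */
static uint64_t read_table_size_hint(const char *sidecar_path)
{
    uint64_t size = 0;
    FILE *f;

    assert(sidecar_path != NULL);
    f = fopen(sidecar_path, "rb");
    if (f == NULL) {
        return 0;               /* no sidecar: proceed without a hint */
    }
    if (fread(&size, sizeof(size), 1, f) != 1) {
        size = 0;               /* zero-length or truncated: ignore it */
    }
    fclose(f);
    return size;
}

/* Hypothetical helper: persist the hint. Returns 0 on success, -1 on error. */
static int write_table_size_hint(const char *sidecar_path, uint64_t size)
{
    FILE *f;
    int ok;

    assert(sidecar_path != NULL);
    f = fopen(sidecar_path, "wb");
    if (f == NULL) {
        return -1;
    }
    ok = (fwrite(&size, sizeof(size), 1, f) == 1);
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Because an empty file reads as "no hint", the zero-length case that produced `PH_E_INVALID_END_OF_FILE` with streams would not be an error in this layout.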
3 changes: 3 additions & 0 deletions AGENTS.md
@@ -15,3 +15,6 @@

## File enum capacity
- File enums are tracked via a 64-bit bitmap. When we run out of bits, add a new enum group with a `_2`, `_3`, etc. suffix and keep the existing enum ordering rules intact.

## Auto-generated files
- `include/PerfectHashEvents.h` is auto-generated; discard local changes before rebasing or merging.
2 changes: 2 additions & 0 deletions CMakeLists.txt
@@ -38,6 +38,8 @@ if(PERFECTHASH_ENABLE_INSTALL)
PerfectHash
PerfectHashCreateExe
PerfectHashBulkCreateExe
PerfectHashServerExe
PerfectHashClientExe
RUNTIME DESTINATION ${CMAKE_INSTALL_BINDIR}
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
217 changes: 217 additions & 0 deletions IOCP-LOGS.md

Large diffs are not rendered by default.

105 changes: 105 additions & 0 deletions IOCP-PROMPT.md
@@ -0,0 +1,105 @@
# IOCP Backend Handoff Prompt (Start Here)

Continue IOCP server/client development for PerfectHash with the current state below.

## Goal

Make bulk perfect-hash creation saturate CPU effectively by decoupling work from workers with a NUMA-aware IOCP pipeline, while keeping the original (OG) executables/paths intact.

## What Is Implemented

- Additive IOCP architecture (new server/client executables; OG untouched).
- Per-NUMA node IOCP runtime:
  - one completion port per node
  - manually created worker threads
  - node/group affinity wiring.
- Named-pipe request/response state machine with:
  - ping/pong readiness
  - shutdown
  - table-create and bulk-directory requests.
- Bulk directory request flow:
  - server walks `.keys` files
  - dispatches per-file work across configured nodes
  - returns a single completion token (event/mapping) to client.
- CHM01 async state-machine path present with per-file ramp controls:
  - `--InitialPerFileConcurrency`
  - `--MaxPerFileConcurrency`
  - `--IncreaseConcurrencyAfterMilliseconds`.
- IOCP file-work path has overlapped save I/O and keys-load overlapped reads.
- IOCP buffer pool infrastructure exists and is being reworked toward NUMA lookaside-like size classes with optional guard pages.

## Important Fixes Already Landed

- Lifetime fixes:
  - retained command-line/argv buffers for bulk request lifetime
  - fixed use-after-free in bulk callback path.
- Async accounting fixes:
  - requeue failure now decrements outstanding
  - graph-submit failure rollback fixed (`ActiveGraphs`, loop counters, cleanup).
- Crash fixes:
  - guarded legacy threadpool callback usage (`TpIsTimerSet` crash source)
  - fixed stale `NumberOfBytesWritten`/sizing bugs that produced oversized files
  - fixed IOCP file sizing index bug (`FileId` vs `FileWorkId`).
- Added crash diagnostics/minidump improvements and server/client wait-for-ready behavior.
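The async accounting fixes above follow a common pattern: count the work item before queueing it, and undo the count if queueing fails. The sketch below illustrates that pattern with C11 atomics. The `job_state` type, `submit_work()`, and the enqueue callbacks are hypothetical stand-ins, not the project's actual structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical job state; loosely mirrors an "outstanding" work counter. */
typedef struct {
    atomic_long outstanding;
} job_state;

/* Hypothetical queue callback: returns true if the item was queued. */
typedef bool (*enqueue_fn)(void *item);

/*
 * Count the item before queueing so a fast completion can never observe a
 * counter that is too low, then roll the count back if queueing fails.
 */
static bool submit_work(job_state *job, enqueue_fn enqueue, void *item)
{
    atomic_fetch_add(&job->outstanding, 1);
    if (!enqueue(item)) {
        long prev = atomic_fetch_sub(&job->outstanding, 1);
        assert(prev > 0);       /* the increment above must still be there */
        (void)prev;
        return false;
    }
    return true;
}

/* Trivial callbacks used to exercise both paths. */
static bool enqueue_ok(void *item)   { (void)item; return true; }
static bool enqueue_fail(void *item) { (void)item; return false; }
```

After one successful and one failed submission the counter should read 1; a failure path that skips the decrement would leave it at 2 and the job would never drain.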

## Observed Performance Snapshot (Recent)

- On some file-I/O workloads, IOCP outperformed OG.
- On some no-file-I/O workloads, OG remained faster.
- High concurrency (`IocpConcurrency=64`) exposed memory pressure/over-allocation; pool policy still needs tuning.

## Current Risks / Gaps

- Buffer pool policy can overprovision at high concurrency.
- Need stricter invariants around async completion counters to prevent latent hangs.
- Large-payload file writes need explicit chunk/flush strategy validation.
- Need broader repeatable E2E coverage (test1/hard/sys32 subsets/full).

## Next Execution Order

1. Lock correctness first:
   - counter invariants
   - completion-once guarantees
   - failure-path decrements.
2. Finish buffer-pool redesign:
   - per-NUMA size-class global pools
   - oversize reuse pools
   - safe guarded-list rundown.
3. Harden overlapped file-I/O:
   - large write chunking
   - strict bounds/fail-fast behavior
   - OG path unchanged.
4. Re-run standardized perf matrix and tune defaults.

## Key Files

- IOCP runtime:
  - `src/PerfectHash/PerfectHashContextIocp.c`
  - `src/PerfectHash/PerfectHashContextIocp.h`
- Server/client core:
  - `src/PerfectHash/PerfectHashServer.c`
  - `src/PerfectHash/PerfectHashClient.c`
- Async engine:
  - `src/PerfectHash/PerfectHashAsync.c`
  - `src/PerfectHash/Chm01Async.c`
- IOCP file work/buffer pool:
  - `src/PerfectHash/Chm01FileWork.c`
  - `src/PerfectHash/Chm01FileWorkIocp.c`
  - `src/PerfectHash/PerfectHashIocpBufferPool.c`
  - `src/PerfectHash/PerfectHashIocpBufferPool.h`
- Ledgers:
  - `IOCP-LOGS.md`
  - `IOCP-TOOD.md`
  - `IOCP-README.md`

## Quick Commands

- Build:
  - `cmake --build build-win --config Release --target PerfectHashServerExe PerfectHashClientExe PerfectHashBulkCreateExe`
- IOCP smoke:
  - `powershell -NoProfile -ExecutionPolicy Bypass -File scripts\iocp-smoke.ps1 -BuildDir build-win -Config Release`
- IOCP stress:
  - `powershell -NoProfile -ExecutionPolicy Bypass -File scripts\stress-sys32-iocp.ps1 -BuildDir build-win -Config Release -IocpConcurrency 32 -MaxThreads 64`
- OG stress:
  - `powershell -NoProfile -ExecutionPolicy Bypass -File scripts\stress-sys32.ps1 -BuildDir build-win -Config Release -MaximumConcurrency 32`

72 changes: 72 additions & 0 deletions IOCP-README.md
@@ -0,0 +1,72 @@
# IOCP PerfectHash Backend

This document is the current high-level reference for the additive IOCP implementation.

## Scope

- New server/client path for table and bulk create requests.
- Keep OG `CreateExe` / `BulkCreateExe` behavior intact.
- Use manual Windows threads + `GetQueuedCompletionStatus()` loops.
- NUMA-aware design: one IOCP per node, affinity-aware workers.

## Design Summary

1. Client sends request over named pipe.
2. Server accepts request via IOCP pipe state machine.
3. For bulk-directory requests:
   - enumerate `.keys` files
   - dispatch per-file work items (round-robin across selected NUMA nodes)
   - track per-request counters
   - signal one completion token when all file work is complete.
4. Client waits on token and receives a final bulk status code.
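The round-robin dispatch in step 3 can be sketched as a pure function from file index to node. This assumes the selected nodes are given as a 64-bit mask (bit `n` set means node `n` is selected); `node_for_file()` and the mask representation are illustrative assumptions, not the server's actual interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose the NUMA node for the file at `file_index`
 * by cycling through the set bits of `node_mask` in ascending order.
 */
static int node_for_file(uint64_t node_mask, size_t file_index)
{
    unsigned count = 0;
    unsigned target;
    uint64_t m;
    int node;

    assert(node_mask != 0);

    /* Count the selected nodes (clears the lowest set bit each pass). */
    for (m = node_mask; m != 0; m &= m - 1) {
        count++;
    }

    /* Walk to the (file_index mod count)-th selected node. */
    target = (unsigned)(file_index % count);
    for (node = 0; node < 64; node++) {
        if (node_mask & (UINT64_C(1) << node)) {
            if (target == 0) {
                return node;
            }
            target--;
        }
    }
    return -1;  /* unreachable when node_mask != 0 */
}
```

With nodes 0, 1, and 3 selected (mask `0xB`), consecutive files map to nodes 0, 1, 3, 0, and so on. Plain round-robin ignores per-file cost, so a run of hard key sets can still land on one node.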

## Concurrency Controls

- `--IocpConcurrency`:
  - completion-port concurrency level (some paths/scripts still accept the older `--MaxConcurrency` as an alias).
- `--MaxThreads`:
  - max worker threads to create (default: `IocpConcurrency * 2`).
- Per-file async ramp:
  - `--InitialPerFileConcurrency`
  - `--MaxPerFileConcurrency`
  - `--IncreaseConcurrencyAfterMilliseconds`.
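One plausible reading of the per-file ramp is sketched below: start at the initial concurrency and add one solver work item per full elapsed interval, never exceeding the cap, with an interval of 0 disabling the ramp. The `ramp_config` struct and `ramp_target()` are hypothetical, and whether the real implementation adds one item per interval or only a single extra item is an assumption here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three ramp flags. */
typedef struct {
    uint32_t initial;      /* --InitialPerFileConcurrency */
    uint32_t max;          /* --MaxPerFileConcurrency */
    uint32_t increase_ms;  /* --IncreaseConcurrencyAfterMilliseconds; 0 disables */
} ramp_config;

/*
 * Target number of solver work items for a keys file whose first work item
 * has been running for `elapsed_ms`. Assumes one extra item per full
 * interval elapsed, clamped to the configured maximum.
 */
static uint32_t ramp_target(const ramp_config *cfg, uint64_t elapsed_ms)
{
    uint64_t extra;
    uint64_t headroom;

    assert(cfg->initial >= 1 && cfg->initial <= cfg->max);

    if (cfg->increase_ms == 0) {
        return cfg->initial;    /* ramp disabled */
    }

    extra = elapsed_ms / cfg->increase_ms;
    headroom = (uint64_t)cfg->max - cfg->initial;
    if (extra > headroom) {
        extra = headroom;
    }
    return cfg->initial + (uint32_t)extra;
}
```

Under this reading, quick key sets never leave the initial concurrency, and only files that run past the interval consume additional workers.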

## File I/O Mode

- IOCP path:
  - overlapped writes for generated files
  - overlapped keys reads
  - pooled intermediate buffers.
- OG path:
  - existing memory-mapped behavior preserved.

## Buffering

- Direction: NUMA-aware lookaside-style pooling.
- Current state:
  - size-class pools (power-of-two) are in place and evolving
  - oversize buffer handling exists and needs further hardening/tuning
  - optional guard-page behavior is supported for safer debug/development runs.
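A minimal sketch of power-of-two size-class selection, assuming classes run from 4 KB to 16 MB (the default range mentioned in the backlog) and that anything larger is treated as oversize. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CLASS_SHIFT 12u   /* 4 KB, assumed smallest class */
#define MAX_CLASS_SHIFT 24u   /* 16 MB, assumed largest class */
#define NUM_CLASSES (MAX_CLASS_SHIFT - MIN_CLASS_SHIFT + 1u)

/*
 * Returns the index of the smallest class that can hold `bytes`,
 * or -1 when the request is oversize and needs separate handling.
 */
static int size_class_index(size_t bytes)
{
    size_t capacity = (size_t)1 << MIN_CLASS_SHIFT;
    unsigned i;

    for (i = 0; i < NUM_CLASSES; i++, capacity <<= 1) {
        if (bytes <= capacity) {
            return (int)i;
        }
    }
    return -1;
}

/* Returns the payload capacity in bytes of the class at `index`. */
static size_t size_class_bytes(int index)
{
    assert(index >= 0 && (unsigned)index < NUM_CLASSES);
    return (size_t)1 << (MIN_CLASS_SHIFT + (unsigned)index);
}
```

Power-of-two classes waste up to half of each buffer in the worst case, which is one reason pool policy can overprovision when many requests fall just above a class boundary.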

## Operational Flags

- `--WaitForServer` and `--ConnectTimeout=<ms>` on client for robust startup coordination.
- `--Verbose` on server gates console output (silent by default).
- `--NoFileIo` supported for rapid compute-only stress loops.

## Current Status

- Core IOCP pipeline is functional.
- Small/medium dataset runs are stable.
- Full sys32 runs have succeeded in both `NoFileIo` and file-I/O modes under tested settings.
- Remaining work is mostly:
  - correctness hardening for edge/failure paths
  - buffer pool policy/memory scaling
  - deeper performance tuning and regression automation.

## See Also

- Execution log: `IOCP-LOGS.md`
- Active backlog: `IOCP-TOOD.md`
- Session handoff prompt: `IOCP-PROMPT.md`

102 changes: 102 additions & 0 deletions IOCP-TOOD.md
@@ -0,0 +1,102 @@
# IOCP TODO

This is the prioritized, session-ready backlog for the IOCP server/client work.

## P0: Correctness + Completion Guarantees

- [ ] Re-verify bulk completion accounting under stress (single token must always signal exactly once):
  - `OutstandingWorkItems`
  - `PendingCompletions`
  - per-node decrement paths
  - final request completion transition
- [ ] Add/assert invariants in debug builds for async counters:
  - `Job->ActiveGraphs`
  - `Context->RemainingSolverLoops`
  - `Job->Async.Outstanding`
- [ ] Audit all failure/requeue paths again to ensure every submitted work item has exactly one completion/decrement.
- [ ] Add targeted tests for:
  - `PerfectHashAsyncRequeueWork()` failure path
  - graph submit failure rollback
  - bulk finalization when one node completes last.
- [ ] Confirm server/client connection lifecycle robustness:
  - `--WaitForServer`
  - `--ConnectTimeout=<ms>`
  - ping/pong readiness before bulk submission.
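The "signal exactly once" requirement is typically met by letting only the thread that retires the last work item perform the final transition. The sketch below shows that with C11 atomics plus a debug-style assertion. The `bulk_request` type and its fields are hypothetical and only loosely mirror the counters named above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-request state. */
typedef struct {
    atomic_long outstanding;   /* work items not yet completed */
    atomic_int  signal_count;  /* debug aid: must end at exactly 1 */
} bulk_request;

static void bulk_request_init(bulk_request *r, long items)
{
    atomic_init(&r->outstanding, items);
    atomic_init(&r->signal_count, 0);
}

/*
 * Called once per finished work item, from any thread. Returns true only
 * for the single caller that retired the last item; that caller is the
 * one that signals the completion token.
 */
static bool bulk_request_complete_one(bulk_request *r)
{
    long prev = atomic_fetch_sub(&r->outstanding, 1);

    assert(prev > 0);           /* more completions than submissions */
    if (prev == 1) {
        int signals = atomic_fetch_add(&r->signal_count, 1);
        assert(signals == 0);   /* token must not be signaled twice */
        (void)signals;
        return true;
    }
    return false;
}
```

This sketch assumes the item count is known before any completion can run. If work is dispatched while the directory walk is still in progress, the enumerator needs to hold one extra count and release it when the walk ends, otherwise the counter can reach zero early.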

## P0: IOCP-Only Runtime Hygiene

- [ ] Ensure no legacy threadpool work submission is used by IOCP backend paths (except unavoidable OS-internal TP activity).
- [ ] Keep server silent by default; output only when `--Verbose` is specified.
- [ ] Re-verify `--NoFileIo` on server for fast stress loops and CI-style regression runs.

## P1: Buffer Pool Redesign Completion (Lookaside/NUMA Style)

- [ ] Finalize transition to per-NUMA global size-class pools (4KB..16MB default classes).
- [ ] Ensure oversize buffers are pooled/reusable (not leaked one-offs), with guarded-list ownership for teardown.
- [ ] Validate size-class mapping and payload offsets:
  - header precedes payload
  - `File->BaseAddress` points to payload only
  - overrun should fail fast (`PH_RAISE`).
- [ ] Add pool diagnostics (debug-only):
  - allocation count
  - in-use count
  - exhaustion count
  - per-class hit/miss.
- [ ] Harden teardown/rundown:
  - safe list flush
  - safe free after server stop
  - no outstanding buffer references.
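The "header precedes payload" layout can be sketched as below, using `malloc` in place of the real pool allocator. The header type, helper names, and the bounds check are hypothetical; a real implementation would raise (for example via `PH_RAISE`) where this sketch returns 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header stored immediately before the payload. */
typedef struct {
    size_t payload_capacity;
} pool_buffer_header;

/* Header size rounded up so the payload keeps maximal alignment. */
#define HEADER_SIZE \
    ((sizeof(pool_buffer_header) + _Alignof(max_align_t) - 1) \
     / _Alignof(max_align_t) * _Alignof(max_align_t))

/* Allocates header + payload and returns a pointer to the payload only. */
static void *buffer_alloc(size_t payload_capacity)
{
    unsigned char *raw;

    if (payload_capacity > SIZE_MAX - HEADER_SIZE) {
        return NULL;
    }
    raw = malloc(HEADER_SIZE + payload_capacity);
    if (raw == NULL) {
        return NULL;
    }
    ((pool_buffer_header *)raw)->payload_capacity = payload_capacity;
    return raw + HEADER_SIZE;
}

/* Recovers the header from a payload pointer. */
static pool_buffer_header *buffer_header(void *payload)
{
    assert(payload != NULL);
    return (pool_buffer_header *)((unsigned char *)payload - HEADER_SIZE);
}

/* Returns 1 if [offset, offset + length) lies inside the payload. */
static int buffer_range_ok(void *payload, size_t offset, size_t length)
{
    size_t capacity = buffer_header(payload)->payload_capacity;
    return offset <= capacity && length <= capacity - offset;
}

static void buffer_free(void *payload)
{
    if (payload != NULL) {
        free(buffer_header(payload));
    }
}
```

Writing the bounds check as `length <= capacity - offset` avoids the overflow that `offset + length <= capacity` would allow for very large values.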

## P1: File I/O Pipeline Hardening (Overlapped)

- [ ] Audit all CHM01 save callbacks for large-write safety with bounded, chunked writes.
- [ ] Add explicit handling for payloads that exceed current buffer capacity:
  - predictable flush-and-continue logic
  - no undefined pointer arithmetic on overflow.
- [ ] Keep OG memory-mapped path untouched and validated.
- [ ] Re-check ReFS/ADS-related edge cases (table-size stream behavior) and preserve `ADS.md` notes.
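A bounded, chunked write loop might look like the sketch below. The `write_chunk_fn` callback stands in for whatever issues the actual write (for instance an overlapped write at a given file offset); it and the counting sink are hypothetical and exist only to make the loop testable.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sink: writes `len` bytes from `data` at file offset
 * `offset`. Returns 0 on success, nonzero on failure.
 */
typedef int (*write_chunk_fn)(void *ctx, size_t offset,
                              const unsigned char *data, size_t len);

/*
 * Splits a payload of `total` bytes into writes of at most `chunk_size`
 * bytes and stops at the first failure.
 */
static int write_chunked(const unsigned char *data, size_t total,
                         size_t chunk_size, write_chunk_fn write_chunk,
                         void *ctx)
{
    size_t offset = 0;

    assert(chunk_size > 0);
    while (offset < total) {
        size_t remaining = total - offset;
        size_t len = remaining < chunk_size ? remaining : chunk_size;
        int rc = write_chunk(ctx, offset, data + offset, len);

        if (rc != 0) {
            return rc;          /* fail fast; caller decides on cleanup */
        }
        offset += len;
    }
    return 0;
}

/* Test sink that records call count, total bytes, and largest chunk. */
typedef struct {
    size_t calls;
    size_t bytes;
    size_t largest;
} count_sink;

static int count_sink_write(void *ctx, size_t offset,
                            const unsigned char *data, size_t len)
{
    count_sink *sink = ctx;

    (void)data;
    if (offset != sink->bytes) {
        return -1;              /* chunks must arrive contiguously */
    }
    sink->calls++;
    sink->bytes += len;
    if (len > sink->largest) {
        sink->largest = len;
    }
    return 0;
}
```

The loop never advances a pointer past `data + total`, which is the property the "no undefined pointer arithmetic on overflow" item asks for.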

## P1: Tests (Unit + E2E)

- [ ] Expand unit tests around IOCP buffer pool:
  - guard-page mode on/off
  - oversize class reuse
  - multi-thread pop/push behavior.
- [ ] Add IOCP file-write completion unit tests:
  - success path (event signal + buffer return)
  - error path (propagated HRESULT + release semantics).
- [ ] Add repeatable E2E matrix scripts:
  - `test1`, `hard`, `sys32-200`, `sys32-1000`
  - `NoFileIo` and file-I/O modes
  - standard ramp presets.

## P2: Performance Characterization + Tuning

- [ ] Re-run and capture normalized OG vs IOCP timings (same algo/flags/output drive):
  - `Mulshrolate4RX`
  - concurrency sets: `4`, `32`, `64`.
- [ ] Sweep per-file ramp knobs:
  - `--InitialPerFileConcurrency`
  - `--MaxPerFileConcurrency`
  - `--IncreaseConcurrencyAfterMilliseconds`
- [ ] Profile system-vs-user overhead and memory footprint at high concurrency.
- [ ] Tune default knobs for balanced throughput and memory use.

## P2: API/CLI Polish

- [ ] Keep/confirm naming split:
  - `--IocpConcurrency`
  - `--MaxThreads` (default `IocpConcurrency * 2`)
  - `--MaxPerFileConcurrency` (must be `<= IocpConcurrency`).
- [ ] Validate NUMA targeting options and document examples (all nodes, single node, mask).
- [ ] Decide whether `BulkCreate=` request form should also block on token like `BulkCreateDirectory`.
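The naming split above implies a small normalization step: fill in defaults, then reject inconsistent combinations. The sketch below assumes that 0 means "not supplied" and that a zero `IocpConcurrency` is invalid; both conventions, along with the struct and function names, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parsed options; 0 means the flag was not supplied. */
typedef struct {
    uint32_t iocp_concurrency;          /* --IocpConcurrency */
    uint32_t max_threads;               /* --MaxThreads */
    uint32_t max_per_file_concurrency;  /* --MaxPerFileConcurrency */
} iocp_options;

/*
 * Applies defaults and validates the relationships between the knobs.
 * Returns 0 on success, -1 if the combination is invalid.
 */
static int normalize_iocp_options(iocp_options *o)
{
    assert(o != NULL);

    if (o->iocp_concurrency == 0 || o->iocp_concurrency > UINT32_MAX / 2) {
        return -1;
    }
    if (o->max_threads == 0) {
        o->max_threads = o->iocp_concurrency * 2;   /* documented default */
    }
    if (o->max_per_file_concurrency == 0) {
        o->max_per_file_concurrency = o->iocp_concurrency;
    }
    if (o->max_per_file_concurrency > o->iocp_concurrency) {
        return -1;  /* must be <= IocpConcurrency */
    }
    return 0;
}
```

Centralizing these checks keeps the server and the stress scripts from drifting apart on what counts as a valid configuration.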

## Regression Commands (Baseline)

- IOCP smoke:
  - `powershell -NoProfile -ExecutionPolicy Bypass -File scripts\iocp-smoke.ps1 -BuildDir build-win -Config Release`
- IOCP stress (NoFileIo):
  - `powershell -NoProfile -ExecutionPolicy Bypass -File scripts\stress-sys32-iocp.ps1 -BuildDir build-win -Config Release -IocpConcurrency 32 -MaxThreads 64 -NoFileIo`
- OG stress:
  - `powershell -NoProfile -ExecutionPolicy Bypass -File scripts\stress-sys32.ps1 -BuildDir build-win -Config Release -MaximumConcurrency 32`

19 changes: 18 additions & 1 deletion USAGE.txt
@@ -719,6 +719,24 @@ Table Create Parameters:
Supplies the maximum number of seconds to try and solve an individual
graph.

--InitialPerFileConcurrency=N

Sets the initial number of graph solver work items to dispatch per
keys file (default: 1). This is useful for IOCP-driven bulk runs
where most key sets solve quickly and only "hard" sets should ramp
up additional concurrency.

--MaxPerFileConcurrency=N

Caps the maximum number of graph solver work items per keys file.
Defaults to the maximum concurrency supplied on the command line.

--IncreaseConcurrencyAfterMilliseconds=N

If the first solver work item has been running longer than the supplied
interval, an additional solver work item is queued (up to
--MaxPerFileConcurrency). Defaults to 500ms; use 0 to disable.

--FunctionHookCallbackDllPath=<Path>

Supplies a fully-qualified path to a .dll file that will be used as the
@@ -811,4 +829,3 @@ Mask Functions:

ID | Name
2 And
