Refactor: run sim CI in single subprocess with parallel workers #493
Merged
ChaoWao merged 1 commit into hw-native-sys:main on Apr 10, 2026
Conversation
Force-pushed 421d042 to d051ff2
Force-pushed da3e69e to 8a5e6c9
Previously sim launched one subprocess per runtime group to avoid host SO symbol collisions. With the handle-based DeviceRunner API (hw-native-sys#483), multiple runtimes can coexist in a single process.

- Replace `run_sim_tasks_subprocess` (per-runtime subprocesses) with a single `_run_device_worker_subprocess` call for all tasks
- Add parallel sim execution: tasks are distributed across `cpu_count // 20` virtual device IDs, each with its own ChipWorker in a thread
- `ChipWorker::run()` uses `std::thread` internally so the real work runs outside the Python GIL, enabling true parallelism
- Add a `timeout` parameter to `_run_device_worker_subprocess` using `subprocess.run(timeout=)` for a clean process kill on deadlock
- Thread-safe progress output: `[devN] [M/total] PASS/FAIL: task (Xs)`
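The scheduling scheme in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `run_one_task` is a hypothetical stand-in for the per-task work that the real ChipWorker performs off the GIL, and the worker count and progress format follow the description above.

```python
# Sketch of distributing sim tasks across virtual device IDs, one worker
# thread per device, with thread-safe progress output. Hypothetical names.
import os
import queue
import threading
import time

def run_sim_tasks(tasks, run_one_task, workers=None):
    """Run tasks on `workers` virtual devices; return {task: passed}."""
    if workers is None:
        workers = max(1, (os.cpu_count() or 1) // 20)
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    lock = threading.Lock()  # serializes progress output and shared state
    done = [0]
    total = len(tasks)
    results = {}

    def worker(dev_id):
        while True:
            try:
                task = work.get_nowait()
            except queue.Empty:
                return
            start = time.time()
            # In the real PR, the heavy work runs in a std::thread inside
            # ChipWorker::run(), so it does not hold the Python GIL.
            ok = run_one_task(dev_id, task)
            with lock:
                done[0] += 1
                results[task] = ok
                status = "PASS" if ok else "FAIL"
                print(f"[dev{dev_id}] [{done[0]}/{total}] {status}: {task} "
                      f"({time.time() - start:.1f}s)")

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because each virtual device pulls its next task from a shared queue, slow tasks do not stall the other workers.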
Force-pushed 8a5e6c9 to 0089d5a
ChaoWao approved these changes on Apr 10, 2026
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request on Apr 11, 2026
On macOS, `python ci.py -p a2a3sim` (or a5sim) aborts every task with "OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized" (SIGABRT) before any DeviceRunner code runs.

Two distinct libomp.dylib copies get mapped into the single CI process: homebrew's /opt/homebrew/opt/libomp/lib/libomp.dylib (via numpy -> openblas) and pip torch's .venv/.../torch/lib/libomp.dylib. They have different install names, so dyld loads them both and Intel's libomp aborts on the second init. This surfaced after hw-native-sys#493 collapsed sim CI into one long-lived Python process; each golden's `import numpy`/`import torch` now accumulates conflicting libomps in the same address space.

- Set KMP_DUPLICATE_LIB_OK=TRUE at the top of ci.py on darwin, before any import that can transitively pull in numpy or torch. This is Intel's documented escape hatch; it is safe for our workload, where numpy and torch are only used for golden reference math, not parallel OMP regions.
- Document the full root cause, debugging steps, and an explicit "what not to do" list in docs/macos-libomp-collision.md so future contributors don't re-investigate. Link it from docs/ci.md.
- Rewrite the two remaining numpy-based goldens (a2a3/{aicpu,host}_build_graph/bgemm) in torch for style consistency with the rest of examples/. Note this does not avoid the libomp collision on its own -- `import torch` transitively imports numpy.

Verified: `python ci.py` passes 32/32 sim tests (20 a2a3sim + 12 a5sim) on macOS without KMP_DUPLICATE_LIB_OK needing to be set manually.
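The first fix in the list above is a guard at the very top of `ci.py`. A minimal sketch of what such a guard could look like (the exact placement and style in the real `ci.py` may differ):

```python
# Darwin-only escape hatch for the dual-libomp abort described above.
# This must run before anything that can transitively import numpy or
# torch, so it belongs at the very top of ci.py.
import os
import sys

if sys.platform == "darwin":
    # Two libomp.dylib copies (homebrew's via numpy->openblas, and the
    # one bundled with pip torch) would otherwise both be loaded by dyld,
    # and Intel's libomp aborts with "OMP: Error #15" on the second init.
    # KMP_DUPLICATE_LIB_OK=TRUE is Intel's documented override; safe here
    # because numpy/torch are only used for golden reference math.
    os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")
```

Using `setdefault` rather than a plain assignment leaves room for a developer to override the value from the shell when debugging.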
Summary

- Tasks are distributed across `cpu_count // 20` virtual device IDs, each with its own ChipWorker in a separate thread
- `run_runtime` executes inside `DeviceRunner::create_thread()`, so each invocation gets proper device binding (sim: `pto_cpu_sim_bind_device`, onboard: `rtSetDevice`) without holding the Python GIL
- `reset_device_context()` on onboard after each run destroys streams + `rtDeviceReset`, enabling clean re-creation on the next run's thread
- `set_device` on onboard is now a no-op; device/stream init moved to `run_runtime`'s worker thread via `ensure_device_set`
- `subprocess.run(timeout=)` for a clean kill on deadlock; the sim subprocess runs quiet with `PTO_LOG_LEVEL=warn`

Testing
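One behavior worth exercising locally is the deadlock-timeout path: `subprocess.run(timeout=)` raises `TimeoutExpired` after killing the child, so a hung sim turns into a reported failure rather than a stuck CI run. A small hypothetical harness (the real `_run_device_worker_subprocess` signature may differ):

```python
# Hypothetical harness for the timeout kill path: subprocess.run(timeout=)
# kills the worker subprocess if the sim deadlocks, and we report FAIL.
import subprocess
import sys

def run_device_worker(cmd, timeout_s):
    """Run a worker command; treat a timeout as a clean FAIL, not a hang."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False

# Simulated deadlock: a child that sleeps far longer than the timeout.
hang = [sys.executable, "-c", "import time; time.sleep(60)"]
ok = [sys.executable, "-c", "pass"]
```

With a short timeout, the `hang` command returns `False` after about one timeout interval instead of blocking CI for the full sleep.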