
Refactor: run sim CI in single subprocess with parallel workers#493

Merged
ChaoWao merged 1 commit into hw-native-sys:main from hw-native-sys-bot:refactor/sim-single-subprocess
Apr 10, 2026

Conversation


@hw-native-sys-bot (Collaborator) commented on Apr 9, 2026

Summary

  • Replace per-runtime subprocess isolation with a single subprocess for all sim tasks — multiple runtimes coexist via handle-based DeviceRunner API (Refactor: replace DeviceRunner singleton with handle-based C API #483)
  • Add parallel sim execution: tasks distributed across cpu_count // 20 virtual device IDs, each with its own ChipWorker in a separate thread
  • run_runtime executes inside DeviceRunner::create_thread() so each invocation gets proper device binding (sim: pto_cpu_sim_bind_device, onboard: rtSetDevice) without holding Python GIL
  • Add reset_device_context() on onboard after each run to destroy streams + rtDeviceReset, enabling clean re-creation on the next run's thread
  • set_device on onboard is now a no-op — device/stream init moved to run_runtime's worker thread via ensure_device_set
  • Subprocess timeout via subprocess.run(timeout=) for clean kill on deadlock; sim subprocess runs quiet with PTO_LOG_LEVEL=warn
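The timeout mechanism from the last bullet can be sketched in Python. The helper name, command shape, and default timeout below are illustrative assumptions, not the actual ci.py code; only the mechanism (`subprocess.run(timeout=)` plus `PTO_LOG_LEVEL=warn`) comes from this PR:

```python
# Hypothetical sketch of the subprocess timeout described above. subprocess.run
# kills the child process itself when the timeout expires, which gives a clean
# kill on deadlock without extra signal handling.
import os
import subprocess
import sys

def run_with_timeout(cmd, timeout_s=1800):
    """Run one worker subprocess; returns True iff it exits 0 within the timeout."""
    env = dict(os.environ, PTO_LOG_LEVEL="warn")  # keep the sim run quiet
    try:
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        print(f"worker timed out after {timeout_s}s", file=sys.stderr)
        return False

# Example: all sim tasks go through one such call instead of one per runtime.
# ok = run_with_timeout([sys.executable, "ci_worker.py", *task_names])
```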

Testing

  • a5sim: 12/12 pass (parallel, ~10s on 320-core machine)
  • a2a3 onboard device 2: host_build_graph 5/5, aicpu_build_graph 4/4, tensormap_and_ringbuffer 21/21


@hw-native-sys-bot force-pushed the refactor/sim-single-subprocess branch from 421d042 to d051ff2 on April 9, 2026 05:05
@hw-native-sys-bot changed the title from "Refactor: run sim CI in a single subprocess instead of per-runtime" to "Refactor: run sim CI in single subprocess with parallel workers" on Apr 9, 2026
@hw-native-sys-bot force-pushed the refactor/sim-single-subprocess branch 7 times, most recently from da3e69e to 8a5e6c9 on April 10, 2026 07:54
Previously sim launched one subprocess per runtime group to avoid
host SO symbol collisions. With the handle-based DeviceRunner API
(hw-native-sys#483), multiple runtimes can coexist in a single process.

- Replace run_sim_tasks_subprocess (per-runtime subprocesses) with
  a single _run_device_worker_subprocess call for all tasks
- Add parallel sim execution: tasks distributed across cpu_count/20
  virtual device IDs, each with its own ChipWorker in a thread
- ChipWorker::run() uses std::thread internally so the real work
  runs outside the Python GIL, enabling true parallelism
- Add timeout parameter to _run_device_worker_subprocess using
  subprocess.run(timeout=) for clean process kill on deadlock
- Thread-safe progress: [devN] [M/total] PASS/FAIL: task (Xs)
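The distribution scheme in the bullets above can be sketched as follows. ChipWorker itself lives in C++; `run_task` here is a hypothetical Python stand-in for dispatching one task to one device's worker, and only the cpu_count // 20 split and the `[devN] [M/total]` progress format come from the commit message:

```python
# Illustrative sketch: one worker thread per virtual device ID, all pulling
# tasks from a shared queue, with a lock guarding the progress counter so
# progress lines from concurrent workers do not interleave.
import os
import queue
import threading

def run_parallel(tasks, run_task, cores_per_device=20):
    n_devices = max(1, (os.cpu_count() or 1) // cores_per_device)
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results, lock = [], threading.Lock()
    total = len(tasks)

    def worker(dev_id):
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # no tasks left for this device
            ok = run_task(dev_id, task)
            with lock:  # thread-safe progress line
                results.append(ok)
                status = "PASS" if ok else "FAIL"
                print(f"[dev{dev_id}] [{len(results)}/{total}] {status}: {task}")

    threads = [threading.Thread(target=worker, args=(d,)) for d in range(n_devices)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return all(results)
```

True parallelism in the real change relies on ChipWorker::run() doing its work on a std::thread outside the Python GIL; plain Python callables as above would serialize on the GIL.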
@hw-native-sys-bot force-pushed the refactor/sim-single-subprocess branch from 8a5e6c9 to 0089d5a on April 10, 2026 08:07
@ChaoWao merged commit a90b0a2 into hw-native-sys:main on Apr 10, 2026
24 of 26 checks passed
@ChaoWao deleted the refactor/sim-single-subprocess branch on April 10, 2026 11:09
ChaoWao added a commit to ChaoWao/simpler-fork that referenced this pull request Apr 11, 2026
On macOS, `python ci.py -p a2a3sim` (or a5sim) aborts every task with
"OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib
already initialized" (SIGABRT) before any DeviceRunner code runs.

Two distinct libomp.dylib copies get mapped into the single CI process:
homebrew's /opt/homebrew/opt/libomp/lib/libomp.dylib (via numpy ->
openblas) and pip torch's .venv/.../torch/lib/libomp.dylib. They have
different install names, so dyld loads them both and Intel's libomp
aborts on the second init. Surfaced after hw-native-sys#493 collapsed sim CI into
one long-lived Python process; each golden's `import numpy`/`import
torch` now accumulates conflicting libomps in the same address space.

- Set KMP_DUPLICATE_LIB_OK=TRUE at the top of ci.py on darwin, before
  any import that can transitively pull in numpy or torch. This is
  Intel's documented escape hatch; safe for our workload where numpy
  and torch are only used for golden reference math, not parallel
  OMP regions.
- Document the full root cause, debugging steps, and explicit
  "what not to do" list in docs/macos-libomp-collision.md so future
  contributors don't re-investigate. Link it from docs/ci.md.
- Rewrite the two remaining numpy-based goldens
  (a2a3/{aicpu,host}_build_graph/bgemm) in torch for style consistency
  with the rest of examples/. Note this does not avoid the libomp
  collision on its own -- `import torch` transitively imports numpy.

Verified: `python ci.py` passes 32/32 sim tests (20 a2a3sim +
12 a5sim) on macOS without KMP_DUPLICATE_LIB_OK needing to be set
manually.
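A minimal sketch of the darwin guard described in the first bullet of that commit; this mirrors the approach rather than reproducing ci.py:

```python
# Sketch: top of ci.py, before any import that can transitively pull in numpy
# or torch. Intel libomp aborts ("OMP: Error #15") when a second libomp.dylib
# initializes in the same process; KMP_DUPLICATE_LIB_OK=TRUE is Intel's
# documented escape hatch. Safe for this workload because numpy/torch only do
# golden reference math, not parallel OMP regions.
import os
import sys

if sys.platform == "darwin":
    os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# ...only after the guard is in place may modules that load numpy or torch
# be imported...
```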