
ci: replace deprecated zmq with pyzmq in CI scripts #3007

Open
sunway513 wants to merge 4 commits into main from fix/ci-pyzmq


@sunway513
Collaborator

Summary

Files changed

  • .github/scripts/build_aiter_triton.sh — triton test setup
  • .github/workflows/aiter-test.yaml — standard + MI300X tests (3 occurrences)
  • .github/workflows/vllm_benchmark.yaml — vLLM benchmark

Test plan

  • CI pre-checks should pass (no code changes)
  • Triton test shards should no longer fail at setup with pyzmq resolution errors

sunway513 requested review from a team and Copilot on May 3, 2026, 02:27
@github-actions
Contributor

github-actions Bot commented May 3, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or `gh pr edit 3007 --add-label <label>`

Contributor

Copilot AI left a comment


Pull request overview

This PR updates CI dependency installation to use pyzmq (the actual ZeroMQ Python bindings) instead of the deprecated zmq meta-package, addressing intermittent CI setup failures during Triton test shard runs.

Changes:

  • Replace pip install ... zmq ... with pip install ... pyzmq ... in CI workflows.
  • Update the Triton CI build helper script to install pyzmq instead of zmq.
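One way to confirm that no bare `zmq` install remains after a change like this is to grep the CI files. The sketch below is illustrative only: the helper name `find_bare_zmq_installs` and its regular expression are assumptions and are not part of this PR.

```shell
# Hypothetical audit helper (not part of the PR): list pip install lines
# that still request the bare `zmq` package rather than `pyzmq`.
find_bare_zmq_installs() {
    # $1: directory to scan (for example, .github)
    # `zmq` must be preceded by whitespace, so `pyzmq` is not reported.
    # Only unpinned occurrences are matched (no `zmq==...`).
    grep -rnE 'pip3? install.*[[:space:]]zmq([[:space:]]|$)' "$1"
}
```

Calling `find_bare_zmq_installs .github` from the repository root should print nothing once every occurrence has been switched; grep exits non-zero when it finds no match.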

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/aiter-test.yaml | Switch CI dependency install from zmq to pyzmq in the main test job(s). |
| .github/workflows/vllm_benchmark.yaml | Switch benchmark workflow dependency install from zmq to pyzmq. |
| .github/scripts/build_aiter_triton.sh | Switch Triton setup script dependency install from zmq to pyzmq. |


sunway513 added 4 commits May 3, 2026 14:28
The `zmq` meta-package fails to install on some CI runners because
it cannot resolve the `pyzmq` dependency. Use `pyzmq` directly,
which is the actual package providing ZeroMQ bindings for Python.

Fixes Triton Test Shard 7 setup failures.

Set pip global retries=15 and timeout=120s in build_aiter_triton.sh
to handle transient PyPI network failures on self-hosted runners.
Shard 5/7 failures were caused by RemoteDisconnected during pip install.

pyzmq is only used by aiter.dist.shm_broadcast, not by any triton
test. When PyPI is unreachable on self-hosted runners, the pyzmq
install failure should not block the entire CI shard.

Split pyzmq into a separate pip install with || fallback so triton
tests can proceed even when PyPI connectivity is degraded.

When batch pip install fails (e.g., PyPI connectivity issues on
self-hosted runners), retry each package individually. Only pyzmq
is allowed to fail silently since it's only used by
aiter.dist.shm_broadcast and not required by any CI test suite.

Critical packages (pandas, einops, numpy) must still succeed.
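
The batch-then-individual fallback described in the last commit message could look roughly like the sketch below. It is not the code from build_aiter_triton.sh: the names `PIP`, `REQUIRED_PKGS`, `OPTIONAL_PKGS`, and `install_ci_deps` are hypothetical, and the package lists are taken only from the commit message.

```shell
# Hypothetical sketch; names and structure are assumptions, not the PR's code.
# PIP is overridable so the function can be exercised without network access.
PIP="${PIP:-python3 -m pip}"
REQUIRED_PKGS="pandas einops numpy"   # must install, per the commit message
OPTIONAL_PKGS="pyzmq"                 # allowed to fail

install_ci_deps() {
    # Fast path: a single batch install of everything.
    # $PIP and the package lists are intentionally unquoted to word-split.
    if $PIP install $REQUIRED_PKGS $OPTIONAL_PKGS; then
        return 0
    fi
    echo "Batch install failed; retrying packages one by one" >&2
    for pkg in $REQUIRED_PKGS; do
        # A required package that still fails aborts the setup.
        $PIP install "$pkg" || { echo "ERROR: required package '$pkg' failed" >&2; return 1; }
    done
    for pkg in $OPTIONAL_PKGS; do
        # An optional package only produces a warning.
        $PIP install "$pkg" || echo "WARNING: optional package '$pkg' skipped" >&2
    done
    return 0
}
```

Keeping the required and optional lists separate makes the policy explicit: only packages named in `OPTIONAL_PKGS` can be skipped, so a missing pandas, einops, or numpy still fails the shard.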
@sunway513
Collaborator Author

CI Status: ALL GREEN (32 pass / 0 fail / 0 pending)

This PR fixes intermittent CI failures caused by pip install zmq on self-hosted runners:

  1. Replace deprecated zmq with pyzmq in all CI scripts and workflows
  2. Add pip retry logic (retries=15, timeout=120s) in build_aiter_triton.sh
  3. Make pyzmq non-blocking — if PyPI is unreachable, pyzmq install failure doesn't block the test suite (pyzmq is only used by aiter.dist.shm_broadcast, not by any CI test)
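
For step 2, one way to apply the retry and timeout values to every pip call in a script is through pip's environment variables. This is a sketch of one possible mechanism, not necessarily the one used in build_aiter_triton.sh, which may use `pip config set` or command-line flags instead.

```shell
# Assumed mechanism: pip reads its options from PIP_<OPTION> environment
# variables. The values come from the PR description.
export PIP_RETRIES=15           # same effect as --retries 15
export PIP_DEFAULT_TIMEOUT=120  # same effect as --timeout 120 (seconds)
```

The per-command equivalent is `pip install --retries 15 --timeout 120 <packages>`.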

Root cause: PR #2897 introduced `pip install zmq` (a deprecated meta-package), which fails intermittently under newer pip resolvers. The actual package is `pyzmq`.
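
Because pyzmq installs the importable module `zmq`, switching the package name does not require changing any Python import. A small check along these lines can confirm that the module is present after setup; the names `PYTHON` and `zmq_importable` are hypothetical and not part of the PR.

```shell
# Hypothetical check (not part of the PR). PYTHON is overridable so the
# function can be pointed at a specific interpreter.
PYTHON="${PYTHON:-python3}"

zmq_importable() {
    # pyzmq is the distribution name on PyPI; the module it installs is `zmq`.
    "$PYTHON" -c 'import zmq' >/dev/null 2>&1
}
```

A CI step could call `zmq_importable || echo "WARNING: zmq module not importable" >&2` to surface a skipped pyzmq install without failing the job.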

@lipeng-amd Ready for review.
