[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends#90
[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends#90albanD merged 3 commits intopytorch:masterfrom
Conversation
albanD
left a comment
There was a problem hiding this comment.
Great proposal!
Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and doesn't block the first steps!
|
|
||
| The allowlist is designed to naturally support gradual progression from experimental participation to mature participation. The table below lists the requirements for advancing to each level. | ||
|
|
||
| | Phase | Level | Requirements | |
There was a problem hiding this comment.
For the requirements, it might be helpful to separate Infra availability vs legitimate test breakage.
I think the requirement we want to have here is both:
- Very strong requirement on Infra availability.
- More relaxed requirement on test breakage
We want to encourage both for sure, but they will be managed very differently so we most likely want to provide signal to backend writers independently.
There was a problem hiding this comment.
Excellent point. Will update the requirements to separate two dimensions
Hi @albanD, so happy to get your approval, thank you. The initial code for L1 and L2 is complete, and I will submit a PR soon. I'll let you know when it's finished. |
|
Hey @albanD, the new commit is ready, please help to review it again, thank you. |
|
@fffrog @can-gaa-hou, a couple follow up clarifications after looking at the pytorch/test-infra#7847 in addition to the comments left on the PR
Phase 2 should expose:
|
albanD
left a comment
There was a problem hiding this comment.
Sounds good to me!
Also I agree with Zain about splitting to make sure we can land L1 asap!
@albanD, thank you for your approval. We have been quite busy lately, but we will do our best to complete L1 by March 28th, and sorry again for the delay. |
|
Hey @ZainRizvi
Got it, thank you. We will implement the L1 plan as soon as possible. We sincerely apologize for any delays caused by our busy internal affairs.
Gotcha, thank you. Phase 2 should expose:
I knew, thank you for the detailed explanation, and we will start the L2 developement once the L1 is completed |
|
Merging! |
All PRs for L1 of this RFCTo implement L1 of this RFC, we have submitted the following three PRs across different repositories to set up the necessary infrastructure and features:
These PRs are interconnected and collectively fulfill the implemetation outlined in the RFC. |
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. ghstack-source-id: a936347 Pull-Request: #178681
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. ghstack-source-id: b187886 Pull-Request: #178681
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. ghstack-source-id: 90e2e71 Pull-Request: #178681
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. ghstack-source-id: 052e046 Pull-Request: #178681
## Author - @can-gaa-hou - @KarhouTam # Summary Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. This PR implements the initial version of the cross-repository CI relay described in [[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends](pytorch/rfcs#90). The current implementation focuses on the first two levels defined in the RFC: - `L1`: downstream repos can be onboarded and triggered through the relay Higher-level behaviors for `L2`, `L3`, and `L4` are intentionally left for follow-up work. # Architecture The relay is split into two AWS Lambda functions: - `webhook_handler` - [x] receives GitHub webhook PR and push events from the upstream repo - [x] validates webhook signatures and authenticates with AWS Secret Manager - [x] reads the downstream whitelist from the URL and stores it in Redis - [x] for `create`/`reopen`/`synchronize` actions, forwards repository_dispatch events to downstream repos # Changes ```md .github/workflows/ ├── cross-repo-ci-relay-tests.yml # CI workflow for cross-repo-ci-relay └──_lambda-do-release-runners.yml # Add cross-repo-ci-relay release workflow aws/lambda/cross_repo_ci_relay/ ├── tests/ # Unit tests for cross-repo-ci-relay ├── README.md # project overview, env vars, build/deploy, and callback usage ├── Makefile # build, package, deploy, and clean commands for Lambda ├── allowlist.py # Functions to handle the allowlist from GitHub ├── config.py # shared runtime config loading ├── utils.py # shared utility helpers and common exceptions ├── redis_helper.py # Redis helpers for whitelist cache ├── lambda_function.py # Lambda entrypoint for GitHub webhook requests ├── gh_helper.py # GitHub App / repository_dispatch client helpers ├── event_handler.py # Functions to handle PR and push events ├── local_server.py # For local tests, see README.md └── requirements.txt # Python dependencies for the webhook Lambda package ``` # Usage See README.md for more details. # Verification We performed the following scenario verification on our AWS Lambda instance: - [x] Test with Upstream PR create/reopen/synchronize and push events triggering webhook, then redispatching to the Downstream CI (different organization) workflow. # Terraform configuration pytorch/ci-infra#415 # Unit Tests - [x] Unit Tests (Mock) cc @fffrog --------- Co-authored-by: KarhouTam <karhou.tam@outlook.com> Co-authored-by: fffrog <ljw1101.vip@gmail.com>
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. Pull Request resolved: #178681 Approved by: https://github.com/ZainRizvi, https://github.com/albanD
| - **Upstream/downstream decoupling:** Downstream repos **only need to install this App to join the cross-repo CI coordination**. Downstream repos do not need an upstream token, and the upstream does not need to know about the downstream. All interactions are bridged through the GitHub App and Relay Server. | ||
|
|
||
| > \[!NOTE\] | ||
| > This GitHub App should be created under the `pytorch` organization and owned by the PyTorch team or the LF AI & Data Foundation team, to ensure credibility. An App created by a third party will face trust issues during installation and adoption. |
There was a problem hiding this comment.
LF AI & Data Foundation should be "LF Pytorch Foundation" team.
LF AI & Data Foundation is a different foundation under the left separate from the PyTorch Foundation.
## Note Due to the restrictions on secret injection in the fork Repo scenario on GitHub, a new PR needs to be created to replace the old one (#415). Please refer to the old PR for a detailed discussion. ## Summary Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. - Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards `repository_dispatch` events to registered downstream repositories - Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL - Add two GitHub Actions workflows: `crcr-on-pr.yml` and `crcr-deploy-prod.yml` **Notes:** This PR need to wait [this](pytorch/test-infra#7847) merged first for purpose of updating tag field in Terrafile. ## Architecture GitHub App → Lambda webhook (Function URL) → `repository_dispatch` → downstream repos **AWS Resources (us-east-1, account 391835788720):** - Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime - ElastiCache Redis replication group for allowlist caching - VPC with private subnets for Lambda ↔ Redis connectivity - IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions - S3 backend for Terraform state ## Test Multiple deployments and verifications have been completed on personal AWS environment. --------- Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
This RFC has been under discussion for several weeks, you can visit this link to see previous discussions if you are interesed in.
And thanks a lot @ZainRizvi, @seemethere, @afrittoli, @zxiiro, @mikaylagawarecki, and @jewelkm89 for the valuable suggestions.
Click here to see a preview of this RFC.