Add Cross Repository CI Relay (CRCR) infrastructure #415
fffrog wants to merge 6 commits into pytorch:main
ZainRizvi left a comment:
Good progress on the previous feedback! A few more things I noticed.
ZainRizvi left a comment:
Two more small things I noticed.
@ZainRizvi I'm so sorry for the inconvenience; I am brand new to AWS and Terraform. However, I learned a lot from your comments. Thank you again.
@fffrog no need to feel sorry. Back and forth during code review is expected and is a valuable part of the process. Thanks for contributing to the project!
Hi @zxiiro, thanks a lot for your patience.
## Author
- @can-gaa-hou
- @KarhouTam

# Summary
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation.

This PR implements the initial version of the cross-repository CI relay described in [[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends](pytorch/rfcs#90). The current implementation focuses on the first two levels defined in the RFC:
- `L1`: downstream repos can be onboarded and triggered through the relay

Higher-level behaviors for `L2`, `L3`, and `L4` are intentionally left for follow-up work.

# Architecture
The relay is split into two AWS Lambda functions:
- `webhook_handler`
  - [x] receives GitHub webhook PR and push events from the upstream repo
  - [x] validates webhook signatures and authenticates with AWS Secrets Manager
  - [x] reads the downstream whitelist from the URL and stores it in Redis
  - [x] for `create`/`reopen`/`synchronize` actions, forwards repository_dispatch events to downstream repos

# Changes
```md
.github/workflows/
├── cross-repo-ci-relay-tests.yml    # CI workflow for cross-repo-ci-relay
└── _lambda-do-release-runners.yml   # Add cross-repo-ci-relay release workflow
aws/lambda/cross_repo_ci_relay/
├── tests/                # Unit tests for cross-repo-ci-relay
├── README.md             # project overview, env vars, build/deploy, and callback usage
├── Makefile              # build, package, deploy, and clean commands for Lambda
├── allowlist.py          # Functions to handle the allowlist from GitHub
├── config.py             # shared runtime config loading
├── utils.py              # shared utility helpers and common exceptions
├── redis_helper.py       # Redis helpers for whitelist cache
├── lambda_function.py    # Lambda entrypoint for GitHub webhook requests
├── gh_helper.py          # GitHub App / repository_dispatch client helpers
├── event_handler.py      # Functions to handle PR and push events
├── local_server.py       # For local tests, see README.md
└── requirements.txt      # Python dependencies for the webhook Lambda package
```

# Usage
See README.md for more details.

# Verification
We performed the following scenario verification on our AWS Lambda instance:
- [x] Test with Upstream PR create/reopen/synchronize and push events triggering webhook, then redispatching to the Downstream CI (different organization) workflow.

# Terraform configuration
pytorch/ci-infra#415

# Unit Tests
- [x] Unit Tests (Mock)

cc @fffrog

---------

Co-authored-by: KarhouTam <karhou.tam@outlook.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>
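The signature validation step listed for `webhook_handler` can be illustrated with a minimal sketch. It assumes the standard GitHub scheme, in which the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body. The function name `is_valid_signature` is hypothetical and is not taken from the relay's code.

```python
import hashlib
import hmac


def is_valid_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a GitHub-style `X-Hub-Signature-256` header against the raw body.

    Hypothetical helper for illustration; the relay's real function may differ.
    """
    if not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch position is not leaked.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

The check must run on the raw request bytes, before any JSON parsing, because re-serializing the payload can change whitespace and invalidate the digest.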
This is ready to merge once the tag is updated and the merge commit with the main branch is resolved.
Has the GitHub App already been created? According to the RFC it should be created in the
Once the app is created, though, I can definitely help with adding the secrets to the ci-infra repo before we merge this PR.
**Summary**:
- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards `repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and `crcr-deploy-prod.yml`

**Architecture**: GitHub App → Lambda webhook (Function URL) → `repository_dispatch` → downstream repos
- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group (`cache.t3.small`) for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
- S3 backend for Terraform state

**Test**: Multiple deployments and verifications have been completed in a personal AWS environment.

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
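As a rough illustration of the last hop in this architecture, the sketch below builds (but does not send) a `repository_dispatch` request for one downstream repository, using GitHub's `POST /repos/{owner}/{repo}/dispatches` endpoint. The function name, the `event_type` value, and the payload fields are assumptions made for the example, not the relay's actual names.

```python
import json
import urllib.request


def build_dispatch_request(repo: str, token: str, event_type: str,
                           client_payload: dict) -> urllib.request.Request:
    """Build a repository_dispatch request for `repo` ("owner/name").

    Hypothetical helper: the real relay may use a GitHub client library instead.
    """
    body = json.dumps(
        {"event_type": event_type, "client_payload": client_payload}
    ).encode("utf-8")
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            # Assumed to be a GitHub App installation token for the downstream repo.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

A caller would pass the returned object to `urllib.request.urlopen`. Building the request separately from sending it keeps the payload shape easy to unit test with mocks.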
- Fix some ElastiCache bugs
- Update README.md to match the CRCR code
- Create the Secrets Manager secret via Terraform rather than manually
- Move REDIS_LOGIN from the environment to Secrets Manager
- Make all Terraform labels and names clearer and easier to understand
Hi @ZainRizvi, thanks! I've rebased onto main and updated the tag.
@zxiiro Thank you a lot.
I’m afraid not — the GitHub App needs to be created first.
Alright, after talking to @fffrog we agreed that it should be possible to create the app in the pytorch-fdn org, where I do have permission to create apps. I went ahead and created the app.
@fffrog your doc mentioned granting "Read & Write" permissions for "Actions", but it's unclear to me whether the app actually needs that. Will we be using it in the future? If it's not needed, maybe we can reduce the permissions.
The only open item now is that I need to update the Webhook URL once the Lambda is created. If everyone agrees and is ready, I think we should be good to merge this PR to get the Lambdas created.
Hi @zxiiro, thank you a lot for creating the GitHub App and approving the PR. Yes, we need Actions with "Read & Write" because, for L3, we will need to rerun failed actions in the downstream repo from the PyTorch side. However, the design of L3 is not finalized yet, so we can remove the permission for now and add it back if we really need it in the future.
As the title states.
Hi @ZainRizvi @zxiiro, The workflow failed yesterday [1][2], so I’ve pushed some updates to address it:
Regarding [2], the root cause seems to be that GitHub restricts Secret injection for PRs originating from a forked repository. To resolve this, would it be possible for me to create a branch directly within |
Yes, we do require folks working on ci-infra to be able to write directly to the repo. I've sent you an invite to join the repo with "write" access. Once you accept, you should be able to push to a branch directly in ci-infra rather than your fork.
Hi @zxiiro, thank you a lot for inviting me as a collaborator. I have accepted the invite and will push a commit and create a PR. Thank you again.
Closing this PR because a new one is being created so that the workflow can run.
Related Comments: - #415 (comment) - #415 (comment) - #415 (comment) - #415 (comment)
## Note
Due to GitHub's restrictions on secret injection for PRs from forked repositories, a new PR needs to be created to replace the old one (#415). Please refer to the old PR for a detailed discussion.

## Summary
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation.
- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards `repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and `crcr-deploy-prod.yml`

**Notes:** This PR needs to wait until [this](pytorch/test-infra#7847) is merged, so that the tag field in the Terrafile can be updated.

## Architecture
GitHub App → Lambda webhook (Function URL) → `repository_dispatch` → downstream repos

**AWS Resources (us-east-1, account 391835788720):**
- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
- S3 backend for Terraform state

## Test
Multiple deployments and verifications have been completed in a personal AWS environment.

---------

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
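To make the role of the allowlist cache concrete, here is a minimal cache-aside sketch. It is an assumption-laden stand-in: it keeps the cached value in process memory rather than in ElastiCache Redis, and the class name `AllowlistCache`, the `fetch` callable, and the TTL are all hypothetical. In the deployed service, Redis would presumably play the role of `_entry` so that the cache is shared across Lambda invocations.

```python
import time
from typing import Callable, List, Optional, Tuple


class AllowlistCache:
    """Cache-aside wrapper around an allowlist fetcher (illustrative only)."""

    def __init__(self, fetch: Callable[[], List[str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # e.g. downloads the allowlist from a URL
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entry: Optional[Tuple[float, List[str]]] = None

    def get(self) -> List[str]:
        now = self._clock()
        # Serve the cached copy while it is still fresh.
        if self._entry is not None and now < self._entry[0]:
            return self._entry[1]
        # Otherwise refetch and remember when the new copy expires.
        repos = self._fetch()
        self._entry = (now + self._ttl, repos)
        return repos
```

The point of the pattern is that most webhook deliveries avoid a network fetch of the allowlist, at the cost of allowlist changes taking up to one TTL to be picked up.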