Add Cross Repository CI Relay (CRCR) infrastructure #415

Closed
fffrog wants to merge 6 commits into pytorch:main from fffrog:L1

Conversation

@fffrog
Collaborator

@fffrog fffrog commented Mar 28, 2026

Summary

Please refer to this comment for the overall implementation.

  • Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards repository_dispatch events to registered downstream repositories
  • Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
  • Add two GitHub Actions workflows: crcr-on-pr.yml and crcr-deploy-prod.yml

Notes:

This PR needs to wait until this (pytorch/test-infra#7847) is merged first, so that the tag field in the Terrafile can be updated.

Architecture

GitHub App → Lambda webhook (Function URL) → repository_dispatch → downstream repos
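
The last hop of this flow can be sketched as follows. This is only an illustration, not the relay's actual code: it builds (without sending) a request to GitHub's `POST /repos/{owner}/{repo}/dispatches` endpoint. The function name, the `event_type` value, the payload fields, and the example repository are all hypothetical.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_dispatch_request(downstream_repo, event_type, client_payload, token):
    """Build (but do not send) a repository_dispatch request for one downstream repo.

    downstream_repo is "owner/name"; token is assumed to be a GitHub App
    installation token that is allowed to dispatch to that repo.
    """
    body = json.dumps(
        {"event_type": event_type, "client_payload": client_payload}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{GITHUB_API}/repos/{downstream_repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical values, for illustration only.
req = build_dispatch_request(
    "example-org/example-backend",
    "upstream_pull_request",
    {"pr_number": 1, "head_sha": "abc123"},
    token="<installation-token>",
)
```

Passing `req` to `urllib.request.urlopen` would send it; a downstream workflow would then react via an `on: repository_dispatch` trigger that matches the chosen `event_type`.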

AWS Resources (us-east-1, account 391835788720):

  • Lambda function (cross_repo_ci_webhook) with Python 3.10 runtime
  • ElastiCache Redis replication group for allowlist caching
  • VPC with private subnets for Lambda ↔ Redis connectivity
  • IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
  • S3 backend for Terraform state
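
To illustrate what "allowlist caching" could look like, here is a minimal cache-aside sketch. It is an assumption about the general pattern, not the Lambda's real code: the key name, the TTL, the comma-separated encoding, and the `InMemoryCache` stand-in (used so the example runs without a Redis server) are all hypothetical. A real Redis client exposes comparable `get` and `setex` calls, but may return bytes rather than strings.

```python
import time


class InMemoryCache:
    """Stand-in for Redis that supports only get and set-with-expiry."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        value, expires_at = self._data.get(key, (None, 0.0))
        if value is None or time.monotonic() >= expires_at:
            self._data.pop(key, None)  # drop missing or expired entries
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


def get_allowlist(cache, fetch, key="crcr:allowlist", ttl_seconds=300):
    """Return the downstream allowlist, calling fetch() only on a cache miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached.split(",") if cached else []
    repos = fetch()  # e.g. download and parse the allowlist file
    cache.setex(key, ttl_seconds, ",".join(repos))
    return repos


# Usage: the second lookup is served from the cache.
calls = []


def fake_fetch():
    calls.append(1)
    return ["example-org/backend-a", "example-org/backend-b"]


cache = InMemoryCache()
first = get_allowlist(cache, fake_fetch)
second = get_allowlist(cache, fake_fetch)
```

The point of the cache is that a burst of webhook deliveries does not re-download the allowlist on every invocation; the TTL bounds how stale the cached copy can become.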

Test

Multiple deployments and verifications have been completed in a personal AWS environment.

Contributor

@ZainRizvi ZainRizvi left a comment

Hey @fffrog, thanks for setting this up! The overall structure looks good — clean separation of VPC, ElastiCache, Lambda, and IAM. Found a few issues that need fixing before this can deploy correctly though.

@fffrog fffrog force-pushed the L1 branch 4 times, most recently from 1529730 to dc11bf6 Compare April 1, 2026 13:00
Contributor

@ZainRizvi ZainRizvi left a comment

Good progress on the previous feedback! A few more things I noticed.

Contributor

@ZainRizvi ZainRizvi left a comment

Two more small things I noticed.

@fffrog
Collaborator Author

fffrog commented Apr 2, 2026

@ZainRizvi I'm so sorry for the inconvenience, I am brand new to AWS and Terraform.

However, I learned a lot from your comments. Thank you again.

@zxiiro
Collaborator

zxiiro commented Apr 2, 2026

> @ZainRizvi I'm so sorry for the inconvenience, I am brand new to AWS and Terraform.

@fffrog no need to feel sorry. Back and forth during code review is expected and is a valuable part of the process. Thanks for contributing to the project!

@fffrog
Collaborator Author

fffrog commented Apr 2, 2026

Hi @zxiiro, thank you a lot for your patience.

Contributor

@ZainRizvi ZainRizvi left a comment

Looking much better! And please don't feel bad about the PR feedback. Like @zxiiro said, the back and forth is normal. Just a few more bits of feedback.

ZainRizvi pushed a commit to pytorch/test-infra that referenced this pull request Apr 8, 2026
## Author

- @can-gaa-hou 
- @KarhouTam 

# Summary
Please refer to this
[comment](pytorch/rfcs#90 (comment))
for the overall implementation.


This PR implements the initial version of the cross-repository CI relay
described in [[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree
Backends](pytorch/rfcs#90).

The current implementation focuses on the first level defined in
the RFC:

- `L1`: downstream repos can be onboarded and triggered through the
relay

Higher-level behaviors for `L2`, `L3`, and `L4` are intentionally left
for follow-up work.

# Architecture

The relay is split into two AWS Lambda functions:

- `webhook_handler`
- [x] receives GitHub webhook PR and push events from the upstream repo
- [x] validates webhook signatures and authenticates with AWS Secrets
Manager
- [x] reads the downstream whitelist from the URL and stores it in Redis
- [x] for `create`/`reopen`/`synchronize` actions, forwards
repository_dispatch events to downstream repos
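
As an illustration of the signature-validation step, a minimal sketch follows. It assumes the standard GitHub scheme, where the `X-Hub-Signature-256` header carries `sha256=` followed by the hex HMAC-SHA256 of the raw request body, keyed with the webhook secret. The function name and the sample secret and body are hypothetical, and the real handler may differ.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_header) -> bool:
    """Return True only if signature_header matches the HMAC-SHA256 of body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


# Usage with made-up values.
secret = b"example-webhook-secret"
body = b'{"action": "synchronize"}'
good_header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

accepted = verify_signature(secret, body, good_header)
rejected = verify_signature(secret, body + b" ", good_header)
```

The check has to run on the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change the bytes and break the comparison.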

# Changes
```md
.github/workflows/
├── cross-repo-ci-relay-tests.yml  # CI workflow for cross-repo-ci-relay
└── _lambda-do-release-runners.yml # Add cross-repo-ci-relay release workflow

aws/lambda/cross_repo_ci_relay/
├── tests/                         # Unit tests for cross-repo-ci-relay
├── README.md                      # project overview, env vars, build/deploy, and callback usage
├── Makefile                       # build, package, deploy, and clean commands for Lambda
├── allowlist.py                   # Functions to handle the allowlist from GitHub
├── config.py                      # shared runtime config loading
├── utils.py                       # shared utility helpers and common exceptions
├── redis_helper.py                # Redis helpers for whitelist cache
├── lambda_function.py             # Lambda entrypoint for GitHub webhook requests
├── gh_helper.py                   # GitHub App / repository_dispatch client helpers
├── event_handler.py               # Functions to handle PR and push events
├── local_server.py                # For local tests, see README.md
└── requirements.txt               # Python dependencies for the webhook Lambda package
```

# Usage

See README.md for more details.

# Verification

We performed the following scenario verification on our AWS Lambda
instance:

- [x] Test with Upstream PR create/reopen/synchronize and push events
triggering webhook, then redispatching to the Downstream CI (different
organization) workflow.

# Terraform configuration

pytorch/ci-infra#415

# Unit Tests

- [x] Unit Tests (Mock)


cc @fffrog

---------

Co-authored-by: KarhouTam <karhou.tam@outlook.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>
@ZainRizvi
Contributor

This is ready to merge once the tag is updated and the merge commit with the main branch is resolved

@ZainRizvi
Contributor

@zxiiro would you be able to help @fffrog get the required secrets into the secret store?

@zxiiro
Collaborator

zxiiro commented Apr 8, 2026

> @zxiiro would you be able to help @fffrog get the secrets required into secret store?

Has the GitHub App already been created?

According to the RFC it should be created in the pytorch org. I don't have org-level administrative permissions there; I think only Meta folks have the permissions to do that at the org level?

> This GitHub App should be created under the pytorch organization and owned by the PyTorch team or the LF AI & Data Foundation team, to ensure credibility. An App created by a third party will face trust issues during installation and adoption.

Once the app is created though I can definitely help with adding the secrets to the ci-infra repo before we merge this PR.

fffrog and others added 5 commits April 9, 2026 10:25
**Summary**:

- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards `repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and `crcr-deploy-prod.yml`

**Architecture**:

GitHub App → Lambda webhook (Function URL) → `repository_dispatch` → downstream repos

- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group (`cache.t3.small`) for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
- S3 backend for Terraform state

**Test**:

Multiple deployments and verifications have been completed in a personal AWS environment.

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
- Fix several ElastiCache bugs
- Update README.md to match the CRCR code
- Create Secrets Manager resources via Terraform rather than manually
- Move REDIS_LOGIN from environment variables to Secrets Manager
- Rename Terraform labels and names to be clearer and easier to understand
@fffrog
Collaborator Author

fffrog commented Apr 9, 2026

> This is ready to merge once the tag is updated and the merge commit with the main branch is resolved

Hi @ZainRizvi, thanks! I’ve rebased onto main and updated the tag.

@fffrog
Collaborator Author

fffrog commented Apr 9, 2026

@zxiiro Thank you a lot.

> Has the GitHub App already been created?
> According to the RFC it should be created in the pytorch org. Which I don't have access to the org level administrative permissions, I think only Meta folks have the permissions at the org level to do that?

I’m afraid not — the GitHub App needs to be created first.
@ZainRizvi, could you help with this or point us to someone who can? We can provide a detailed guide on how to create the GitHub App.

> Once the app is created though I can definitely help with adding the secrets to the ci-infra repo before we merge this PR.

Thank you very much, @zxiiro.

@zxiiro
Collaborator

zxiiro commented Apr 9, 2026

Alright, after talking to @fffrog we agreed that it should be possible to create the app in the pytorch-fdn org, where I do have permission to create apps. I went ahead and created the app pytorch-fdn-cross-repo-ci-relay, which has been installed into the pytorch/pytorch repo and is limited to that repo only.

@fffrog your doc mentioned granting "Actions" the "Read & Write" permission, but it's unclear to me whether the app actually needs that. Will we be using it in the future? If it isn't needed, maybe we can reduce the permissions.

Only open item now is I need to update the Webhook URL once the Lambda is created. If everyone agrees / is ready I think we should be good to go to merge this PR to get the lambdas created.

@fffrog
Collaborator Author

fffrog commented Apr 9, 2026

> @fffrog your doc mentioned to grant permissions for "Actions" with "Read & Write" permissions but its unclear to me if the app actually needs that. Will we be utilizing that in future? If not needed maybe we can reduce the permissions.

Hi @zxiiro, thank you a lot for creating the GitHub App and approving the PR.

Yes, we need Actions with 'Read & Write' because, starting from L3, we will need to rerun failed actions in downstream repos from the PyTorch side. However, the design of L3 is not finalized yet, so we can remove the permission for now and add it back if we really need it in the future.

As the title stated.
@fffrog
Collaborator Author

fffrog commented Apr 10, 2026

Hi @ZainRizvi @zxiiro,

The workflow failed yesterday [1][2], so I’ve pushed some updates to address it:

  • Referenced arc settings and added crcr to .checkov.yml to fix [1].

Regarding [2], the root cause seems to be that GitHub restricts Secret injection for PRs originating from a forked repository.

To resolve this, would it be possible for me to create a branch directly within pytorch/ci-infra? I'm not sure if I have the necessary write access. If not, could you advise on the best way to handle Secret-dependent workflows for external contributors?
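
For context on the fork restriction described above: GitHub does not pass repository secrets to workflows triggered by `pull_request` events from forks. A rough sketch of how a script could tell such PRs apart from same-repo PRs is shown below. It assumes the documented shape of the `pull_request` webhook payload (`head.repo.full_name` and `base.repo.full_name`); the function name and sample payloads are hypothetical.

```python
def is_fork_pr(event: dict) -> bool:
    """Return True if the PR's head branch lives outside the base repository."""
    pr = event["pull_request"]
    head_repo = pr["head"].get("repo")
    if head_repo is None:
        # The head repo can be null when a fork has been deleted;
        # treat that case as a fork.
        return True
    return head_repo["full_name"] != pr["base"]["repo"]["full_name"]


# Made-up payloads showing both cases.
same_repo_event = {
    "pull_request": {
        "head": {"repo": {"full_name": "pytorch/ci-infra"}},
        "base": {"repo": {"full_name": "pytorch/ci-infra"}},
    }
}
fork_event = {
    "pull_request": {
        "head": {"repo": {"full_name": "example-user/ci-infra"}},
        "base": {"repo": {"full_name": "pytorch/ci-infra"}},
    }
}
```

With these payloads, `is_fork_pr(same_repo_event)` is false and `is_fork_pr(fork_event)` is true, which is the distinction that decides whether secret-dependent steps can run.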

@zxiiro
Collaborator

zxiiro commented Apr 10, 2026

> Regarding [2], the root cause seems to be that GitHub restricts Secret injection for PRs originating from a forked repository.
>
> To resolve this, would it be possible for me to create a branch directly within pytorch/ci-infra? I'm not sure if I have the necessary write access. If not, could you advise on the best way to handle Secret-dependent workflows for external contributors?

Yes, we do require folks working on ci-infra to be able to write directly to the repo. I've sent you an invite to join the repo with "write" access. Once you accept you should be able to push to a branch directly in ci-infra rather than your fork.

@fffrog
Collaborator Author

fffrog commented Apr 10, 2026

> Yes, we do require folks working on ci-infra to be able to write directly to the repo. I've sent you an invite to join the repo with "write" access. Once you accept you should be able to push to a branch directly in ci-infra rather than your fork.

Hi @zxiiro, thank you a lot for inviting me as a collaborator. I have accepted the invite and will push a commit and create a new PR. Thank you again.

@fffrog
Collaborator Author

fffrog commented Apr 10, 2026

Closing this PR because a new one is being created so that the workflow can run.

@fffrog fffrog closed this Apr 10, 2026
@fffrog fffrog mentioned this pull request Apr 14, 2026
13 tasks
github-merge-queue bot pushed a commit that referenced this pull request Apr 14, 2026
## Note

Because GitHub restricts secret injection for PRs that originate from
forks, a new PR had to be created to replace the old one
(#415). Please refer to the old
PR for the detailed discussion.

## Summary

Please refer to this
[comment](pytorch/rfcs#90 (comment))
for the overall implementation.

- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a
GitHub webhook relay service for PyTorch out-of-tree backends that
receives upstream webhook events via a GitHub App and forwards
`repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler),
ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and
Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and
`crcr-deploy-prod.yml`

**Notes:**

This PR needs to wait until
[this](pytorch/test-infra#7847) is merged first, so that
the tag field in the Terrafile can be updated.

## Architecture

GitHub App → Lambda webhook (Function URL) → `repository_dispatch` →
downstream repos

**AWS Resources (us-east-1, account 391835788720):**
- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs
permissions
- S3 backend for Terraform state

## Test

Multiple deployments and verifications have been completed in a personal
AWS environment.

---------

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>