
[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends #90

Merged
albanD merged 3 commits into pytorch:master from fffrog:relay on Mar 27, 2026

@fffrog
Contributor

@fffrog fffrog commented Mar 10, 2026

This RFC has been under discussion for several weeks; you can visit this link to see the previous discussions if you are interested.

And thanks a lot @ZainRizvi, @seemethere, @afrittoli, @zxiiro, @mikaylagawarecki, and @jewelkm89 for the valuable suggestions.

Click here to see a preview of this RFC.

Contributor

@albanD albanD left a comment


Great proposal!

Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and don't block the first steps!

Comment thread RFC-0050-Cross-Repository-CI-Relay-for-PyTorch-Out-of-Tree-Backends.md Outdated

The allowlist is designed to naturally support gradual progression from experimental participation to mature participation. The table below lists the requirements for advancing to each level.

| Phase | Level | Requirements |
Contributor


For the requirements, it might be helpful to separate Infra availability vs legitimate test breakage.

I think the requirement we want to have here is both:

  • Very strong requirement on Infra availability.
  • More relaxed requirement on test breakage

We want to encourage both for sure, but they will be managed very differently so we most likely want to provide signal to backend writers independently.
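
As a rough illustration of reporting the two dimensions independently, here is a minimal Python sketch. The `RunResult` type, its fields, and `backend_signals` are hypothetical names invented for this example; nothing here is taken from the RFC or the relay code. The idea is simply that test pass rate is computed only over runs where the infrastructure was available, so an infra outage does not get counted as test breakage.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One downstream CI run (hypothetical shape, for illustration only)."""
    infra_ok: bool      # the runner came up and the job ran to completion
    tests_passed: bool  # only meaningful when infra_ok is True


def backend_signals(runs):
    """Return (infra_availability, test_pass_rate) as separate fractions.

    Either value is None when there is no data to compute it from.
    """
    total = len(runs)
    healthy = [r for r in runs if r.infra_ok]
    availability = len(healthy) / total if total else None
    # Test breakage is judged only on runs where the infra was healthy.
    pass_rate = (
        sum(r.tests_passed for r in healthy) / len(healthy) if healthy else None
    )
    return availability, pass_rate
```

With this split, a strict threshold could be applied to the first value and a more relaxed one to the second.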

Contributor Author


Excellent point. Will update the requirements to separate the two dimensions.

@fffrog
Contributor Author

fffrog commented Mar 18, 2026

Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and don't block the first steps!

Hi @albanD, so happy to get your approval, thank you.

The initial code for L1 and L2 is complete, and I will submit a PR soon. I'll let you know when it's finished.

@fffrog
Contributor Author

fffrog commented Mar 18, 2026

Hey @albanD, the new commit is ready, please help to review it again, thank you.

@ZainRizvi
Contributor

@fffrog @can-gaa-hou, a couple of follow-up clarifications after looking at pytorch/test-infra#7847, in addition to the comments left on the PR:

  1. Let's split out implementation into separate PRs for Phase 1 and Phase 2, to keep the added complexity of phase 2 from blocking phase 1 support

  2. Data flow for Phase 1 should be Github webhook -> AWS Lambda -> receiving repos.

Phase 2 should expose:

  1. A receiver lambda on AWS that just validates the incoming request and forwards it to HUD. We add this hop because requests to HUD require an extra security header to avoid bot spam, and this hop will let us avoid having to share that header broadly (and let us rotate it quickly in the future if needed).

  2. An endpoint on HUD (not an AWS lambda) to receive CI results. HUD remains the central processing location for any ClickHouse data

  3. In keeping with the principle that ClickHouse data is updated through HUD, when we want to log in ClickHouse that a run has been started (for Phase 2 repos), we should make an async call to HUD from the AWS Lambda.
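
A minimal Python sketch of the Phase 2 receiver hop described above, under stated assumptions: the thread does not say how the incoming request is validated, so an HMAC-SHA256 signature over the body is assumed here purely for illustration. The header name, the URL, and the function names are all hypothetical placeholders. The sketch only builds the forwarded request; it does not send it.

```python
import hashlib
import hmac
import json

# Hypothetical names: neither the header nor the URL comes from the RFC.
HUD_HEADER_NAME = "x-hud-internal-secret"
HUD_RESULTS_URL = "https://hud.example.invalid/api/crcr/results"


def verify_callback(body: bytes, signature: str, shared_key: bytes) -> bool:
    """Check an (assumed) hex HMAC-SHA256 signature over the raw body."""
    expected = hmac.new(shared_key, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


def build_hud_request(body: bytes, signature: str, shared_key: bytes, hud_secret: str):
    """Validate the downstream callback, then describe the request to HUD.

    Returns None for invalid callbacks. The HUD secret is attached only
    here, so downstream repos never need to know it and it can be rotated
    in one place.
    """
    if not verify_callback(body, signature, shared_key):
        return None
    payload = json.loads(body)
    return {
        "url": HUD_RESULTS_URL,
        "headers": {
            HUD_HEADER_NAME: hud_secret,
            "content-type": "application/json",
        },
        "body": json.dumps(payload),
    }
```

The design point this illustrates is that the extra hop keeps the HUD security header inside the Lambda rather than distributing it to every downstream repository.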

Contributor

@albanD albanD left a comment


Sounds good to me!
Also I agree with Zain about splitting to make sure we can land L1 asap!

@fffrog
Contributor Author

fffrog commented Mar 25, 2026

Also I agree with Zain about splitting to make sure we can land L1 asap!

@albanD, thank you for your approval. We have been quite busy lately, but we will do our best to complete L1 by March 28th, and sorry again for the delay.

@fffrog
Contributor Author

fffrog commented Mar 25, 2026

Hey @ZainRizvi

Let's split out implementation into separate PRs for Phase 1 and Phase 2, to keep the added complexity of phase 2 from blocking phase 1 support

Got it, thank you. We will implement the L1 plan as soon as possible. We sincerely apologize for any delays caused by our busy internal affairs.

Data flow for Phase 1 should be Github webhook -> AWS Lambda -> receiving repos.

Gotcha, thank you.

Phase 2 should expose:

  1. A receiver lambda on AWS that just validates the incoming request and forwards it to HUD. We add this hop because requests to HUD require an extra security header to avoid bot spam, and this hop will let us avoid having to share that header broadly (and let us rotate it quickly in the future if needed).

  2. An endpoint on HUD (not an AWS lambda) to receive CI results. HUD remains the central processing location for any ClickHouse data

  3. Keeping with the principle of getting ClickHouse data to be updated through HUD, when we want to log into ClickHouse that a run has been started (for phase 2 repos) we should make an async call to HUD from the AWS lambda.

Understood, thank you for the detailed explanation. We will start L2 development once L1 is completed.

@albanD
Contributor

albanD commented Mar 27, 2026

Merging!
We can do updates and follow-ups via other PRs if needed.

@albanD albanD merged commit e452220 into pytorch:master Mar 27, 2026
1 check passed
@fffrog
Contributor Author

fffrog commented Mar 28, 2026

All PRs for L1 of this RFC

To implement L1 of this RFC, we have submitted the following three PRs across different repositories to set up the necessary infrastructure and features:

These PRs are interconnected and collectively fulfill the implementation outlined in the RFC.

cc @albanD @ZainRizvi @zxiiro

fffrog added a commit to pytorch/pytorch that referenced this pull request Mar 28, 2026
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation.

The PyTorch repository should be the preferred location for storing the allowlist.yml file.

Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed.

ghstack-source-id: a936347
Pull-Request: #178681
fffrog added a commit to pytorch/pytorch that referenced this pull request Mar 30, 2026
ghstack-source-id: b187886
Pull-Request: #178681
fffrog added a commit to pytorch/pytorch that referenced this pull request Mar 30, 2026
ghstack-source-id: 90e2e71
Pull-Request: #178681
fffrog added a commit to pytorch/pytorch that referenced this pull request Apr 1, 2026
ghstack-source-id: 052e046
Pull-Request: #178681
ZainRizvi pushed a commit to pytorch/test-infra that referenced this pull request Apr 8, 2026
## Author

- @can-gaa-hou 
- @KarhouTam 

# Summary
Please refer to this
[comment](pytorch/rfcs#90 (comment))
for the overall implementation.


This PR implements the initial version of the cross-repository CI relay
described in [[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree
Backends](pytorch/rfcs#90).

The current implementation focuses on the first level defined in
the RFC:

- `L1`: downstream repos can be onboarded and triggered through the
relay

Higher-level behaviors for `L2`, `L3`, and `L4` are intentionally left
for follow-up work.

# Architecture

The relay is designed as two AWS Lambda functions; only `webhook_handler` is described here:

- `webhook_handler`
  - [x] receives GitHub webhook PR and push events from the upstream repo
  - [x] validates webhook signatures and authenticates with AWS Secrets Manager
  - [x] reads the downstream allowlist from the URL and stores it in Redis
  - [x] for `opened`/`reopened`/`synchronize` actions, forwards `repository_dispatch` events to downstream repos
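
The handler steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code from this PR: the function names and the `event_type` value are assumptions made for the example. What it relies on from GitHub itself is the `X-Hub-Signature-256` header (an HMAC-SHA256 of the raw body, prefixed with `sha256=`), the `pull_request` action names `opened`, `reopened`, and `synchronize`, and the `POST /repos/{owner}/{repo}/dispatches` endpoint, which takes `event_type` and `client_payload`.

```python
import hashlib
import hmac
import json

# GitHub's own names for the pull_request actions that should be relayed.
RELAYED_PR_ACTIONS = {"opened", "reopened", "synchronize"}


def signature_is_valid(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Validate the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header_value)


def build_dispatches(event_name: str, payload: dict, allowlist: list):
    """Return (url, json_body) pairs, one per allowlisted downstream repo.

    `allowlist` is a list of "owner/repo" strings. The event_type naming
    below is hypothetical.
    """
    if event_name == "pull_request":
        if payload.get("action") not in RELAYED_PR_ACTIONS:
            return []
        sha = payload["pull_request"]["head"]["sha"]
    elif event_name == "push":
        sha = payload["after"]
    else:
        return []
    body = {
        "event_type": "pytorch-" + event_name,  # hypothetical naming
        "client_payload": {"sha": sha},
    }
    return [
        (f"https://api.github.com/repos/{repo}/dispatches", body)
        for repo in allowlist
    ]
```

The sketch stops at building the requests; sending them would additionally need a GitHub App installation token for each downstream repository.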

# Changes
```md
.github/workflows/
├── cross-repo-ci-relay-tests.yml  # CI workflow for cross-repo-ci-relay
└── _lambda-do-release-runners.yml  # Add cross-repo-ci-relay release workflow

aws/lambda/cross_repo_ci_relay/
├── tests/                         # Unit tests for cross-repo-ci-relay
├── README.md                      # project overview, env vars, build/deploy, and callback usage
├── Makefile                       # build, package, deploy, and clean commands for Lambda
├── allowlist.py                   # Functions to handle the allowlist from GitHub
├── config.py                      # shared runtime config loading
├── utils.py                       # shared utility helpers and common exceptions
├── redis_helper.py                # Redis helpers for allowlist cache
├── lambda_function.py             # Lambda entrypoint for GitHub webhook requests
├── gh_helper.py                   # GitHub App / repository_dispatch client helpers
├── event_handler.py               # Functions to handle PR and push events
├── local_server.py                # For local tests, see README.md
└── requirements.txt               # Python dependencies for the webhook Lambda package
```

# Usage

See README.md for more details.

# Verification

We performed the following scenario verification on our AWS Lambda
instance:

- [x] Tested upstream PR create/reopen/synchronize and push events
triggering the webhook, which then redispatches to the downstream CI
workflow (in a different organization).

# Terraform configuration

pytorch/ci-infra#415

# Unit Tests

- [x] Unit Tests (Mock)


cc @fffrog

---------

Co-authored-by: KarhouTam <karhou.tam@outlook.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 8, 2026
Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation.

The PyTorch repository should be the preferred location for storing the allowlist.yml file.

Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed.
Pull Request resolved: #178681
Approved by: https://github.com/ZainRizvi, https://github.com/albanD
- **Upstream/downstream decoupling:** Downstream repos **only need to install this App to join the cross-repo CI coordination**. Downstream repos do not need an upstream token, and the upstream does not need to know about the downstream. All interactions are bridged through the GitHub App and Relay Server.

> \[!NOTE\]
> This GitHub App should be created under the `pytorch` organization and owned by the PyTorch team or the LF AI & Data Foundation team, to ensure credibility. An App created by a third party will face trust issues during installation and adoption.

"LF AI & Data Foundation" should be the "LF PyTorch Foundation" team.

LF AI & Data Foundation is a different foundation under the LF, separate from the PyTorch Foundation.

@fffrog fffrog mentioned this pull request Apr 14, 2026
13 tasks
github-merge-queue bot pushed a commit to pytorch/ci-infra that referenced this pull request Apr 14, 2026
## Note

Due to GitHub's restrictions on secret injection for PRs from forked
repositories, a new PR had to be created to replace the old one
(#415). Please refer to the old
PR for a detailed discussion.

## Summary

Please refer to this
[comment](pytorch/rfcs#90 (comment))
for the overall implementation.

- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a
GitHub webhook relay service for PyTorch out-of-tree backends that
receives upstream webhook events via a GitHub App and forwards
`repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler),
ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and
Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and
`crcr-deploy-prod.yml`

**Notes:**

This PR needs to wait for
[this](pytorch/test-infra#7847) to be merged first, so that
the tag field in the Terrafile can be updated.

## Architecture

GitHub App → Lambda webhook (Function URL) → `repository_dispatch` →
downstream repos

**AWS Resources (us-east-1, account 391835788720):**
- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs
permissions
- S3 backend for Terraform state

## Test

Multiple deployments and verifications have been completed in a personal
AWS environment.

---------

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
4 participants