[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends by fffrog · Pull Request #90 · pytorch/rfcs

fffrog · 2026-03-10T07:00:25Z

This RFC has been under discussion for several weeks. You can visit this link to see the previous discussions if you are interested.

And thanks a lot @ZainRizvi, @seemethere, @afrittoli, @zxiiro, @mikaylagawarecki, and @jewelkm89 for the valuable suggestions.

Click here to see a preview of this RFC.


albanD

Great proposal!

Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and don't block the first steps!

albanD · 2026-03-13T19:08:15Z

+
+The allowlist is designed to naturally support gradual progression from experimental participation to mature participation. The table below lists the requirements for advancing to each level.
+
+| Phase | Level | Requirements |


For the requirements, it might be helpful to separate Infra availability vs legitimate test breakage.

I think the requirement we want to have here is both:

- Very strong requirement on Infra availability.
- More relaxed requirement on test breakage.

We want to encourage both for sure, but they will be managed very differently so we most likely want to provide signal to backend writers independently.
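One way to read this suggestion is to track two separate numbers per backend instead of a single pass rate. The sketch below is only an illustration of that idea: the `Outcome` categories, the `BackendSignal` fields, and the threshold values are hypothetical and are not taken from the RFC.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"                # job ran and tests passed
    TEST_FAILURE = "test_failure"    # job ran, but tests failed
    INFRA_FAILURE = "infra_failure"  # job could not run (runner or infra problem)


@dataclass(frozen=True)
class BackendSignal:
    infra_availability: float  # fraction of runs where the infra worked
    test_pass_rate: float      # fraction of completed runs whose tests passed


def summarize(outcomes: list[Outcome]) -> BackendSignal:
    """Compute the two signals independently from a list of run outcomes."""
    counts = Counter(outcomes)
    total = len(outcomes)
    completed = counts[Outcome.PASSED] + counts[Outcome.TEST_FAILURE]
    infra = completed / total if total else 0.0
    tests = counts[Outcome.PASSED] / completed if completed else 0.0
    return BackendSignal(infra_availability=infra, test_pass_rate=tests)


def meets_requirements(
    signal: BackendSignal,
    min_infra: float = 0.99,  # hypothetical "very strong" infra requirement
    min_tests: float = 0.90,  # hypothetical "more relaxed" test requirement
) -> bool:
    """Check both dimensions against their own thresholds."""
    return (
        signal.infra_availability >= min_infra
        and signal.test_pass_rate >= min_tests
    )
```

Because infra failures are excluded from the denominator of `test_pass_rate`, an unreliable runner does not count against the backend's test quality, and vice versa.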

Excellent point. Will update the requirements to separate the two dimensions.

fffrog · 2026-03-18T02:05:26Z

Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and don't block the first steps!

Hi @albanD, so happy to get your approval, thank you.

The initial code for L1 and L2 is complete, and I will submit a PR soon. I'll let you know when it's finished.


fffrog · 2026-03-18T13:15:21Z

Hey @albanD, the new commit is ready, please help to review it again, thank you.

ZainRizvi · 2026-03-24T15:41:06Z

@fffrog @can-gaa-hou, a couple of follow-up clarifications after looking at pytorch/test-infra#7847, in addition to the comments left on the PR:

- Let's split out implementation into separate PRs for Phase 1 and Phase 2, to keep the added complexity of Phase 2 from blocking Phase 1 support.
- Data flow for Phase 1 should be GitHub webhook -> AWS Lambda -> receiving repos.
- Phase 2 should expose:
  - A receiver lambda on AWS that just validates the incoming request and forwards it to HUD. We add this hop because requests to HUD require an extra security header to avoid bot spam, and this hop will let us avoid having to share that header broadly (and let us rotate it quickly in the future if needed).
  - An endpoint on HUD (not an AWS lambda) to receive CI results. HUD remains the central processing location for any ClickHouse data.
- Keeping with the principle of getting ClickHouse data to be updated through HUD, when we want to log into ClickHouse that a run has been started (for Phase 2 repos) we should make an async call to HUD from the AWS lambda.
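The Phase 2 receiver hop described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the header name, the endpoint URL, the required fields, and the `build_hud_request` helper are all hypothetical. The function only builds the outgoing request, so the secret header is attached in one place and never has to be shared with downstream repos.

```python
import json

# All names below are hypothetical placeholders for illustration.
HUD_ENDPOINT = "https://hud.example.invalid/api/crcr/results"
HUD_SECRET_HEADER = "x-hud-internal-secret"
REQUIRED_FIELDS = ("repo", "run_id", "status")


class InvalidRequest(ValueError):
    """Raised when an incoming downstream request fails validation."""


def build_hud_request(raw_body: str, allowlist: set[str], hud_secret: str) -> dict:
    """Validate a downstream CI result and build the request to forward to HUD."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise InvalidRequest("body is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise InvalidRequest("body must be a JSON object")

    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise InvalidRequest(f"missing fields: {', '.join(missing)}")
    if payload["repo"] not in allowlist:
        raise InvalidRequest(f"repo not in allowlist: {payload['repo']}")

    # The security header is added only here, inside the receiver hop,
    # so rotating it requires no change in any downstream repo.
    return {
        "url": HUD_ENDPOINT,
        "headers": {
            "Content-Type": "application/json",
            HUD_SECRET_HEADER: hud_secret,
        },
        "body": json.dumps(payload),
    }
```

A real receiver would then send the returned request to HUD (for example with `urllib.request`) and map `InvalidRequest` to an HTTP 4xx response.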

albanD

Sounds good to me!
Also I agree with Zain about splitting to make sure we can land L1 asap!

fffrog · 2026-03-25T13:03:17Z

Also I agree with Zain about splitting to make sure we can land L1 asap!

@albanD, thank you for your approval. We have been quite busy lately, but we will do our best to complete L1 by March 28th, and sorry again for the delay.

fffrog · 2026-03-25T13:08:00Z

Hey @ZainRizvi

Let's split out implementation into separate PRs for Phase 1 and Phase 2, to keep the added complexity of phase 2 from blocking phase 1 support

Got it, thank you. We will implement the L1 plan as soon as possible. We sincerely apologize for any delays caused by our busy internal affairs.

Data flow for Phase 1 should be Github webhook -> AWS Lambda -> receiving repos.

Gotcha, thank you.

Phase 2 should expose:

- A receiver lambda on AWS that just validates the incoming request and forwards it to HUD. We add this hop because requests to HUD require an extra security header to avoid bot spam, and this hop will let us avoid having to share that header broadly (and let us rotate it quickly in the future if needed).
- An endpoint on HUD (not an AWS lambda) to receive CI results. HUD remains the central processing location for any ClickHouse data.

Keeping with the principle of getting ClickHouse data to be updated through HUD, when we want to log into ClickHouse that a run has been started (for Phase 2 repos) we should make an async call to HUD from the AWS lambda.

Understood, thank you for the detailed explanation. We will start the L2 development once L1 is completed.

albanD · 2026-03-27T15:19:07Z

Merging!
We can do updates and follow-ups via other PRs if needed.

fffrog · 2026-03-28T13:14:08Z

All PRs for L1 of this RFC

To implement L1 of this RFC, we have submitted the following four PRs across different repositories to set up the necessary infrastructure and features:

- Add allowlist.yml file for Cross Repo CI Relay pytorch#178681: Added `allowlist.yml` to define dispatch information.
- Add GitHub Action for checking out upstream PyTorch PRs pytorch#178750: Added a new action named `checkout-upstream-pr` to facilitate CI integration for downstream repos.
- Add Cross Repository CI Relay (CRCR) infrastructure ci-infra#415: Provisioned the required Terraform infrastructure to support deployment.
- Implement initial L1 cross-repo CI relay test-infra#7847: Implemented the core webhook logic for processing events.

These PRs are interconnected and collectively fulfill the implementation outlined in the RFC.

cc @albanD @ZainRizvi @zxiiro

Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. ghstack-source-id: a936347 Pull-Request: #178681


@can-gaa-hou

## Author

- @can-gaa-hou
- @KarhouTam

# Summary

Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation.

This PR implements the initial version of the cross-repository CI relay described in [[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends](pytorch/rfcs#90).

The current implementation focuses on the first level defined in the RFC:

- `L1`: downstream repos can be onboarded and triggered through the relay

Higher-level behaviors for `L2`, `L3`, and `L4` are intentionally left for follow-up work.

# Architecture

In this L1 version, the relay runs as a single AWS Lambda function:

- `webhook_handler`
  - [x] receives GitHub webhook PR and push events from the upstream repo
  - [x] validates webhook signatures and authenticates with AWS Secrets Manager
  - [x] reads the downstream allowlist from the URL and stores it in Redis
  - [x] for `create`/`reopen`/`synchronize` actions, forwards `repository_dispatch` events to downstream repos

# Changes

```md
.github/workflows/
├── cross-repo-ci-relay-tests.yml     # CI workflow for cross-repo-ci-relay
└── _lambda-do-release-runners.yml    # Add cross-repo-ci-relay release workflow
aws/lambda/cross_repo_ci_relay/
├── tests/                # Unit tests for cross-repo-ci-relay
├── README.md             # project overview, env vars, build/deploy, and callback usage
├── Makefile              # build, package, deploy, and clean commands for Lambda
├── allowlist.py          # Functions to handle the allowlist from GitHub
├── config.py             # shared runtime config loading
├── utils.py              # shared utility helpers and common exceptions
├── redis_helper.py       # Redis helpers for allowlist cache
├── lambda_function.py    # Lambda entrypoint for GitHub webhook requests
├── gh_helper.py          # GitHub App / repository_dispatch client helpers
├── event_handler.py      # Functions to handle PR and push events
├── local_server.py       # For local tests, see README.md
└── requirements.txt      # Python dependencies for the webhook Lambda package
```

# Usage

See README.md for more details.

# Verification

We performed the following scenario verification on our AWS Lambda instance:

- [x] Test with upstream PR create/reopen/synchronize and push events triggering the webhook, then redispatching to the downstream CI workflow (in a different organization).

# Terraform configuration

pytorch/ci-infra#415

# Unit Tests

- [x] Unit Tests (Mock)

cc @fffrog

---------

Co-authored-by: KarhouTam <karhou.tam@outlook.com>
Co-authored-by: fffrog <ljw1101.vip@gmail.com>
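The two core steps of the webhook handler described above (signature validation, then fan-out via `repository_dispatch`) might look roughly like this. It is a sketch under assumptions rather than the code in the PR: the function names, the `pytorch-pr` event type, and the `client_payload` fields are hypothetical. The `X-Hub-Signature-256` scheme and the `/repos/{owner}/{repo}/dispatches` endpoint are GitHub's documented ones, and the action names used here are the values GitHub sends in pull request payloads (`opened`, `reopened`, `synchronize`).

```python
import hashlib
import hmac
import json

# Pull request actions that should trigger a downstream run.
RELAYED_ACTIONS = {"opened", "reopened", "synchronize"}


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check the X-Hub-Signature-256 header against an HMAC-SHA256 of the raw body."""
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = f"sha256={digest}"
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def build_dispatches(event: dict, allowlist: list[str]) -> list[dict]:
    """Build one repository_dispatch request per allowlisted downstream repo."""
    if event.get("action") not in RELAYED_ACTIONS:
        return []
    pr = event["pull_request"]
    # Hypothetical payload shape; the real relay may send different fields.
    client_payload = {"pr_number": pr["number"], "head_sha": pr["head"]["sha"]}
    return [
        {
            "url": f"https://api.github.com/repos/{repo}/dispatches",
            "json": {"event_type": "pytorch-pr", "client_payload": client_payload},
        }
        for repo in allowlist
    ]
```

Keeping request construction separate from the HTTP call makes the relay logic easy to unit test with mocked events, which matches the mock-based unit tests mentioned in the PR.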

Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation. The PyTorch repository should be the preferred location for storing the allowlist.yml file. Downstream repositories are essentially extensions (or plugins) of the PyTorch repository, so they should only need to be aware of PyTorch itself. Infrastructure repositories such as test-infra and ci-infra should remain transparent to downstream repositories and not be directly exposed. Pull Request resolved: #178681 Approved by: https://github.com/ZainRizvi, https://github.com/albanD

zxiiro · 2026-04-08T15:45:03Z

+- **Upstream/downstream decoupling:** Downstream repos **only need to install this App to join the cross-repo CI coordination**. Downstream repos do not need an upstream token, and the upstream does not need to know about the downstream. All interactions are bridged through the GitHub App and Relay Server.
+
+> \[!NOTE\]
+> This GitHub App should be created under the `pytorch` organization and owned by the PyTorch team or the LF AI & Data Foundation team, to ensure credibility. An App created by a third party will face trust issues during installation and adoption.


LF AI & Data Foundation should be "LF PyTorch Foundation" team.

LF AI & Data Foundation is a different foundation under the LF, separate from the PyTorch Foundation.

## Note

Due to the restrictions on secret injection for forked repositories on GitHub, a new PR needs to be created to replace the old one (#415). Please refer to the old PR for a detailed discussion.

## Summary

Please refer to this [comment](pytorch/rfcs#90 (comment)) for the overall implementation.

- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards `repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and `crcr-deploy-prod.yml`

**Notes:** This PR needs to wait until [this](pytorch/test-infra#7847) is merged, so that the tag field in the Terrafile can be updated.

## Architecture

GitHub App → Lambda webhook (Function URL) → `repository_dispatch` → downstream repos

**AWS Resources (us-east-1, account 391835788720):**

- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
- S3 backend for Terraform state

## Test

Multiple deployments and verifications have been completed in a personal AWS environment.

---------

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
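The allowlist caching role that Redis plays in this architecture can be illustrated with a small time-to-live cache. This sketch uses an in-memory stand-in rather than Redis, and the class name, the 300-second TTL, and the injected `fetch` callable are assumptions made for illustration only.

```python
import time
from typing import Callable, Optional


class AllowlistCache:
    """Cache the downstream allowlist and refetch it after a TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], list[str]],
        ttl_seconds: float = 300.0,  # hypothetical refresh interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[list[str]] = None
        self._expires_at = 0.0

    def get(self) -> list[str]:
        """Return the cached allowlist, refetching it if missing or expired."""
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

With a cache like this, most webhook deliveries avoid re-downloading `allowlist.yml`, at the cost of allowlist changes taking up to one TTL to become visible.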

add new RFC for PyTorch to enable CI relay from PyTorch repo to downs…

94e303e

…trean repo

meta-cla bot added the cla signed label Mar 10, 2026

fffrog mentioned this pull request Mar 10, 2026

[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends pytorch/pytorch#175022

Open

update workflow reference in downstream

e265fa6

albanD reviewed Mar 13, 2026


This was referenced Mar 18, 2026

[WIP] Implement initial L1/L2 cross-repo CI relay cosdt/test-infra#11

Closed

Implement initial L1 cross-repo CI relay pytorch/test-infra#7847

Merged

update RFC named Cross-Repository CI Relay for PyTorch Out-of-Tree Ba…

5e13847

…ckends

albanD approved these changes Mar 24, 2026


albanD merged commit e452220 into pytorch:master Mar 27, 2026
1 check passed

This was referenced Mar 28, 2026

Add allowlist.yml file for Cross Repo CI Relay pytorch/pytorch#178681

Closed

Add Cross Repository CI Relay (CRCR) infrastructure pytorch/ci-infra#415

Closed

KarhouTam mentioned this pull request Mar 30, 2026

Add GitHub Action for checking out upstream PyTorch PRs pytorch/pytorch#178750

Closed

ZainRizvi mentioned this pull request Apr 8, 2026

L1 to L4 design for Cross-Repository CI Relay pytorch/test-infra#7937

Open

zxiiro reviewed Apr 8, 2026


KarhouTam mentioned this pull request Apr 10, 2026

Cross-Repo-CI-Relay L2 implementation & L1 refactor cosdt/test-infra#41

Closed

fffrog mentioned this pull request Apr 10, 2026

Add Cross Repository CI Relay (CRCR) infrastructure(L1 Only) pytorch/ci-infra#433

Merged

can-gaa-hou mentioned this pull request Apr 14, 2026

Implement initial L2 for CRCR cosdt/test-infra#43

Open

13 tasks

fffrog mentioned this pull request Apr 14, 2026

.... cosdt/test-infra#45

Open

13 tasks

KarhouTam mentioned this pull request Apr 14, 2026

[WIP][CRCR] Initial implementation of L2 pytorch/test-infra#7967

Draft

12 tasks

