Add Cross Repository CI Relay (CRCR) infrastructure (L1 Only) #433

Merged: zxiiro merged 9 commits into main from crcr-l1 on Apr 14, 2026

Collaborator

fffrog commented Apr 10, 2026

Note

Due to GitHub's restrictions on secret injection for PRs opened from forked repositories, this new PR replaces the old one (#415). Please refer to the old PR for the detailed discussion.

Summary

Please refer to this comment for the overall implementation.

  • Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards repository_dispatch events to registered downstream repositories
  • Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
  • Add two GitHub Actions workflows: crcr-on-pr.yml and crcr-deploy-prod.yml

Notes:

This PR needs to wait until this is merged first, so that the tag field in the Terrafile can be updated.

Architecture

GitHub App → Lambda webhook (Function URL) → repository_dispatch → downstream repos
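
The relay flow above can be illustrated with a minimal sketch. It is not the code from this PR: the function names, the shape of the allowlist, and the `client_payload` fields are assumptions made for illustration. It relies only on two documented GitHub behaviours: webhooks are signed with HMAC-SHA256 in the `X-Hub-Signature-256` header, and a `repository_dispatch` event is created by a POST to `/repos/{owner}/{repo}/dispatches`.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Compare GitHub's X-Hub-Signature-256 header with an HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def build_dispatch_request(repo: str, event_type: str, payload: dict) -> dict:
    """Describe (but do not send) the repository_dispatch call for one repo."""
    # client_payload is kept small on purpose; these fields are hypothetical.
    client_payload = {"upstream_action": payload.get("action")}
    return {
        "method": "POST",
        "url": f"https://api.github.com/repos/{repo}/dispatches",
        "body": json.dumps(
            {"event_type": event_type, "client_payload": client_payload}
        ),
    }


def relay(secret, body, signature_header, event_type, allowlist):
    """Return one dispatch request per allowlisted repo, or [] if rejected."""
    if not verify_signature(secret, body, signature_header):
        return []
    payload = json.loads(body)
    return [build_dispatch_request(repo, event_type, payload) for repo in allowlist]
```

In a real handler each returned request would be sent with an installation token from the GitHub App; that part is omitted here because it depends on details not shown in this conversation.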

AWS Resources (us-east-1, account 391835788720):

  • Lambda function (cross_repo_ci_webhook) with Python 3.10 runtime
  • ElastiCache Redis replication group for allowlist caching
  • VPC with private subnets for Lambda ↔ Redis connectivity
  • IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
  • S3 backend for Terraform state
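
As a rough illustration of what "allowlist caching" in Redis could look like, here is a cache-aside sketch. The key name, the TTL, and the `fetch` callback are assumptions, not taken from the PR. The cache object only needs `get(name)` and `setex(name, time, value)`, which match the redis-py client methods of the same names; an in-memory stand-in is used so the sketch runs without a Redis server.

```python
import json
from typing import Callable, List


def load_allowlist(
    cache,
    fetch: Callable[[], List[str]],
    key: str = "crcr:allowlist",  # hypothetical key name
    ttl_seconds: int = 300,       # hypothetical TTL
) -> List[str]:
    """Return the cached allowlist, refreshing it from `fetch` on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    allowlist = fetch()
    cache.setex(key, ttl_seconds, json.dumps(allowlist))
    return allowlist


class InMemoryCache:
    """Stand-in for a Redis client; it stores values but ignores the TTL."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        return self._data.get(name)

    def setex(self, name, time, value):
        self._data[name] = value.encode()
        return True
```

With a real deployment, `InMemoryCache()` would be replaced by a Redis client pointed at the ElastiCache endpoint, reachable only from inside the VPC.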

Test

Multiple deployments and verifications have been completed in a personal AWS environment.

Collaborator Author

fffrog commented Apr 10, 2026

Hi @zxiiro, I apologize for bothering you again.

The job named TFLint & Plan - CRCR / CRCR tflint + terraform plan failed because the S3 bucket and DynamoDB table are missing. Therefore, we need to create the S3 bucket and DynamoDB table manually first, and then ensure that the IAM role has access to the new bucket. After that, I think the job should succeed.

Could you help me create S3/DynamoDB? section should be a good reference. Thank you very much for your help!

Collaborator

zxiiro commented Apr 10, 2026

I think it's failing because the bucket tfstate-pyt-crcr-prod already exists. I'm not able to create it manually either: it says the bucket already exists, but I don't see it in the PyTorch Foundation account.

In AWS, S3 bucket names are globally unique. If your team created a bucket with the same name in your own account while testing, we cannot create a bucket with this name because it is already taken. You'll need to delete the bucket in your test account to release the name for us to deploy, or we will need to update the Terraform config in this PR to use a different bucket name.
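
The "already exists but I don't see it" situation can be told apart from a truly free name by looking at the HTTP status of an S3 HeadBucket request. The helper below is a hedged sketch: the function name and the wording of the results are illustrative, and the mapping reflects the commonly observed behaviour of HeadBucket rather than a guarantee. Other status codes are treated as inconclusive.

```python
def interpret_head_bucket_status(status_code: int) -> str:
    """Map a HeadBucket HTTP status to a likely explanation (a heuristic)."""
    if status_code == 200:
        return "bucket exists and the caller can access it"
    if status_code == 403:
        # Typically: the name is taken, but by an account the caller cannot see.
        return "bucket exists but belongs to (or is restricted to) another account"
    if status_code == 404:
        return "no bucket with this name exists"
    return "inconclusive"


# With boto3 (not imported here), the status can be read roughly like this:
#   try:
#       boto3.client("s3").head_bucket(Bucket=name)   # success means 200
#   except botocore.exceptions.ClientError as err:
#       status = err.response["ResponseMetadata"]["HTTPStatusCode"]
```

Note that after a bucket is deleted, its name may not become available to other accounts immediately.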

fffrog force-pushed the crcr-l1 branch 2 times, most recently from 5a4b5ef to c303f08 on April 11, 2026 13:11

Collaborator Author

fffrog commented Apr 11, 2026

> I think it's failing because the bucket tfstate-pyt-crcr-prod already exists. I'm not able to create it manually either: it says the bucket already exists, but I don't see it in the PyTorch Foundation account.

Thanks for pointing this out. I was previously unaware that S3 bucket names are unique across all AWS accounts, which is a bit surprising to me :D.

> You'll need to delete your bucket in your test account to release the name for us to deploy or we will need to update the terraform config in this PR to use a different bucket name.

I prefer to keep the name tfstate-pyt-crcr-prod for production, so I have already deleted the bucket in my personal AWS account and confirmed that the name has been released by rerunning the failed job. If you have time, please help create the S3 bucket and DynamoDB table; thank you very much.

Collaborator Author

fffrog commented Apr 11, 2026

Hi @ZainRizvi @zxiiro, sorry to bother you again.

I’ve introduced a few additional changes in my latest commits. Below is a brief summary of the rationale behind them. Please let me know if you have any feedback.

Change 1: Sync with Upstream VPC Module

  • Rationale: Based on the discussion here.
  • Modification: Replaced jeanschmidt/terraform-aws-vpc with the official upstream terraform-aws-modules/terraform-aws-vpc (v6.6.1).

Change 2: Simplify Lambda Permissions

  • Rationale: With the AWS provider upgraded to v6.28+ (required by the new VPC module), we can now use native invoked_via_function_url support.
  • Modification: Replaced the previous aws_cloudformation_stack workaround with native aws_lambda_permission resources to simplify the stack.

Change 3: Flatten Directory Structure

  • Rationale:
    • The previous aws/<account>/<region>/ structure added unnecessary complexity for canary testing in personal accounts (requiring manual renaming of S3 buckets and directories to avoid conflicts).
    • Given that CRCR does not require multi-region deployment, a flatter structure is more maintainable.
  • Modification: Flattened the directory structure to aws/.

Collaborator

zxiiro commented Apr 13, 2026

> I prefer to keep the name tfstate-pyt-crcr-prod for production, so I have already freed the name from my personal AWS account, and have confirmed that the name has been freed by rerunning the failed job. If you have time, please help create the S3 and DynamoDB instances; thank you very much.

Alright, the S3 bucket and DynamoDB table have been created. The Terraform Plan jobs appear to be working now. I'll take a look at your newer changes later today.

Collaborator Author

fffrog commented Apr 13, 2026

> Alright, the S3 bucket and DynamoDB table have been created. The Terraform Plan jobs appear to be working now. I'll take a look at your newer changes later today.

Thanks a lot, and the CI has passed :D

Please let me know if you have any questions. We have carefully tested this in my personal AWS account with the environment suffix "canary" 😀

Contributor

ZainRizvi left a comment

Looking good!

Collaborator

zxiiro left a comment

Approved. I have 2 suggestions if you want to handle them in this PR; otherwise I will merge this tomorrow during my daytime.

fffrog added a commit that referenced this pull request Apr 14, 2026
As the title stated.
fffrog and others added 9 commits April 14, 2026 11:44
**Summary**:

- Add Terraform infrastructure for CRCR (Cross-Repository CI Relay), a GitHub webhook relay service for PyTorch out-of-tree backends that receives upstream webhook events via a GitHub App and forwards `repository_dispatch` events to registered downstream repositories
- Infrastructure includes: Lambda function (webhook handler), ElastiCache Redis (allowlist caching), dedicated VPC, IAM roles, and Lambda Function URL
- Add two GitHub Actions workflows: `crcr-on-pr.yml` and `crcr-deploy-prod.yml`

**Architecture**:

GitHub App → Lambda webhook (Function URL) → `repository_dispatch` → downstream repos

- Lambda function (`cross_repo_ci_webhook`) with Python 3.10 runtime
- ElastiCache Redis replication group (`cache.t3.small`) for allowlist caching
- VPC with private subnets for Lambda ↔ Redis connectivity
- IAM role with Secrets Manager, VPC networking, and CloudWatch Logs permissions
- S3 backend for Terraform state

**Test**:

Multiple deployments and verifications have been completed in a personal AWS environment.

Co-authored-by: can-gaa-hou <jiahaochen535@gmail.com>
- Fix some ElastiCache bugs
- Update README.md to match the CRCR code
- Create the Secrets Manager secret via Terraform rather than manually
- Move REDIS_LOGIN from the environment to Secrets Manager
- Make all Terraform labels and names clearer and easier to understand
As the title stated.
- Replaced jeanschmidt/terraform-aws-vpc with the upstream terraform-aws-modules/terraform-aws-vpc (v6.6.1, the latest release).
- Since the AWS provider was bumped to >= 6.28 (required by the new VPC module), invoked_via_function_url is now natively supported. Replaced the aws_cloudformation_stack workaround with native aws_lambda_permission resources.
- Flattened the directory structure from aws/<account>/<region>/ to aws/ for two reasons: first, CRCR's canary environment is deployed under a personal AWS account, and the previous structure required code changes just to switch accounts; second, CRCR doesn't need cross-region deployment, so the nested structure added unnecessary complexity.
As the title stated.
zxiiro added this pull request to the merge queue Apr 14, 2026
Merged via the queue into main with commit d059e98 Apr 14, 2026
12 checks passed
zxiiro deleted the crcr-l1 branch April 14, 2026 14:53