This repository contains a reference architecture that creates and helps you secure a Google Cloud environment that is ready for you to implement Federated Learning (FL) use cases.
This reference architecture is aimed at cloud platform administrators and data scientists who need to provision and configure a secure environment to run FL workloads in their Google Cloud environment.
This reference architecture implements controls that help you configure and secure your Google Cloud environment to host FL workloads. Workloads are considered untrusted within the Google Cloud environment. Therefore, the cluster is configured to isolate FL workloads from other workloads and from the cluster control plane. We recommend that you grant only the permissions that FL workloads need in order to work as designed.
This reference architecture provisions resources on Google Cloud. The runtime environment is based on Google Kubernetes Engine (GKE). After the initial provisioning, you can extend the infrastructure to GKE clusters running on premises or on other public clouds.
This reference architecture assumes that you are familiar with GKE and Kubernetes.
To deploy this reference architecture you need:
- A Google Cloud project with billing enabled.
- An account with either the Project Owner role (full access) or Granular Access roles.
- The `serviceusage.googleapis.com` API must be enabled on the project. For more information about enabling APIs, see Enabling and disabling services.
- A Git repository to store the environment configuration.
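If the Service Usage API isn't enabled yet, you can enable it from the command line. The following is a sketch using the standard `gcloud` CLI; `<PROJECT_ID>` is a placeholder for your own project ID:

```shell
# Enable the Service Usage API on the target project.
# Replace <PROJECT_ID> with your Google Cloud project ID.
gcloud services enable serviceusage.googleapis.com --project=<PROJECT_ID>

# Verify that the API is now enabled.
gcloud services list --enabled --project=<PROJECT_ID> \
  --filter="config.name=serviceusage.googleapis.com"
```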
You can choose between Project Owner access or granular access for more fine-tuned permissions.

With Project Owner access, the service account has full administrative access to the project:

- `roles/owner`: Full administrative access to the project (Project Owner role).

With granular access, the service account is assigned the following roles to limit access to required resources:

- `roles/artifactregistry.admin`: Grants full administrative access to Artifact Registry, allowing management of repositories and artifacts.
- `roles/browser`: Provides read-only access to browse resources in a project.
- `roles/cloudkms.admin`: Provides full administrative control over Cloud KMS (Key Management Service) resources.
- `roles/compute.networkAdmin`: Grants full control over Compute Engine network resources.
- `roles/container.clusterAdmin`: Provides full control over Kubernetes Engine clusters, including creating and managing clusters.
- `roles/gkehub.editor`: Grants permission to manage Google Kubernetes Engine Hub features.
- `roles/iam.serviceAccountAdmin`: Grants full control over managing service accounts in the project.
- `roles/resourcemanager.projectIamAdmin`: Allows managing IAM policies and roles at the project level.
- `roles/servicenetworking.serviceAgent`: Allows managing service networking configurations.
- `roles/serviceusage.serviceUsageAdmin`: Grants permission to enable and manage services and APIs for a project.
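If you choose granular access, the roles above can be granted with `gcloud`. The following is a sketch; the service account name `tf-deployer` and the project ID are hypothetical placeholders, not values defined by this reference architecture:

```shell
# Hypothetical values; substitute your own project ID and service account.
PROJECT_ID="<PROJECT_ID>"
SA="tf-deployer@${PROJECT_ID}.iam.gserviceaccount.com"

# Grant each granular role listed above to the service account.
for role in \
  roles/artifactregistry.admin \
  roles/browser \
  roles/cloudkms.admin \
  roles/compute.networkAdmin \
  roles/container.clusterAdmin \
  roles/gkehub.editor \
  roles/iam.serviceAccountAdmin \
  roles/resourcemanager.projectIamAdmin \
  roles/servicenetworking.serviceAgent \
  roles/serviceusage.serviceUsageAdmin; do
  gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
    --member="serviceAccount:${SA}" \
    --role="${role}"
done
```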
The following diagram describes the architecture that you create with this reference architecture:
As shown in the preceding diagram, the reference architecture helps you create and configure the following infrastructure components:
- A Virtual Private Cloud (VPC) network and subnets.
- A private GKE cluster that helps you:
  - Isolate cluster nodes from the internet.
  - Limit exposure of your cluster nodes and control plane to the internet.
  - Use shielded GKE nodes.
  - Enable Dataplane V2 for optimized Kubernetes networking.
  - Encrypt cluster secrets at the application layer.
- Dedicated GKE node pools to isolate workloads from each other in dedicated runtime environments.
- For each GKE node pool, a dedicated Kubernetes namespace. The Kubernetes namespace and its resources are treated as a tenant within the GKE cluster.
- For each GKE node, Kubernetes taints to ensure that only the tenant's workloads are schedulable onto the GKE nodes belonging to that tenant.
- A GKE node pool (`system`) to host coordination and management workloads that aren't tied to specific tenants.
- Firewall policies to limit ingress and egress traffic from GKE node pools, unless explicitly allowed.
- Cloud NAT to allow egress traffic to the internet, only if allowed.
- Cloud DNS records to enable Private Google Access, so that workloads within the cluster can access Google APIs without traversing the internet.
- Cloud Identity and Access Management (IAM) service accounts:
  - A service account for GKE nodes in each GKE node pool, with only the minimum permissions needed by GKE.
  - A service account for workloads in each tenant. These service accounts don't have any permissions by default, and map to Kubernetes service accounts using Workload Identity for GKE.
- An Artifact Registry repository to store container images for your workloads.
- Config Sync to sync cluster configuration and policies from a Git repository or an OCI-compliant repository. Users and teams managing workloads should not have permissions to change cluster configuration or modify service mesh resources unless explicitly allowed by your policies.
- An Artifact Registry repository to store Config Sync configurations.
- Policy Controller to enforce policies on resources in the GKE cluster to help you isolate workloads.
- Cloud Service Mesh to control and help secure network traffic.
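To illustrate the tenant isolation described above: a node pool tainted for a tenant only schedules pods that both tolerate the taint and select the tenant's nodes. The following pod spec fragment is a sketch; the `tenant` taint key, namespace name, and image are hypothetical examples, not the exact names this reference architecture generates:

```yaml
# Hypothetical pod spec fragment for a tenant workload.
# The "tenant" taint/label key is an illustrative assumption.
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
  namespace: fl-tenant-a        # the tenant's dedicated namespace
spec:
  nodeSelector:
    tenant: fl-tenant-a         # schedule only onto this tenant's node pool
  tolerations:
    - key: "tenant"
      operator: "Equal"
      value: "fl-tenant-a"
      effect: "NoSchedule"      # tolerate the tenant node pool's taint
  containers:
    - name: app
      image: us-docker.pkg.dev/example/repo/app:latest  # placeholder image
```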
Config Sync applies the following Policy Controller and Cloud Service Mesh controls to each Kubernetes namespace:
- By default, deny all ingress and egress traffic to and from pods. This rule acts as a baseline deny-all rule.
- Allow egress traffic to required cluster resources, such as the GKE control plane.
- Allow egress traffic only to known hosts.
- Allow ingress and egress traffic that originates from within the same namespace.
- Allow ingress and egress traffic between pods in the same namespace.
- Allow egress traffic to Google APIs only using Private Google Access.
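The baseline deny-all behavior and the same-namespace allowance can be expressed with standard Kubernetes NetworkPolicy resources. The following is a sketch, not the exact policies Config Sync applies; the namespace name is a hypothetical example:

```yaml
# Baseline: select all pods in the namespace, allow no traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: fl-tenant-a   # hypothetical tenant namespace
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
---
# Re-allow ingress and egress between pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: fl-tenant-a
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}   # any pod in this namespace
  egress:
    - to:
        - podSelector: {}   # any pod in this namespace
  policyTypes: ["Ingress", "Egress"]
```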
- Open Cloud Shell.

- Clone this Git repository, including submodules:

  ```shell
  git clone --recurse-submodules git@github.com:GoogleCloudPlatform/federated-learning.git
  ```

  If you prefer cloning using the repository web URL, run this command instead:

  ```shell
  git clone --recurse-submodules https://github.com/GoogleCloudPlatform/federated-learning.git
  ```
- Change into the directory where you cloned this repository:

  ```shell
  cd federated-learning
  ```
- Configure the ID of the Google Cloud project where you want to initialize the provisioning and configuration environment. This project will also contain the remote Terraform backend. Add the following content to `accelerated-platforms/platforms/gke/base/_shared_config/terraform.auto.tfvars`:

  ```hcl
  terraform_project_id = "<CONFIG_PROJECT_ID>"
  ```

  Where:

  - `<CONFIG_PROJECT_ID>` is the Google Cloud project ID.
- Configure the ID of the Google Cloud project where you want to deploy the reference architecture by adding the following content to `accelerated-platforms/platforms/gke/base/_shared_config/cluster.auto.tfvars`:

  ```hcl
  cluster_project_id = "<PROJECT_ID>"
  ```

  Where:

  - `<PROJECT_ID>` is the Google Cloud project ID. It can be different from `<CONFIG_PROJECT_ID>`.
- Optionally, configure a unique identifier to append to the name of all the resources in the reference architecture. This identifier lets you distinguish a particular instance of the reference architecture and allows multiple instances to be deployed in the same Google Cloud project. To configure the unique identifier, add the following content to `accelerated-platforms/platforms/gke/base/_shared_config/platform.auto.tfvars`:

  ```hcl
  resource_name_prefix = "<RESOURCE_NAME_PREFIX>"
  platform_name = "<PLATFORM_NAME>"
  ```

  Where:

  - `<RESOURCE_NAME_PREFIX>` and `<PLATFORM_NAME>` are strings that compose the unique identifier appended to the name of all the resources in the reference architecture.

  When you set `resource_name_prefix` and `platform_name`, we recommend that you avoid long strings, because resource name validation might fail if the resulting resource names are too long.

- Run the script to deploy the reference architecture:

  ```shell
  accelerated-platforms/platforms/gke/base/use-cases/federated-learning/deploy.sh
  ```
It takes about 20 minutes to provision the reference architecture.
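Once provisioning finishes, you can fetch cluster credentials and confirm that the cluster is up. The cluster name, region, and project ID below are placeholders, not the values that the deployment script generates:

```shell
# Placeholders: substitute the cluster name, region, and project ID
# that the deployment created.
gcloud container clusters get-credentials <CLUSTER_NAME> \
  --region=<REGION> \
  --project=<PROJECT_ID>

# List nodes and their labels to confirm that the node pools registered.
kubectl get nodes --show-labels

# Confirm that workloads such as Config Sync and Policy Controller are running.
kubectl get pods --all-namespaces
```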
After the deployment of the reference architecture completes, the GKE cluster is ready to host FL workloads. To familiarize yourself with the environment that you provisioned, you can also deploy the following examples in the GKE cluster:
Federated computation in FL use cases is typically defined as either:
- Cross-silo federated computation, where the participating members are organizations or companies, and the number of members is usually small (for example, within a hundred). You can realize cross-silo FL use cases by configuring this reference architecture and by deploying FL workloads in the GKE cluster that it provides.
- Cross-device federated computation, where the participating members are end-user devices such as mobile phones and vehicles. The number of members can reach millions or even tens of millions.
For more information about configuring this reference architecture, see Configure the Federated Learning reference architecture.
For more information about how to troubleshoot issues, see Troubleshooting the reference architecture.
For more information about the controls that this reference architecture implements to help you secure your environment, see GKE security controls.
For a complete overview about how to implement Federated Learning on Google Cloud, see Cross-silo and cross-device federated learning on Google Cloud.
