From e7da5e6717383df31d2687f6cbbb88a59a33d1f2 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 24 Nov 2020 14:26:54 -0800 Subject: [PATCH 01/17] rewrite containers docs for custom containers --- .../www/site/content/en/contribute/_index.md | 2 +- .../en/documentation/runtime/environments.md | 247 +++++++++++------- 2 files changed, 156 insertions(+), 93 deletions(-) diff --git a/website/www/site/content/en/contribute/_index.md b/website/www/site/content/en/contribute/_index.md index 4a095b70ba38..333039f39a4f 100644 --- a/website/www/site/content/en/contribute/_index.md +++ b/website/www/site/content/en/contribute/_index.md @@ -135,7 +135,7 @@ script which is part of the Beam repo: ([template](https://s.apache.org/beam-design-doc-template), [examples](https://s.apache.org/beam-design-docs)) and email it to the [dev@ mailing list](/community/contact-us). -### Development Setup +### Development Setup {#development-setup} 1. If you need help with git forking, cloning, branching, committing, pull requests, and squashing commits, see [Git workflow tips](https://cwiki.apache.org/confluence/display/BEAM/Git+Tips) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index ea7cc17d4bfe..0699fd4b7e2b 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -15,56 +15,167 @@ See the License for the specific language governing permissions and limitations under the License. --> -# Container environments +# Container Environments -The Beam SDK runtime environment is isolated from other runtime systems because the SDK runtime environment is [containerized](https://s.apache.org/beam-fn-api-container-contract) with [Docker](https://www.docker.com/). This means that any execution engine can run the Beam SDK. 
+The Beam SDK runtime environment is [containerized](https://www.docker.com/resources/what-container) with [Docker](https://www.docker.com/) to isolate it from other runtime systems. This means any execution engine can run the Beam SDK. To learn more about the container environment, read the Beam [SDK Harness container contract](https://s.apache.org/beam-fn-api-container-contract). -This page describes how to customize, build, and push Beam SDK container images. +Prebuilt SDK container images are released per supported language version during Beam releases and and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image) -Before you begin, install [Docker](https://www.docker.com/) on your workstation. +## Custom Containers -## Customizing container images +Users may want to customize container images for many reasons, including: -You can add extra dependencies to container images so that you don't have to supply the dependencies to execution engines. +* pre-installing additional dependencies, +* launching third-party software +* further customizing the execution environment -To customize a container image, either: -* [Write a new](#writing-new-dockerfiles) [Dockerfile](https://docs.docker.com/engine/reference/builder/) on top of the original. -* [Modify](#modifying-dockerfiles) the [original Dockerfile](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile) and reimage the container. + This guide describes how to create and use customized containers for the Beam SDK. -It's often easier to write a new Dockerfile. However, by modifying the original Dockerfile, you can customize anything (including the base OS). +### Prerequisites +You will need to have [Docker installed](https://docs.docker.com/get-docker/). -### Writing new Dockerfiles on top of the original {#writing-new-dockerfiles} +In addition, you will need to have a container registry accessible by your execution engine or runner to host a custom container image. 
Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries. -1. Pull a [prebuilt SDK container image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) for your [target](https://docs.docker.com/docker-hub/repos/#searching-for-repositories) language and version. The following example pulls the latest Python SDK: +> **NOTE**: On Nov 20, 2020, Docker Hub put [rate limits](https://www.docker.com/increase-rate-limits) into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times. + +### Building and pushing custom containers + +Beam builds prebuilt images from [Dockerfiles](https://docs.docker.com/engine/reference/builder/). Users can build customized containers in one of two ways: + +1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on an existing prebuilt container**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. +2. **[Modifying](#modifying-dockerfiles) an existing Dockerfile in [Beam source](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions). + +#### Writing new Dockerfiles on top of the original {#writing-new-dockerfiles} + +Steps: + +1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from) + +2. 
Once you have a created a custom Dockerfile, [build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker: + +As an example, this `Dockerfile`: + +``` +FROM apache/beam_python3.7_sdk:2.25.0 + +ENV FOO=bar +COPY /src/path/to/file /dest/path/to/file/ ``` -docker pull apache/beam_python3.7_sdk + +uses the prebuilt Python 3.7 SDK container image [`beam_python3.7_sdk`](https://hub.docker.com/r/apache/beam_python3.7_sdk) tagged at (SDK version) `2.25.0`, and adds an additional environment variable and file to the image. + +``` +export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0" +export IMAGE_NAME="myremoterepo/mybeamsdk" +export TAG="latest" + +# Optional but recommended pull step to pull the base image into your local Docker daemon. +docker pull "${BASE_IMAGE}" +docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . +docker push "${IMAGE_NAME}:${TAG}" ``` -2. [Write a new Dockerfile](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/) that [designates](https://docs.docker.com/engine/reference/builder/#from) the original as its [parent](https://docs.docker.com/glossary/?term=parent%20image). -3. [Build](#building-container-images) a child image. -### Modifying the original Dockerfile {#modifying-dockerfiles} +**NOTE**: After pushing a container image, you should verify the remote image ID and digest should match the local image ID and digest, output from `docker build` or `docker images`. + +#### Modifying the original Dockerfile {#modifying-dockerfiles} in Beam source + +This method will require building image artifacts from Beam source - see the [Contribution guide](contribute/#development-setup) for additional instructions on setting up your development environment. + +1. Clone the `beam` repository. -1. Clone the `beam` repository: ``` git clone https://github.com/apache/beam.git ``` -2. 
Customize the [Dockerfile](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile). If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. -3. [Reimage](#building-container-images) the container. -### Testing customized images +2. Customize the `Dockerfile` for a given language. This file is typically in the `sdks//container` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile).. If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. + +3. Navigate to the root directory of the local copy of your Apache Beam. + +4. Run Gradle with the `docker` target. + + +``` +# The default repository of each SDK +./gradlew :sdks:java:container:java8:docker +./gradlew :sdks:java:container:java11:docker +./gradlew :sdks:go:container:docker +./gradlew :sdks:python:container:py36:docker +./gradlew :sdks:python:container:py37:docker +./gradlew :sdks:python:container:py38:docker + +# Shortcut for building all Python SDKs +./gradlew :sdks:python:container buildAll +``` + +To examine the containers that you built, run `docker images`: + +``` +$> docker images +REPOSITORY TAG IMAGE ID CREATED SIZE +apache/beam_java8_sdk latest ... 1 min ago ... +apache/beam_java11_sdk latest ... 1 min ago ... +apache/beam_python3.6_sdk latest ... 1 min ago ... +apache/beam_python3.7_sdk latest ... 1 min ago ... +apache/beam_python3.8_sdk latest ... 1 min ago ... +apache/beam_go_sdk latest ... 1 min ago ... +``` + +If you did not provide a custom repo/tag as additional parameters (see below), you can retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository. 
+ +``` +export IMAGE_NAME="myrepo/mybeamsdk" +export TAG="latest" + +docker tag apache/beam_python3.6_sdk "${IMAGE_NAME}:${TAG}" +docker push "${IMAGE_NAME}:${TAG}" +``` + +**NOTE**: After pushing a container image, verify the remote image ID and digest matches the local image ID and digest output from `docker_images` + +##### Additional Build Parameters + +The docker Gradle task defines a default image repository and [tag](https://docs.docker.com/engine/reference/commandline/tag/) is the SDK version defined at [gradle.properties](https://github.com/apache/beam/blob/master/gradle.properties). The default repository is the Docker Hub `apache` namespace, and the default tag is the [SDK version](https://github.com/apache/beam/blob/master/gradle.properties) defined at gradle.properties. With these settings, the +`docker` command-line tool will implicitly try to push the container to the Docker Hub Apache repository. + +You can specify a different repository or tag for built images by providing parameters to the build task. For example: + +``` +./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root=example-repo -Pdocker-tag=2019-10-04 +``` -To test a customized image locally, run a pipeline with PortableRunner and set the `--environment_config` flag to the image path: +builds the Python 3.6 container and tags it as `example-repo/beam_python3.6_sdk:2019-10-04`. + +From 2.21.0, a `docker-pull-licenses` flag was introduced to add licenses/notices for third party dependencies to the docker images. For example: + +``` +./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses +``` +creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_party_licenses/`. + +By default, no licenses/notices are added to the docker images. + + +## Using Container Images in Pipelines + +The common method for providing a container image requires using the PortableRunner and setting the `--environment_config` flag to a given image path. 
+Other runners, such as Dataflow, support specifying containers with different flags. {{< highlight class="runner-direct" >}} +export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" + python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output /path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=embed \ ---environment_config=path/to/container/image +--environment_config="${IMAGE}:${TAG}" {{< /highlight >}} {{< highlight class="runner-flink-local" >}} +export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" + # Start a Flink job server on localhost:8099 ./gradlew :runners:flink:1.8:job-server:runShadow @@ -74,10 +185,13 @@ python -m apache_beam.examples.wordcount \ --output=/path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=localhost:8099 \ ---environment_config=path/to/container/image +--environment_config="${IMAGE}:${TAG}" {{< /highlight >}} {{< highlight class="runner-spark-local" >}} +export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" + # Start a Spark job server on localhost:8099 ./gradlew :runners:spark:job-server:runShadow @@ -87,77 +201,26 @@ python -m apache_beam.examples.wordcount \ --output=path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=localhost:8099 \ ---environment_config=path/to/container/image +--environment_config="${IMAGE}:${TAG}" {{< /highlight >}} -## Building container images - -To build Beam SDK container images: - -1. Navigate to the root directory of the local copy of your Apache Beam. -2. Run Gradle with the `docker` target. If you're [building a child image](#writing-new-dockerfiles), set the optional `--file` flag to the new Dockerfile. 
If you're [building an image from an original Dockerfile](#modifying-dockerfiles), ignore the `--file` flag: - -``` -# The default repository of each SDK -./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:java8:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:java:container:java11:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:go:container:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py2:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py35:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py36:docker -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container:py37:docker - -# Shortcut for building all four Python SDKs -./gradlew [--file=path/to/new/Dockerfile] :sdks:python:container buildAll -``` - -From 2.21.0, `docker-pull-licenses` tag was introduced. Licenses/notices of third party dependencies will be added to the docker images when `docker-pull-licenses` was set. -For example, `./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses`. The files are added to `/opt/apache/beam/third_party_licenses/`. -By default, no licenses/notices are added to the docker images. - -To examine the containers that you built, run `docker images` from anywhere in the command line. If you successfully built all of the container images, the command prints a table like the following: -``` -REPOSITORY TAG IMAGE ID CREATED SIZE -apache/beam_java8_sdk latest ... 2 weeks ago ... -apache/beam_java11_sdk latest ... 2 weeks ago ... -apache/beam_python2.7_sdk latest ... 2 weeks ago ... -apache/beam_python3.5_sdk latest ... 2 weeks ago ... -apache/beam_python3.6_sdk latest ... 2 weeks ago ... -apache/beam_python3.7_sdk latest ... 2 weeks ago ... -apache/beam_go_sdk latest ... 2 weeks ago ... 
-``` - -### Overriding default Docker targets - -The default [tag](https://docs.docker.com/engine/reference/commandline/tag/) is sdk_version defined at [gradle.properties](https://github.com/apache/beam/blob/master/gradle.properties) and the default repositories are in the Docker Hub `apache` namespace. -The `docker` command-line tool implicitly [pushes container images](#pushing-container-images) to this location. +{{< highlight class="runner-dataflow" >}} +# Run a pipeline on Dataflow +export IMAGE="my-repo/beam_python_sdk_custom" +export TAG="X.Y.Z" -To tag a local image, set the `docker-tag` option when building the container. The following command tags a Python SDK image with a date. -``` -./gradlew :sdks:python:container:py36:docker -Pdocker-tag=2019-10-04 -``` - -To change the repository, set the `docker-repository-root` option to a new location. The following command sets the `docker-repository-root` -to a repository named `example-repo` on Docker Hub. -``` -./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root=example-repo -``` +export GCS_PATH="gs://my-gcs-bucket" +export GCP_PROJECT="my-gcp-project" +export REGION="us-central1" -## Pushing container images - -After [building a container image](#building-container-images), you can store it in a remote Docker repository. - -The following steps push a Python3.6 SDK image to the [`docker-root-repository` value](#overriding-default-docker-targets). -Please log in to the destination repository as needed. 
- -Upload it to the remote repository: -``` -docker push example-repo/beam_python3.6_sdk -``` - -To download the image again, run `docker pull`: -``` -docker pull example-repo/beam_python3.6_sdk -``` +python -m apache_beam.examples.wordcount \ + --input gs://dataflow-samples/shakespeare/kinglear.txt \ + --output "${GCS_PATH}/counts" \ + --runner DataflowRunner \ + --project $GCP_PROJECT \ + --region $REGION \ + --temp_location "${GCS_PATH}/tmp/" \ + --experiment=use_runner_v2 \ + --worker_harness_container_image="${IMAGE}:${TAG}" -> **Note**: After pushing a container image, the remote image ID and digest match the local image ID and digest. +{{< /highlight >}} From 9c1dc3c879ba820fdfe06a2d9baa64528c395261 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 24 Nov 2020 14:37:55 -0800 Subject: [PATCH 02/17] revisions --- .../content/en/documentation/runtime/environments.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 0699fd4b7e2b..e50b4fbce9b5 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -32,9 +32,10 @@ Users may want to customize container images for many reasons, including: This guide describes how to create and use customized containers for the Beam SDK. ### Prerequisites -You will need to have [Docker installed](https://docs.docker.com/get-docker/). -In addition, you will need to have a container registry accessible by your execution engine or runner to host a custom container image. Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries. +* You will need to have a version of the Beam SDK >= 2.21.0. +* You will need to have [Docker installed](https://docs.docker.com/get-docker/). 
+* You will need to have a container registry accessible by your execution engine or runner to host a custom container image. Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries. > **NOTE**: On Nov 20, 2020, Docker Hub put [rate limits](https://www.docker.com/increase-rate-limits) into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times. @@ -205,7 +206,6 @@ python -m apache_beam.examples.wordcount \ {{< /highlight >}} {{< highlight class="runner-dataflow" >}} -# Run a pipeline on Dataflow export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" @@ -213,6 +213,9 @@ export GCS_PATH="gs://my-gcs-bucket" export GCP_PROJECT="my-gcp-project" export REGION="us-central1" +# Run a pipeline on Dataflow. +# This is a Python batch pipeline, so to run on Dataflow Runner V2 +# you must specify the experiment "use_runner_v2" python -m apache_beam.examples.wordcount \ --input gs://dataflow-samples/shakespeare/kinglear.txt \ --output "${GCS_PATH}/counts" \ From 290e2206aa08e51b71226fb203899a3f0a1cf077 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Mon, 30 Nov 2020 15:07:06 -0800 Subject: [PATCH 03/17] review edits --- .../en/documentation/runtime/environments.md | 54 ++++++++++--------- 1 file changed, 29 insertions(+), 25 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index e50b4fbce9b5..29b173e20661 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -15,46 +15,41 @@ See the License for the specific language governing permissions and limitations under the License. 
--> -# Container Environments +# Container environments The Beam SDK runtime environment is [containerized](https://www.docker.com/resources/what-container) with [Docker](https://www.docker.com/) to isolate it from other runtime systems. This means any execution engine can run the Beam SDK. To learn more about the container environment, read the Beam [SDK Harness container contract](https://s.apache.org/beam-fn-api-container-contract). -Prebuilt SDK container images are released per supported language version during Beam releases and and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image) +Prebuilt SDK container images are released per supported language during Beam releases and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image) -## Custom Containers +## Custom containers -Users may want to customize container images for many reasons, including: +You may want to customize container images for many reasons, including: -* pre-installing additional dependencies, -* launching third-party software -* further customizing the execution environment +* Pre-installing additional dependencies, +* Launching third-party software +* Further customizing the execution environment This guide describes how to create and use customized containers for the Beam SDK. ### Prerequisites -* You will need to have a version of the Beam SDK >= 2.21.0. * You will need to have [Docker installed](https://docs.docker.com/get-docker/). -* You will need to have a container registry accessible by your execution engine or runner to host a custom container image. Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries. +* You will need to have a container registry accessible by your execution engine or runner to host a custom container image. 
Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries like [Google Container Registry](https://cloud.google.com/container-registry) (GCR) or [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (ECR). > **NOTE**: On Nov 20, 2020, Docker Hub put [rate limits](https://www.docker.com/increase-rate-limits) into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times. ### Building and pushing custom containers -Beam builds prebuilt images from [Dockerfiles](https://docs.docker.com/engine/reference/builder/). Users can build customized containers in one of two ways: +Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways: -1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on an existing prebuilt container**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. -2. **[Modifying](#modifying-dockerfiles) an existing Dockerfile in [Beam source](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions). +1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on an existing prebuilt container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. +2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions). 
-#### Writing new Dockerfiles on top of the original {#writing-new-dockerfiles} +#### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles} Steps: -1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from) - -2. Once you have a created a custom Dockerfile, [build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker: - -As an example, this `Dockerfile`: +1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from). As an example, this `Dockerfile`: ``` FROM apache/beam_python3.7_sdk:2.25.0 @@ -65,22 +60,28 @@ COPY /src/path/to/file /dest/path/to/file/ uses the prebuilt Python 3.7 SDK container image [`beam_python3.7_sdk`](https://hub.docker.com/r/apache/beam_python3.7_sdk) tagged at (SDK version) `2.25.0`, and adds an additional environment variable and file to the image. + +2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker. + + ``` export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0" export IMAGE_NAME="myremoterepo/mybeamsdk" export TAG="latest" -# Optional but recommended pull step to pull the base image into your local Docker daemon. +# Optional - pull the base image into your local Docker daemon to ensure +# you have the most up-to-date version of the base image locally. docker pull "${BASE_IMAGE}" + docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . docker push "${IMAGE_NAME}:${TAG}" ``` **NOTE**: After pushing a container image, you should verify the remote image ID and digest should match the local image ID and digest, output from `docker build` or `docker images`. 
-#### Modifying the original Dockerfile {#modifying-dockerfiles} in Beam source +#### Modifying a source Dockerfile {#modifying-dockerfiles} in Beam -This method will require building image artifacts from Beam source - see the [Contribution guide](contribute/#development-setup) for additional instructions on setting up your development environment. +This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](contribute/#development-setup). 1. Clone the `beam` repository. @@ -88,7 +89,7 @@ This method will require building image artifacts from Beam source - see the [Co git clone https://github.com/apache/beam.git ``` -2. Customize the `Dockerfile` for a given language. This file is typically in the `sdks//container` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile).. If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. +2. Customize the `Dockerfile` for a given language. This file is typically in the `sdks//container` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile). If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. 3. Navigate to the root directory of the local copy of your Apache Beam. 
@@ -133,7 +134,7 @@ docker push "${IMAGE_NAME}:${TAG}" **NOTE**: After pushing a container image, verify the remote image ID and digest matches the local image ID and digest output from `docker_images` -##### Additional Build Parameters +##### Additional build parameters The docker Gradle task defines a default image repository and [tag](https://docs.docker.com/engine/reference/commandline/tag/) is the SDK version defined at [gradle.properties](https://github.com/apache/beam/blob/master/gradle.properties). The default repository is the Docker Hub `apache` namespace, and the default tag is the [SDK version](https://github.com/apache/beam/blob/master/gradle.properties) defined at gradle.properties. With these settings, the `docker` command-line tool will implicitly try to push the container to the Docker Hub Apache repository. @@ -146,7 +147,7 @@ You can specify a different repository or tag for built images by providing para builds the Python 3.6 container and tags it as `example-repo/beam_python3.6_sdk:2019-10-04`. -From 2.21.0, a `docker-pull-licenses` flag was introduced to add licenses/notices for third party dependencies to the docker images. For example: +From Beam 2.21.0 and later, a `docker-pull-licenses` flag was introduced to add licenses/notices for third party dependencies to the docker images. For example: ``` ./gradlew :sdks:java:container:java8:docker -Pdocker-pull-licenses @@ -156,11 +157,13 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_ By default, no licenses/notices are added to the docker images. -## Using Container Images in Pipelines +## Using container images in pipelines The common method for providing a container image requires using the PortableRunner and setting the `--environment_config` flag to a given image path. Other runners, such as Dataflow, support specifying containers with different flags. +> **NOTE**: The Dataflow runner requires Beam SDK version >= 2.21.0. 
+ {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" @@ -216,6 +219,7 @@ export REGION="us-central1" # Run a pipeline on Dataflow. # This is a Python batch pipeline, so to run on Dataflow Runner V2 # you must specify the experiment "use_runner_v2" + python -m apache_beam.examples.wordcount \ --input gs://dataflow-samples/shakespeare/kinglear.txt \ --output "${GCS_PATH}/counts" \ From 3b17a8c496df80999c4106ee6868e39810e36b5c Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Mon, 30 Nov 2020 15:12:04 -0800 Subject: [PATCH 04/17] add gcr note --- .../site/content/en/documentation/runtime/environments.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 29b173e20661..1510ce016f95 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -209,13 +209,15 @@ python -m apache_beam.examples.wordcount \ {{< /highlight >}} {{< highlight class="runner-dataflow" >}} -export IMAGE="my-repo/beam_python_sdk_custom" -export TAG="X.Y.Z" - export GCS_PATH="gs://my-gcs-bucket" export GCP_PROJECT="my-gcp-project" export REGION="us-central1" +# By default, the Dataflow runner will have access to the GCR images +# under the same project. +export IMAGE="gcr.io/$GCP_PROJECT/beam_python_sdk_custom" +export TAG="X.Y.Z" + # Run a pipeline on Dataflow. 
# This is a Python batch pipeline, so to run on Dataflow Runner V2 # you must specify the experiment "use_runner_v2" From 7244ba7f497ad00d7faff70d7dc49c0629667436 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Mon, 30 Nov 2020 15:26:07 -0800 Subject: [PATCH 05/17] more review edits --- .../content/en/documentation/runtime/environments.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 1510ce016f95..8d23e289e7a4 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -25,19 +25,22 @@ Prebuilt SDK container images are released per supported language during Beam re You may want to customize container images for many reasons, including: -* Pre-installing additional dependencies, -* Launching third-party software +* Pre-installing additional dependencies +* Launching third-party software in the worker environment +* Launching third-party software in the background * Further customizing the execution environment This guide describes how to create and use customized containers for the Beam SDK. ### Prerequisites -* You will need to have [Docker installed](https://docs.docker.com/get-docker/). +* You will need to use Docker, either by [installing Docker tools locally](https://docs.docker.com/get-docker/) or using build services that can run Docker, such as [Google Cloud Build](https://cloud.google.com/cloud-build/docs/building/build-containers). * You will need to have a container registry accessible by your execution engine or runner to host a custom container image. 
Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries like [Google Container Registry](https://cloud.google.com/container-registry) (GCR) or [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (ECR). > **NOTE**: On Nov 20, 2020, Docker Hub put [rate limits](https://www.docker.com/increase-rate-limits) into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times. +For optimal user experience, we also recommend you use the latest released version of Beam. + ### Building and pushing custom containers Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways: @@ -162,8 +165,6 @@ By default, no licenses/notices are added to the docker images. The common method for providing a container image requires using the PortableRunner and setting the `--environment_config` flag to a given image path. Other runners, such as Dataflow, support specifying containers with different flags. -> **NOTE**: The Dataflow runner requires Beam SDK version >= 2.21.0. 
- {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" From 8c3f4678f45aaf5d92e8eb0034e87eabc1ee3545 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Mon, 30 Nov 2020 16:06:04 -0800 Subject: [PATCH 06/17] more fixes --- .../site/content/en/documentation/runtime/environments.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 8d23e289e7a4..839691b5cb57 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -17,9 +17,9 @@ limitations under the License. # Container environments -The Beam SDK runtime environment is [containerized](https://www.docker.com/resources/what-container) with [Docker](https://www.docker.com/) to isolate it from other runtime systems. This means any execution engine can run the Beam SDK. To learn more about the container environment, read the Beam [SDK Harness container contract](https://s.apache.org/beam-fn-api-container-contract). +The Beam SDK runtime environment can be [containerized](https://www.docker.com/resources/what-container) with [Docker](https://www.docker.com/) to isolate it from other runtime systems. To learn more about the container environment, read the Beam [SDK Harness container contract](https://s.apache.org/beam-fn-api-container-contract). -Prebuilt SDK container images are released per supported language during Beam releases and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image) +Prebuilt SDK container images are released per supported language during Beam releases and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image). 
## Custom containers @@ -27,7 +27,6 @@ You may want to customize container images for many reasons, including: * Pre-installing additional dependencies * Launching third-party software in the worker environment -* Launching third-party software in the background * Further customizing the execution environment This guide describes how to create and use customized containers for the Beam SDK. @@ -45,7 +44,7 @@ For optimal user experience, we also recommend you use the latest released versi Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways: -1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on an existing prebuilt container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. +1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables. 2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions). 
#### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles} From 2921f56a9ef879da03c569205d34b0941e05aa61 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Mon, 30 Nov 2020 16:48:11 -0800 Subject: [PATCH 07/17] fix title --- .../www/site/content/en/documentation/runtime/environments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 839691b5cb57..2f5e1b3c7df5 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -81,7 +81,7 @@ docker push "${IMAGE_NAME}:${TAG}" **NOTE**: After pushing a container image, you should verify the remote image ID and digest should match the local image ID and digest, output from `docker build` or `docker images`. -#### Modifying a source Dockerfile {#modifying-dockerfiles} in Beam +#### Modifying a source Dockerfile in Beam {#modifying-dockerfiles} This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](contribute/#development-setup). 
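The NOTE in the hunk above asks readers to check that the remote image digest matches the local one, but does not show how. A minimal sketch, assuming the image has already been pushed with `docker push` so that Docker has recorded a repo digest for it locally. The helper names `digest_of` and `verify_pushed_image` are hypothetical and are not part of Beam or Docker. If the image has been pushed to several repositories, the first recorded repo digest may not be the one you expect.

```shell
# Hypothetical helpers (not part of Beam or Docker) for checking that a pushed
# image matches the local build.

# Extract the "sha256:..." part from a repo digest such as "myrepo/img@sha256:abc".
digest_of() {
  printf '%s\n' "${1##*@}"
}

# Look up the repo digest Docker recorded locally after `docker push`, print it,
# then pull the image by that digest. The pull can only succeed if the registry
# serves content with the same digest.
verify_pushed_image() {
  pushed_image="$1"   # e.g. "myremoterepo/mybeamsdk:latest"
  repo_digest="$(docker inspect --format '{{index .RepoDigests 0}}' "${pushed_image}")" || return 1
  echo "local digest: $(digest_of "${repo_digest}")"
  docker pull "${repo_digest}"
}
```

After `docker push "${IMAGE_NAME}:${TAG}"`, this could be called as `verify_pushed_image "${IMAGE_NAME}:${TAG}"`. The digest column of `docker images --digests` should show the same value.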
From ebb04f5000f9e7cd30cfb9fd6f273f05e3861410 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Wed, 2 Dec 2020 15:54:38 -0800 Subject: [PATCH 08/17] edits --- .../en/documentation/runtime/environments.md | 20 ++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 2f5e1b3c7df5..a1f4b6332464 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -85,13 +85,19 @@ docker push "${IMAGE_NAME}:${TAG}" This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](contribute/#development-setup). -1. Clone the `beam` repository. +1. Clone the `beam` repository. It is recommended that you start from a stable + release branch rather than from master for both customizing the Dockerfile + and building image artifacts, and that you use the same version of the SDK + to run your pipeline with a custom container. ``` +export BEAM_SDK_VERSION="2.26.0" + git clone https://github.com/apache/beam.git +git checkout origin/release-$BEAM_SDK_VERSION ``` -2. Customize the `Dockerfile` for a given language. This file is typically in the `sdks//container` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile). If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. +3. Customize the `Dockerfile` for a given language. This file is typically in the `sdks//container` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile). 
If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. 3. Navigate to the root directory of the local copy of your Apache Beam. @@ -127,8 +133,9 @@ apache/beam_go_sdk latest ... 1 min If you did not provide a custom repo/tag as additional parameters (see below), you can retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository. ``` +export BEAM_SDK_VERSION="2.26.0" export IMAGE_NAME="myrepo/mybeamsdk" -export TAG="latest" +export TAG="${BEAM_SDK_VERSION}-custom" docker tag apache/beam_python3.6_sdk "${IMAGE_NAME}:${TAG}" docker push "${IMAGE_NAME}:${TAG}" @@ -138,16 +145,15 @@ docker push "${IMAGE_NAME}:${TAG}" ##### Additional build parameters -The docker Gradle task defines a default image repository and [tag](https://docs.docker.com/engine/reference/commandline/tag/) is the SDK version defined at [gradle.properties](https://github.com/apache/beam/blob/master/gradle.properties). The default repository is the Docker Hub `apache` namespace, and the default tag is the [SDK version](https://github.com/apache/beam/blob/master/gradle.properties) defined at gradle.properties. With these settings, the -`docker` command-line tool will implicitly try to push the container to the Docker Hub Apache repository. +The docker Gradle task defines a default image repository and [tag](https://docs.docker.com/engine/reference/commandline/tag/) is the SDK version defined at [gradle.properties](https://github.com/apache/beam/blob/master/gradle.properties). The default repository is the Docker Hub `apache` namespace, and the default tag is the [SDK version](https://github.com/apache/beam/blob/master/gradle.properties) defined at gradle.properties. You can specify a different repository or tag for built images by providing parameters to the build task. 
For example: ``` -./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root=example-repo -Pdocker-tag=2019-10-04 +./gradlew :sdks:python:container:py36:docker -Pdocker-repository-root="example-repo" -Pdocker-tag="2.26.0-custom" ``` -builds the Python 3.6 container and tags it as `example-repo/beam_python3.6_sdk:2019-10-04`. +builds the Python 3.6 container and tags it as `example-repo/beam_python3.6_sdk:2.26.0-custom`. From Beam 2.21.0 and later, a `docker-pull-licenses` flag was introduced to add licenses/notices for third party dependencies to the docker images. For example: From b991bd319e6e9437230326d79d17e14e34417192 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Fri, 4 Dec 2020 16:21:07 -0800 Subject: [PATCH 09/17] temp --- .../en/documentation/runtime/environments.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index a1f4b6332464..e62e29410d9b 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -173,18 +173,20 @@ Other runners, such as Dataflow, support specifying containers with different fl {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" +export IMAGE_URL = "${IMAGE}:${TAG}" python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output /path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=embed \ ---environment_config="${IMAGE}:${TAG}" +--environment_options=docker_container_image=$IMAGE_URL {{< /highlight >}} {{< highlight class="runner-flink-local" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" +export IMAGE_URL = "${IMAGE}:${TAG}" # Start a Flink job server on localhost:8099 ./gradlew :runners:flink:1.8:job-server:runShadow @@ -195,12 +197,13 @@ python -m 
apache_beam.examples.wordcount \ --output=/path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=localhost:8099 \ ---environment_config="${IMAGE}:${TAG}" +--environment_options=docker_container_image=$IMAGE_URL {{< /highlight >}} {{< highlight class="runner-spark-local" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" +export IMAGE_URL = "${IMAGE}:${TAG}" # Start a Spark job server on localhost:8099 ./gradlew :runners:spark:job-server:runShadow @@ -211,7 +214,7 @@ python -m apache_beam.examples.wordcount \ --output=path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=localhost:8099 \ ---environment_config="${IMAGE}:${TAG}" +--environment_options=docker_container_image=$IMAGE_URL {{< /highlight >}} {{< highlight class="runner-dataflow" >}} @@ -221,8 +224,9 @@ export REGION="us-central1" # By default, the Dataflow runner will have access to the GCR images # under the same project. -export IMAGE="gcr.io/$GCP_PROJECT/beam_python_sdk_custom" +export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" +export IMAGE_URL = "${IMAGE}:${TAG}" # Run a pipeline on Dataflow. 
# This is a Python batch pipeline, so to run on Dataflow Runner V2 @@ -236,6 +240,6 @@ python -m apache_beam.examples.wordcount \ --region $REGION \ --temp_location "${GCS_PATH}/tmp/" \ --experiment=use_runner_v2 \ - --worker_harness_container_image="${IMAGE}:${TAG}" + --worker_harness_container_image=$IMAGE_URL {{< /highlight >}} From a632089d69663b8be5dba45101e7151e1d1f51e9 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Mon, 7 Dec 2020 14:11:54 -0800 Subject: [PATCH 10/17] update Flink/Spark runners --- .../en/documentation/runtime/environments.md | 37 ++++++++++--------- 1 file changed, 20 insertions(+), 17 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index e62e29410d9b..4ed178bdfb72 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -167,9 +167,17 @@ By default, no licenses/notices are added to the docker images. ## Using container images in pipelines -The common method for providing a container image requires using the PortableRunner and setting the `--environment_config` flag to a given image path. +The common method for providing a container image requires using the +PortableRunner flag `--environment_config` as supported by the Portable +Runner or by runners supported PortableRunner flags. Other runners, such as Dataflow, support specifying containers with different flags. 
+ + {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" @@ -180,7 +188,8 @@ python -m apache_beam.examples.wordcount \ --output /path/to/write/counts \ --runner=PortableRunner \ --job_endpoint=embed \ ---environment_options=docker_container_image=$IMAGE_URL +--environment_type="DOCKER" \ +--environment_config="${IMAGE_URL}" {{< /highlight >}} {{< highlight class="runner-flink-local" >}} @@ -188,16 +197,13 @@ export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" export IMAGE_URL = "${IMAGE}:${TAG}" -# Start a Flink job server on localhost:8099 -./gradlew :runners:flink:1.8:job-server:runShadow - -# Run a pipeline on the Flink job server +# Run a pipeline using the FlinkRunner which starts a Flink job server. python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ ---output=/path/to/write/counts \ ---runner=PortableRunner \ ---job_endpoint=localhost:8099 \ ---environment_options=docker_container_image=$IMAGE_URL +--output=path/to/write/counts \ +--runner=FlinkRunner \ +--environment_type="DOCKER" \ +--environment_config="${IMAGE_URL}" {{< /highlight >}} {{< highlight class="runner-spark-local" >}} @@ -205,16 +211,13 @@ export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" export IMAGE_URL = "${IMAGE}:${TAG}" -# Start a Spark job server on localhost:8099 -./gradlew :runners:spark:job-server:runShadow - -# Run a pipeline on the Spark job server +# Run a pipeline using the SparkRunner which starts the Spark job server python -m apache_beam.examples.wordcount \ --input=/path/to/inputfile \ --output=path/to/write/counts \ ---runner=PortableRunner \ ---job_endpoint=localhost:8099 \ ---environment_options=docker_container_image=$IMAGE_URL +--runner=SparkRunner \ +--environment_type="DOCKER" \ +--environment_config="${IMAGE_URL}" {{< /highlight >}} {{< highlight class="runner-dataflow" >}} From cbf75a0094450657af6f783659d9220fef57f098 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" 
Date: Tue, 8 Dec 2020 11:06:12 -0800 Subject: [PATCH 11/17] clean up Docker instructions --- .../en/documentation/runtime/environments.md | 131 +++++++++--------- 1 file changed, 68 insertions(+), 63 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 4ed178bdfb72..ab4fba23cfa2 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -49,9 +49,7 @@ Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=i #### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles} -Steps: - -1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from). As an example, this `Dockerfile`: +1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from). ``` FROM apache/beam_python3.7_sdk:2.25.0 @@ -60,90 +58,97 @@ ENV FOO=bar COPY /src/path/to/file /dest/path/to/file/ ``` -uses the prebuilt Python 3.7 SDK container image [`beam_python3.7_sdk`](https://hub.docker.com/r/apache/beam_python3.7_sdk) tagged at (SDK version) `2.25.0`, and adds an additional environment variable and file to the image. +This `Dockerfile`: uses the prebuilt Python 3.7 SDK container image [`beam_python3.7_sdk`](https://hub.docker.com/r/apache/beam_python3.7_sdk) tagged at (SDK version) `2.25.0`, and adds an additional environment variable and file to the image. 2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker. 
+ ``` + export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0" + export IMAGE_NAME="myremoterepo/mybeamsdk" + export TAG="latest" -``` -export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0" -export IMAGE_NAME="myremoterepo/mybeamsdk" -export TAG="latest" + # Optional - pull the base image into your local Docker daemon to ensure + # you have the most up-to-date version of the base image locally. + docker pull "${BASE_IMAGE}" -# Optional - pull the base image into your local Docker daemon to ensure -# you have the most up-to-date version of the base image locally. -docker pull "${BASE_IMAGE}" + docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . + ``` -docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . -docker push "${IMAGE_NAME}:${TAG}" -``` +3. If your runner is running remotely, you will need to retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository accessible by your runner. + + ``` + docker push "${IMAGE_NAME}:${TAG}" + ``` -**NOTE**: After pushing a container image, you should verify the remote image ID and digest should match the local image ID and digest, output from `docker build` or `docker images`. +4. After pushing a container image, you should verify the remote image ID and digest should match the local image ID and digest, output from `docker build` or `docker images`. #### Modifying a source Dockerfile in Beam {#modifying-dockerfiles} This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](contribute/#development-setup). -1. Clone the `beam` repository. It is recommended that you start from a stable - release branch rather than from master for both customizing the Dockerfile - and building image artifacts, and that you use the same version of the SDK - to run your pipeline with a custom container. 
+>**NOTE**: It is recommended that you start from a stable release branch (`release-X.XX.X`) corresponding to the same version of the SDK to run your pipeline. Differences in SDK version may result in unexpected errors. -``` -export BEAM_SDK_VERSION="2.26.0" +1. Clone the `beam` repository. -git clone https://github.com/apache/beam.git -git checkout origin/release-$BEAM_SDK_VERSION -``` + ``` + export BEAM_SDK_VERSION="2.26.0" + git clone https://github.com/apache/beam.git + cd beam -3. Customize the `Dockerfile` for a given language. This file is typically in the `sdks//container` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile). If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. + # Save current directory as working directory + export BEAM_WORKDIR=$PWD -3. Navigate to the root directory of the local copy of your Apache Beam. + git checkout origin/release-$BEAM_SDK_VERSION + ``` -4. Run Gradle with the `docker` target. +2. Customize the `Dockerfile` for a given language, typically `sdks//container/Dockerfile` directory (e.g. the [Dockerfile for Python](https://github.com/apache/beam/blob/master/sdks/python/container/Dockerfile). If you're adding dependencies from [PyPI](https://pypi.org/), use [`base_image_requirements.txt`](https://github.com/apache/beam/blob/master/sdks/python/container/base_image_requirements.txt) instead. +3. Return to the root Beam directory and run the Gradle `docker` target for your image. 
-``` -# The default repository of each SDK -./gradlew :sdks:java:container:java8:docker -./gradlew :sdks:java:container:java11:docker -./gradlew :sdks:go:container:docker -./gradlew :sdks:python:container:py36:docker -./gradlew :sdks:python:container:py37:docker -./gradlew :sdks:python:container:py38:docker - -# Shortcut for building all Python SDKs -./gradlew :sdks:python:container buildAll -``` + ``` + cd $BEAM_WORKDIR -To examine the containers that you built, run `docker images`: + # The default repository of each SDK + ./gradlew :sdks:java:container:java8:docker + ./gradlew :sdks:java:container:java11:docker + ./gradlew :sdks:go:container:docker + ./gradlew :sdks:python:container:py36:docker + ./gradlew :sdks:python:container:py37:docker + ./gradlew :sdks:python:container:py38:docker -``` -$> docker images -REPOSITORY TAG IMAGE ID CREATED SIZE -apache/beam_java8_sdk latest ... 1 min ago ... -apache/beam_java11_sdk latest ... 1 min ago ... -apache/beam_python3.6_sdk latest ... 1 min ago ... -apache/beam_python3.7_sdk latest ... 1 min ago ... -apache/beam_python3.8_sdk latest ... 1 min ago ... -apache/beam_go_sdk latest ... 1 min ago ... -``` + # Shortcut for building all Python SDKs + ./gradlew :sdks:python:container buildAll + ``` -If you did not provide a custom repo/tag as additional parameters (see below), you can retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository. +4. Verify the images you built were created by running `docker images`. -``` -export BEAM_SDK_VERSION="2.26.0" -export IMAGE_NAME="myrepo/mybeamsdk" -export TAG="${BEAM_SDK_VERSION}-custom" + ``` + $> docker images --digests + REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE + apache/beam_java8_sdk latest sha256:... ... 1 min ago ... + apache/beam_java11_sdk latest sha256:... ... 1 min ago ... + apache/beam_python3.6_sdk latest sha256:... ... 1 min ago ... + apache/beam_python3.7_sdk latest sha256:... ... 1 min ago ... 
+ apache/beam_python3.8_sdk latest sha256:... ... 1 min ago ... + apache/beam_go_sdk latest sha256:... ... 1 min ago ... + ``` -docker tag apache/beam_python3.6_sdk "${IMAGE_NAME}:${TAG}" -docker push "${IMAGE_NAME}:${TAG}" -``` +5. If your runner is running remotely, you will need to retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository accessible by your runner. + You can also provide a custom repo/tag as [additional parameters](#additional-build-parameters). + + ``` + export BEAM_SDK_VERSION="2.26.0" + export IMAGE_NAME="gcr.io/my-gcp-project/beam_python3.7_sdk" + export TAG="${BEAM_SDK_VERSION}-custom" + + docker tag apache/beam_python3.7_sdk "${IMAGE_NAME}:${TAG}" + docker push "${IMAGE_NAME}:${TAG}" + ``` -**NOTE**: After pushing a container image, verify the remote image ID and digest matches the local image ID and digest output from `docker_images` +6. After pushing a container image, verify the remote image ID and digest matches the local image ID and digest output from `docker_images --digests`. -##### Additional build parameters +#### Additional build parameters{#additional-build-parameters} The docker Gradle task defines a default image repository and [tag](https://docs.docker.com/engine/reference/commandline/tag/) is the SDK version defined at [gradle.properties](https://github.com/apache/beam/blob/master/gradle.properties). The default repository is the Docker Hub `apache` namespace, and the default tag is the [SDK version](https://github.com/apache/beam/blob/master/gradle.properties) defined at gradle.properties. @@ -173,9 +178,9 @@ Runner or by runners supported PortableRunner flags. Other runners, such as Dataflow, support specifying containers with different flags. 
{{< highlight class="runner-direct" >}} From 1e1a84ed9986cad4cbf9f78fcf29adb0695086f9 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 8 Dec 2020 11:41:38 -0800 Subject: [PATCH 12/17] add back a slash --- .../www/site/content/en/documentation/runtime/environments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index ab4fba23cfa2..bf0c35f4ec2d 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -85,7 +85,7 @@ This `Dockerfile`: uses the prebuilt Python 3.7 SDK container image [`beam_pytho #### Modifying a source Dockerfile in Beam {#modifying-dockerfiles} -This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](contribute/#development-setup). +This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](/contribute/#development-setup). >**NOTE**: It is recommended that you start from a stable release branch (`release-X.XX.X`) corresponding to the same version of the SDK to run your pipeline. Differences in SDK version may result in unexpected errors. 
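The runner snippets added in PATCH 09/17 build the image URL with `export IMAGE_URL = "${IMAGE}:${TAG}"`. With spaces around `=`, a POSIX shell passes `=` to `export` as a separate argument and assigns nothing, so `--environment_config="${IMAGE_URL}"` would receive an empty string. A minimal sketch of the assignment without spaces, reusing the placeholder values from those snippets (`my-repo/beam_python_sdk_custom` and `X.Y.Z` are placeholders, not real image names):

```shell
# Placeholder values copied from the runner snippets in this series.
export IMAGE="my-repo/beam_python_sdk_custom"
export TAG="X.Y.Z"

# No spaces around "=": with spaces, export would receive "=" as a separate
# argument and IMAGE_URL would be left without a value.
export IMAGE_URL="${IMAGE}:${TAG}"

# Show the resulting image reference that the runner flags would receive.
echo "${IMAGE_URL}"
```

Running `echo "${IMAGE_URL}"` before launching the pipeline is a cheap way to confirm that the flag will not be empty.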
From 9c223a68849403a13c013bfce419224d2219f8a3 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 8 Dec 2020 12:42:59 -0800 Subject: [PATCH 13/17] add notes --- .../site/content/en/documentation/runtime/environments.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index bf0c35f4ec2d..152bd9553985 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -170,7 +170,7 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_ By default, no licenses/notices are added to the docker images. -## Using container images in pipelines +## Running pipelines with custom container images The common method for providing a container image requires using the PortableRunner flag `--environment_config` as supported by the Portable @@ -183,6 +183,11 @@ Other runners, such as Dataflow, support specifying containers with different fl runners --> +>**NOTE**: Differences in language and SDK version between the container SDK and +pipeline SDK may result in unexpected errors due to incompatibility. For best +results, make sure to use the same stable SDK version for your base container +and when running your pipeline. 
+ {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" From 062023a1d5abed8cc74ddc58ae5b4398956ce978 Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 8 Dec 2020 13:35:12 -0800 Subject: [PATCH 14/17] add troubleshooting section --- .../en/documentation/runtime/environments.md | 23 +++++++++++++++---- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 152bd9553985..f8095cafbd2b 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -183,11 +183,6 @@ Other runners, such as Dataflow, support specifying containers with different fl runners --> ->**NOTE**: Differences in language and SDK version between the container SDK and -pipeline SDK may result in unexpected errors due to incompatibility. For best -results, make sure to use the same stable SDK version for your base container -and when running your pipeline. - {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" @@ -256,3 +251,21 @@ python -m apache_beam.examples.wordcount \ --worker_harness_container_image=$IMAGE_URL {{< /highlight >}} + + +### Troubleshooting/TIps + +* Differences in language and SDK version between the container SDK and + pipeline SDK may result in unexpected errors due to incompatibility. For best + results, make sure to use the same stable SDK version for your base container + and when running your pipeline. +* If you are running into unexpected errors when using remote containers, + make sure that your container exists in the remote repository and can be + accesses. Local runners will attempt to pull remote images and default to + local images if it exists locally. 
If an image cannot be pulled by the local + docker daemon, you may see an log message like: + + ``` + Error response from daemon: manifest for remote.repo/beam_python3.7_sdk:2.25.0-custom not found: manifest unknown: ... + INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Unable to pull image... + ``` From 7c3086e27571daf39eebb83b5841811bee7adaac Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 8 Dec 2020 15:46:57 -0800 Subject: [PATCH 15/17] add anchor --- .../www/site/content/en/documentation/runtime/environments.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index f8095cafbd2b..096280c87bc6 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -170,7 +170,7 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_ By default, no licenses/notices are added to the docker images. 
-## Running pipelines with custom container images +## Running pipelines with custom container images {#running-pipelines} The common method for providing a container image requires using the PortableRunner flag `--environment_config` as supported by the Portable From b7a4fb67fcc787905336ee7cda05a838687a36ac Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Tue, 15 Dec 2020 11:23:05 -0800 Subject: [PATCH 16/17] doc fixes --- .../en/documentation/runtime/environments.md | 14 +++++++++----- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 096280c87bc6..684287c08226 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -253,7 +253,11 @@ python -m apache_beam.examples.wordcount \ {{< /highlight >}} -### Troubleshooting/TIps +### Troubleshooting + +The following section describes some common issues to consider +when you encounter unexpected errors running Beam pipelines with +custom containers. * Differences in language and SDK version between the container SDK and pipeline SDK may result in unexpected errors due to incompatibility. For best @@ -261,10 +265,10 @@ python -m apache_beam.examples.wordcount \ and when running your pipeline. * If you are running into unexpected errors when using remote containers, make sure that your container exists in the remote repository and can be - accesses. Local runners will attempt to pull remote images and default to - local images if it exists locally. If an image cannot be pulled by the local - docker daemon, you may see an log message like: - + accessed by any third-party service, if needed. +* Local runners will attempt to pull remote images and default to local + images. 
If an image cannot be pulled locally (by the docker daemon), + you may see an log message like: ``` Error response from daemon: manifest for remote.repo/beam_python3.7_sdk:2.25.0-custom not found: manifest unknown: ... INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Unable to pull image... From 4c35b8a89825387fee7626fed082e17648b282bd Mon Sep 17 00:00:00 2001 From: "emilyye@google.com" Date: Thu, 17 Dec 2020 13:35:20 -0800 Subject: [PATCH 17/17] no should/will --- .../en/documentation/runtime/environments.md | 25 +++++++------------ 1 file changed, 9 insertions(+), 16 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index 684287c08226..952a04c7a9d8 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -33,8 +33,8 @@ You may want to customize container images for many reasons, including: ### Prerequisites -* You will need to use Docker, either by [installing Docker tools locally](https://docs.docker.com/get-docker/) or using build services that can run Docker, such as [Google Cloud Build](https://cloud.google.com/cloud-build/docs/building/build-containers). -* You will need to have a container registry accessible by your execution engine or runner to host a custom container image. Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries like [Google Container Registry](https://cloud.google.com/container-registry) (GCR) or [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (ECR). +* This guide requires building images using Docker. [Install Docker locally](https://docs.docker.com/get-docker/). 
Some CI/CD platforms like [Google Cloud Build](https://cloud.google.com/cloud-build/docs/building/build-containers) also provide the ability to build images using Docker. +* For remote execution engines/runners, have a container registry to host your custom container image. Options include [Docker Hub](https://hub.docker.com/) or a "self-hosted" repository, including cloud-specific container registries like [Google Container Registry](https://cloud.google.com/container-registry) (GCR) or [Amazon Elastic Container Registry](https://aws.amazon.com/ecr/) (ECR). Make sure your registry can be accessed by your execution engine or runner. > **NOTE**: On Nov 20, 2020, Docker Hub put [rate limits](https://www.docker.com/increase-rate-limits) into effect for anonymous and free authenticated use, which may impact larger pipelines that pull containers several times. @@ -58,7 +58,7 @@ ENV FOO=bar COPY /src/path/to/file /dest/path/to/file/ ``` -This `Dockerfile`: uses the prebuilt Python 3.7 SDK container image [`beam_python3.7_sdk`](https://hub.docker.com/r/apache/beam_python3.7_sdk) tagged at (SDK version) `2.25.0`, and adds an additional environment variable and file to the image. +This `Dockerfile` uses the prebuilt Python 3.7 SDK container image [`beam_python3.7_sdk`](https://hub.docker.com/r/apache/beam_python3.7_sdk) tagged at (SDK version) `2.25.0`, and adds an additional environment variable and file to the image. 2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker. @@ -75,17 +75,17 @@ This `Dockerfile`: uses the prebuilt Python 3.7 SDK container image [`beam_pytho docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" . ``` -3. If your runner is running remotely, you will need to retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository accessible by your runner. +3. 
If your runner is running remotely, retag and [push](https://docs.docker.com/engine/reference/commandline/push/) the image to the appropriate repository. ``` docker push "${IMAGE_NAME}:${TAG}" ``` -4. After pushing a container image, you should verify the remote image ID and digest should match the local image ID and digest, output from `docker build` or `docker images`. +4. After pushing a container image, verify that the remote image ID and digest match the local image ID and digest, output from `docker build` or `docker images`. #### Modifying a source Dockerfile in Beam {#modifying-dockerfiles} -This method will require building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](/contribute/#development-setup). +This method requires building image artifacts from Beam source. For additional instructions on setting up your development environment, see the [Contribution guide](/contribute/#development-setup). >**NOTE**: It is recommended that you start from a stable release branch (`release-X.XX.X`) corresponding to the same version of the SDK to run your pipeline. Differences in SDK version may result in unexpected errors. @@ -134,8 +134,7 @@ This method will require building image artifacts from Beam source. For addition apache/beam_go_sdk latest sha256:... ... 1 min ago ... ``` -5. If your runner is running remotely, you will need to retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker to a remote repository accessible by your runner. - You can also provide a custom repo/tag as [additional parameters](#additional-build-parameters). +5. If your runner is running remotely, retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image to your repository. You can skip this step if you provide a custom repo/tag as [additional parameters](#additional-build-parameters).
``` export BEAM_SDK_VERSION="2.26.0" @@ -177,12 +176,6 @@ PortableRunner flag `--environment_config` as supported by the Portable Runner or by runners supported PortableRunner flags. Other runners, such as Dataflow, support specifying containers with different flags. - - {{< highlight class="runner-direct" >}} export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" @@ -230,7 +223,7 @@ export GCS_PATH="gs://my-gcs-bucket" export GCP_PROJECT="my-gcp-project" export REGION="us-central1" -# By default, the Dataflow runner will have access to the GCR images +# By default, the Dataflow runner has access to the GCR images # under the same project. export IMAGE="my-repo/beam_python_sdk_custom" export TAG="X.Y.Z" @@ -266,7 +259,7 @@ custom containers. * If you are running into unexpected errors when using remote containers, make sure that your container exists in the remote repository and can be accessed by any third-party service, if needed. -* Local runners will attempt to pull remote images and default to local +* Local runners attempt to pull remote images and default to local images. If an image cannot be pulled locally (by the docker daemon), you may see an log message like: ```