From d00f7443de056ea903d95c8aa71c5ea2124e284e Mon Sep 17 00:00:00 2001
From: Anand Inguva
Date: Tue, 4 Jan 2022 10:24:33 -0500
Subject: [PATCH 01/12] Added custom containers

---
 .../sdks/python-pipeline-dependencies.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index 0f34bbcbd1fc..91aa42803300 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -53,7 +53,7 @@ If your pipeline uses packages that are not available publicly (e.g. packages th
 1. Identify which packages are installed on your machine and are not public. Run the following command:
- pip freeze
+ pip freeze
 This command lists all packages that are installed on your machine, regardless of where they were installed from.
@@ -68,6 +68,19 @@ If your pipeline uses packages that are not available publicly (e.g. packages th
 See the [sdist documentation](https://docs.python.org/2/distutils/sourcedist.html) for more details on this command.
+## Using Custom Containers
+
+You can pass a container image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [See the page on how to run Pipeline with Custom Containers](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
+
+1. If you are passing a custom container image and have a `requirements.txt` file, we recommend you to install the dependencies from the requirements file when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
+
+ # Add these lines with the path to the requirements.txt to the Dockerfile
+
+ COPY /tmp/requirements.txt
+ RUN python -m pip download -r /tmp/requirements.txt --exists-action i --no-binary :all:
+
+**Note:** Follow these [instructions](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the custom container images on top of Apache Beam Python SDK.
+
 ## Multiple File Dependencies
 Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:
@@ -123,3 +136,4 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 --setup_file /path/to/setup.py
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.
+

From 68e851f39055563fe6ec26f53eecdea0ff0274f7 Mon Sep 17 00:00:00 2001
From: Anand Inguva
Date: Wed, 19 Jan 2022 22:38:52 -0500
Subject: [PATCH 02/12] Add instructions for pre-building container image

---
 .../sdks/python-pipeline-dependencies.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index 91aa42803300..07d6346b226a 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -80,6 +80,7 @@ You can pass a container image with all the dependencies that are needed for the
 RUN python -m pip download -r /tmp/requirements.txt --exists-action i --no-binary :all:
 **Note:** Follow these [instructions](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the custom container images on top of Apache Beam Python SDK.
+
 ## Multiple File Dependencies
@@ -137,3 +138,18 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.
+## Pre-building SDK container image
+
+In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
+To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
+1. Provide the container engine. We support docker and cloud build for now.
+
+ --prebuild_sdk_container_enginer
+2. To pass a base image for pre-building dependencies, enable this flag.
If not, apache beam's base image would be used.
+
+ --prebuild_sdk_container_base_image
+3. To push the pre-build image to a docker repository, provide URL to the docker registry by passing
+
+ --docker_registry_push_url
+
+**NOTE**: For now, this feature is available only for the Dataflow.

From 072ead74d6a22f23812df38e28747493051f9dd5 Mon Sep 17 00:00:00 2001
From: Anand Inguva
Date: Thu, 24 Feb 2022 13:40:58 -0500
Subject: [PATCH 03/12] Add documentation on using custom containers, prebuilding workflows

---
 .../en/documentation/runtime/environments.md | 46 ++++++++++++++++++-
 .../sdks/python-pipeline-dependencies.md | 36 ++++++++-------
 2 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index 9588bbfc3a31..e8a8055a18fb 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -46,7 +46,7 @@ Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=i
 1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
 2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).
-
+3. **[Build](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
 #### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles}
 1.
Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from).
@@ -171,6 +171,50 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 By default, no licenses/notices are added to the docker images.
+#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
+Beam offers a way to take a Beam container image and customize it. But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.
+
+
+1. Copy necessary artifacts from Apache Beam base image to your image.
+ ```
+ # This can be any container image,
+ FROM python:3.8-slim
+
+ # Install SDK. (needed for Python SDK)
+ RUN pip install --no-cache-dir apache-beam[gcp]==2.25.0
+
+ # Copy files from official SDK image, including script/dependencies.
+ COPY --from=apache/beam_python3.7_sdk:2.25.0 /opt/apache/beam /opt/apache/beam
+
+ # Perform any addtional customizations if desired
+
+ # Set the entrypoint to Apache Beam SDK launcher.
+ ENTRYPOINT ["/opt/apache/beam/boot"]
+
+ ```
+> **NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
+> The version specified in the `RUN` instruction must match the version used to launch the pipeline.
+> **Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they use to run the pipeline**.
+
+
+2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.
+
+ ```
+ export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0"
+ export IMAGE_NAME="myremoterepo/mybeamsdk"
+ export TAG="latest"
+
+ # Optional - pull the base image into your local Docker daemon to ensure
+ # you have the most up-to-date version of the base image locally.
+ docker pull "${BASE_IMAGE}"
+
+ docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" .
+ ```
+
+3. If your runner is running remotely, retag the image and [push](https://docs.docker.com/engine/reference/commandline/push/) the image to your repository.
+ ```
+ docker push "${IMAGE_NAME}:${TAG}"
+ ```
 ## Running pipelines with custom container images {#running-pipelines}
diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index 07d6346b226a..9a33ebd7d6be 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -45,6 +45,19 @@ If your pipeline uses public packages from the [Python Package Index](https://py
 The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
+> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
+## Custom Containers {#custom-containers}
+
+You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
+
+1. If you are passing a custom container image, `--sdk_container_image` at runtime and specify `--requirements_file` option, we recommend you to install the dependencies from the `--requirements_file` when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
+
+ # Add these lines with the path to the requirements.txt to the Dockerfile
+
+ COPY /tmp/requirements.txt
+ RUN python -m pip download -r /tmp/requirements.txt
+
+**Note:** [Different approaches](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the container images that would be compatible with Apache Beam Runners.
 ## Local or non-PyPI Dependencies {#local-or-nonpypi}
@@ -68,19 +81,6 @@ If your pipeline uses packages that are not available publicly (e.g. packages th
 See the [sdist documentation](https://docs.python.org/2/distutils/sourcedist.html) for more details on this command.
-## Using Custom Containers
-
-You can pass a container image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [See the page on how to run Pipeline with Custom Containers](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
-
-1.
If you are passing a custom container image and have a `requirements.txt` file, we recommend you to install the dependencies from the requirements file when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
-
- # Add these lines with the path to the requirements.txt to the Dockerfile
-
- COPY /tmp/requirements.txt
- RUN python -m pip download -r /tmp/requirements.txt --exists-action i --no-binary :all:
-
-**Note:** Follow these [instructions](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the custom container images on top of Apache Beam Python SDK.
-
 ## Multiple File Dependencies
@@ -142,14 +142,16 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
 To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
-1. Provide the container engine. We support docker and cloud build for now.
+1. Provide the container engine. We support `docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
 --prebuild_sdk_container_enginer
 2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.
 --prebuild_sdk_container_base_image
-3. To push the pre-build image to a docker repository, provide URL to the docker registry by passing
-
+3. To push the container image, pre-built locally with `Docker` , to a remote repository(eg: docker registry), provide URL to the docker registry by passing
+
 --docker_registry_push_url
-**NOTE**: For now, this feature is available only for the Dataflow.
+> To use Docker, the `--prebuild_sdk_container_base_image` should be compatible with Apache Beam Runner.
Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam.
+
+**NOTE**: For now, this feature is available only for the `Dataflow`.

From 422d9ba7422809ccc4147934512dfe92ed64590a Mon Sep 17 00:00:00 2001
From: Anand Inguva
Date: Thu, 24 Feb 2022 14:47:02 -0500
Subject: [PATCH 04/12] Fixup: whitespaces

---
 .../content/en/documentation/runtime/environments.md | 11 +++++------
 .../sdks/python-pipeline-dependencies.md | 7 ++-----
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index e8a8055a18fb..4c79a43cdc28 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -192,22 +192,21 @@ Beam offers a way to take a Beam container image and customize it. But if you ha
 ENTRYPOINT ["/opt/apache/beam/boot"]
 ```
-> **NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
-> The version specified in the `RUN` instruction must match the version used to launch the pipeline.
-> **Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they use to run the pipeline**.
+>**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
+>The version specified in the `RUN` instruction must match the version used to launch the pipeline.
+>**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they use to run the pipeline**.
 2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.
-
 ```
 export BASE_IMAGE="apache/beam_python3.7_sdk:2.25.0"
 export IMAGE_NAME="myremoterepo/mybeamsdk"
 export TAG="latest"
-
+
 # Optional - pull the base image into your local Docker daemon to ensure
 # you have the most up-to-date version of the base image locally.
 docker pull "${BASE_IMAGE}"
-
+
 docker build -f Dockerfile -t "${IMAGE_NAME}:${TAG}" .
 ```
diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index 9a33ebd7d6be..ad75c12e601f 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -45,7 +45,7 @@ If your pipeline uses public packages from the [Python Package Index](https://py
 The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
-> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
+> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
 ## Custom Containers {#custom-containers}
 You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
@@ -81,7 +81,6 @@ If your pipeline uses packages that are not available publicly (e.g. packages th
 See the [sdist documentation](https://docs.python.org/2/distutils/sourcedist.html) for more details on this command.
-
 ## Multiple File Dependencies
 Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:
@@ -142,14 +141,13 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
 To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
-1. Provide the container engine. We support `docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
+1. Provide the container engine. We support `docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
 --prebuild_sdk_container_enginer
 2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.
@@ -151,7 +150,5 @@ To use pre-building the dependencies from `requirements.txt` on the container im
 3. To push the container image, pre-built locally with `Docker` , to a remote repository(eg: docker registry), provide URL to the docker registry by passing
 --docker_registry_push_url
-
 > To use Docker, the `--prebuild_sdk_container_base_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam.
-
 **NOTE**: For now, this feature is available only for the `Dataflow`.

From f9ec6f4b8e35b1bd6cd705221f1945d15a854375 Mon Sep 17 00:00:00 2001
From: Anand Inguva
Date: Mon, 7 Mar 2022 11:58:08 -0500
Subject: [PATCH 05/12] Fix documentation

---
 .../en/documentation/runtime/environments.md | 14 +++++++-------
 .../sdks/python-pipeline-dependencies.md | 9 +++------
 2 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index 4c79a43cdc28..930186c006f5 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -42,11 +42,11 @@ For optimal user experience, we also recommend you use the latest released versi
 ### Building and pushing custom containers
-Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release. You can build customized containers in one of two ways:
+Beam [SDK container images](https://hub.docker.com/search?q=apache%2Fbeam&type=image) are built from Dockerfiles checked into the [Github](https://github.com/apache/beam) repository and published to Docker Hub for every release.
You can build customized containers in one of three ways:
 1. **[Writing a new](#writing-new-dockerfiles) Dockerfile based on a released container image**. This is sufficient for simple additions to the image, such as adding artifacts or environment variables.
 2. **[Modifying](#modifying-dockerfiles) a source Dockerfile in [Beam](https://github.com/apache/beam)**. This method requires building from Beam source but allows for greater customization of the container (including replacement of artifacts or base OS/language versions).
-3. **[Build](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
+3. **[Modifying](#modify-existing-base-image) an existing container image to make it compatible with Apache Beam Runners**. This method is used when users start from an existing image, and configure the image to be compatible with Apache Beam Runners.
 #### Writing a new Dockerfile based on an existing published container image {#writing-new-dockerfiles}
 1. Create a new Dockerfile that designates a base image using the [FROM instruction](https://docs.docker.com/engine/reference/builder/#from).
@@ -171,20 +171,20 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 By default, no licenses/notices are added to the docker images.
-#### Build an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
+#### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
 Beam offers a way to take a Beam container image and customize it.
But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.
 1. Copy necessary artifacts from Apache Beam base image to your image.
 ```
 # This can be any container image,
- FROM python:3.8-slim
+ FROM python:3.7-bullseye
 # Install SDK. (needed for Python SDK)
- RUN pip install --no-cache-dir apache-beam[gcp]==2.25.0
+ RUN pip install --no-cache-dir apache-beam[gcp]==2.35.0
 # Copy files from official SDK image, including script/dependencies.
- COPY --from=apache/beam_python3.7_sdk:2.25.0 /opt/apache/beam /opt/apache/beam
+ COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam
 # Perform any addtional customizations if desired
@@ -194,7 +194,7 @@ Beam offers a way to take a Beam container image and customize it. But if you ha
 ```
 >**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time.
 >The version specified in the `RUN` instruction must match the version used to launch the pipeline.
->**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they use to run the pipeline**.
+>**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they used to run the pipeline**.
 2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker.
diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index ad75c12e601f..942ffbd6d468 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -45,19 +45,16 @@ If your pipeline uses public packages from the [Python Package Index](https://py
 The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
-> **NOTE**: An alternative to `pip check` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the `requirements.txt` with all the dependencies required for the pipeline.
+> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned.
## Custom Containers {#custom-containers}
 You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines).
-1. If you are passing a custom container image, `--sdk_container_image` at runtime and specify `--requirements_file` option, we recommend you to install the dependencies from the `--requirements_file` when building your container image. In this case, you would reduce the pipeline startup time and do not need to pass `--requirements_file` option at runtime.
+1. If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass `--requirements_file` option at runtime, which will reduce the pipeline startup time.
 # Add these lines with the path to the requirements.txt to the Dockerfile
- COPY /tmp/requirements.txt
- RUN python -m pip download -r /tmp/requirements.txt
-
-**Note:** [Different approaches](https://beam.apache.org/documentation/runtime/environments/#writing-new-dockerfiles) to build the container images that would be compatible with Apache Beam Runners.
+ RUN python -m pip install -r /tmp/requirements.txt
 ## Local or non-PyPI Dependencies {#local-or-nonpypi}

From 98facb0332d5e72b0e1cf3dea906872eac6a853e Mon Sep 17 00:00:00 2001
From: Anand Inguva <34158215+AnandInguva@users.noreply.github.com>
Date: Mon, 7 Mar 2022 12:00:08 -0500
Subject: [PATCH 06/12] Apply suggestions from code review

Co-authored-by: tvalentyn
---
 .../www/site/content/en/documentation/runtime/environments.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index 930186c006f5..d4d4bb116720 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -172,7 +172,7 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 By default, no licenses/notices are added to the docker images.
 #### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
-Beam offers a way to take a Beam container image and customize it. But if you have an existing base image to be compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image and provide your custom container image.
+Beam offers a way to take a Beam container image and customize it. But if you have an existing base image that you need to make compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image.
 1. Copy necessary artifacts from Apache Beam base image to your image.
From d2b55baf15776f5573aa4253e3a10db0272973b7 Mon Sep 17 00:00:00 2001
From: Anand Inguva
Date: Mon, 7 Mar 2022 14:22:51 -0500
Subject: [PATCH 07/12] Fix typos and update documentation

---
 .../site/content/en/documentation/runtime/environments.md | 5 ++---
 .../en/documentation/sdks/python-pipeline-dependencies.md | 6 +++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md
index d4d4bb116720..e55ccf885f22 100644
--- a/website/www/site/content/en/documentation/runtime/environments.md
+++ b/website/www/site/content/en/documentation/runtime/environments.md
@@ -172,8 +172,7 @@ creates a Java 8 SDK image with appropriate licenses in `/opt/apache/beam/third_
 By default, no licenses/notices are added to the docker images.
 #### Modifying an existing container image to make it compatible with Apache Beam Runners {#modify-existing-base-image}
-Beam offers a way to take a Beam container image and customize it. But if you have an existing base image that you need to make compatible with Apache Beam Runners, use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process to copy over the necessary artifacts from a default Apache Beam base image.
-
+Beam offers a way to provide your own custom container image. The easiest way to build a new custom image that is compatible with Apache Beam Runners is to use a [multi-stage build](https://docs.docker.com/develop/develop-images/multistage-build/) process. This copies over the necessary artifacts from a default Apache Beam base image to build your custom container image.
 1. Copy necessary artifacts from Apache Beam base image to your image.
 ```
@@ -186,7 +185,7 @@ Beam offers a way to take a Beam container image and customize it. But if you ha
 # Copy files from official SDK image, including script/dependencies.
COPY --from=apache/beam_python3.7_sdk:2.35.0 /opt/apache/beam /opt/apache/beam - # Perform any addtional customizations if desired + # Perform any additional customizations if desired # Set the entrypoint to Apache Beam SDK launcher. ENTRYPOINT ["/opt/apache/beam/boot"] diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md index 942ffbd6d468..72ddc988243a 100644 --- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md +++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md @@ -138,13 +138,13 @@ If your pipeline uses non-Python packages (e.g. packages that require installati In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\ To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below. -1. Provide the container engine. We support `docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled). +1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled). - --prebuild_sdk_container_enginer + --prebuild_sdk_container_engine 2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used. --prebuild_sdk_container_base_image -3. To push the container image, pre-built locally with `Docker` , to a remote repository(eg: docker registry), provide URL to the docker registry by passing +3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the docker registry by passing --docker_registry_push_url > To use Docker, the `--prebuild_sdk_container_base_image` should be compatible with Apache Beam Runner. 
Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. From c367ab32e32bf8e9c91bebab97fc48984c756e48 Mon Sep 17 00:00:00 2001 From: Anand Inguva Date: Wed, 9 Mar 2022 12:52:36 -0500 Subject: [PATCH 08/12] Change container image option for prebuilding --- .../documentation/sdks/python-pipeline-dependencies.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md index 72ddc988243a..4f745fc0a54d 100644 --- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md +++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md @@ -143,9 +143,10 @@ To use pre-building the dependencies from `requirements.txt` on the container im --prebuild_sdk_container_engine 2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used. - --prebuild_sdk_container_base_image -3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the docker registry by passing + --sdk_container_image +3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing --docker_registry_push_url -> To use Docker, the `--prebuild_sdk_container_base_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. -**NOTE**: For now, this feature is available only for the `Dataflow`. 
+> To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. + +**NOTE**: This feature is available only for the `DataflowRunner`. From f7122447e08ac463dcf569b92f7e283ad1772ff5 Mon Sep 17 00:00:00 2001 From: Anand Inguva Date: Fri, 18 Mar 2022 13:30:49 -0400 Subject: [PATCH 09/12] update documentation --- .../sdks/python-pipeline-dependencies.md | 25 ++++++++++++------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md index 4f745fc0a54d..9e9b6c1eef8e 100644 --- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md +++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md @@ -46,6 +46,7 @@ If your pipeline uses public packages from the [Python Package Index](https://py **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them. > **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned. + ## Custom Containers {#custom-containers} You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. 
[Follow the instructions on how to run pipeline with Custom Container images](https://beam.apache.org/documentation/runtime/environments/#running-pipelines). @@ -136,17 +137,23 @@ If your pipeline uses non-Python packages (e.g. packages that require installati ## Pre-building SDK container image -In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\ -To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below. -1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled). +In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time. + However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options mentioned below. +1. Provide the container engine. We support `local_docker`(requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled). + + --prebuild_sdk_container_engine= +2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. If not, Apache beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) would be used. - --prebuild_sdk_container_engine -2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used. + --sdk_container_image= +3. If using `local_docker` engine, provide a URL for the remote registry to which the image will be pushed by passing + + --docker_registry_push_url= - --sdk_container_image -3. 
To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing + # Example: --docker_registry_push_url=/beam + # pre-built image will be pushed to the /beam/beam_python_prebuilt_sdk: + # tag is generated by Beam SDK. - --docker_registry_push_url + **NOTE:** `docker_registry_push_url` must be a remote registry. > To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. -**NOTE**: This feature is available only for the `DataflowRunner`. +**NOTE**: This feature is available only for the `Dataflow Runner v2`. From 758ee0acc6224d3f2a6b2bf306b181f62cd82587 Mon Sep 17 00:00:00 2001 From: Anand Inguva Date: Fri, 18 Mar 2022 15:28:34 -0400 Subject: [PATCH 10/12] Fixup: whitespace --- .../en/documentation/sdks/python-pipeline-dependencies.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md index 9e9b6c1eef8e..7afabfc0b57f 100644 --- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md +++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md @@ -146,9 +146,8 @@ In pipeline execution modes where a Beam runner launches SDK workers in Docker c --sdk_container_image= 3. If using `local_docker` engine, provide a URL for the remote registry to which the image will be pushed by passing - - --docker_registry_push_url= + --docker_registry_push_url= # Example: --docker_registry_push_url=/beam # pre-built image will be pushed to the /beam/beam_python_prebuilt_sdk: # tag is generated by Beam SDK. 
From 860c4f00ce7d54256b98cc8a458d06f73125f7e9 Mon Sep 17 00:00:00 2001 From: Anand Inguva Date: Fri, 18 Mar 2022 17:35:38 -0400 Subject: [PATCH 11/12] Fix up documentation --- .../www/site/content/en/documentation/runtime/environments.md | 2 +- .../en/documentation/sdks/python-pipeline-dependencies.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/website/www/site/content/en/documentation/runtime/environments.md b/website/www/site/content/en/documentation/runtime/environments.md index e55ccf885f22..c38c58b06ff1 100644 --- a/website/www/site/content/en/documentation/runtime/environments.md +++ b/website/www/site/content/en/documentation/runtime/environments.md @@ -193,7 +193,7 @@ Beam offers a way to provide your own custom container image. The easiest way to ``` >**NOTE**: This example assumes necessary dependencies (in this case, Python 3.7 and pip) have been installed on the existing base image. Installing the Apache Beam SDK into the image will ensure that the image has the necessary SDK dependencies and reduce the worker startup time. >The version specified in the `RUN` instruction must match the version used to launch the pipeline.
->**Users need to make sure that whatever base image they use has the same Python/Java interpreter version that they used to run the pipeline**. +>**Make sure that the Python or Java runtime version specified in the base image is the same as the version used to run the pipeline.** 2. [Build](https://docs.docker.com/engine/reference/commandline/build/) and [push](https://docs.docker.com/engine/reference/commandline/push/) the image using Docker. diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md index 7afabfc0b57f..fa4ed86235cf 100644 --- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md +++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md @@ -45,7 +45,7 @@ If your pipeline uses public packages from the [Python Package Index](https://py The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers. **Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them. -> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile the all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned. +> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned. 
## Custom Containers {#custom-containers} @@ -139,7 +139,7 @@ If your pipeline uses non-Python packages (e.g. packages that require installati In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time. However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options mentioned below. -1. Provide the container engine. We support `local_docker`(requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled). +1. Provide the container engine. Beam supports `local_docker`(requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled). --prebuild_sdk_container_engine= 2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. If not, Apache beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) would be used. From 9ad0ba9dfba886132bcca5dca3c02dbba210a911 Mon Sep 17 00:00:00 2001 From: Anand Inguva Date: Tue, 29 Mar 2022 10:04:19 -0400 Subject: [PATCH 12/12] Update doc --- .../sdks/python-pipeline-dependencies.md | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md index fa4ed86235cf..4df2374cf476 100644 --- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md +++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md @@ -142,10 +142,8 @@ In pipeline execution modes where a Beam runner launches SDK workers in Docker c 1. 
Provide the container engine. Beam supports `local_docker`(requires local installation of Docker) and `cloud_build`(requires a GCP project with Cloud Build API enabled). --prebuild_sdk_container_engine= -2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. If not, Apache beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) would be used. - --sdk_container_image= -3. If using `local_docker` engine, provide a URL for the remote registry to which the image will be pushed by passing +2. If using the `local_docker` engine, provide a URL for the remote registry to which the image will be pushed by passing --docker_registry_push_url= # Example: --docker_registry_push_url=/beam @@ -153,6 +151,11 @@ In pipeline execution modes where a Beam runner launches SDK workers in Docker c # tag is generated by Beam SDK. **NOTE:** `docker_registry_push_url` must be a remote registry. -> To use Docker, the `--sdk_container_image` should be compatible with Apache Beam Runner. Please follow the [instructions](https://beam.apache.org/documentation/runtime/environments/#building-and-pushing-custom-containers) on how to build a base container image compatible with Apache Beam. +> The pre-building feature requires the Apache Beam SDK for Python, version 2.25.0 or later. +The container images created during pre-building will persist beyond the pipeline runtime. +Once your job is finished or stopped, you can remove the pre-built image from the container registry. + +>If your pipeline is using a custom container image, most likely you will not benefit from the pre-building step as extra dependencies can be preinstalled in the custom image at build time. If you would still like to use pre-building with custom images, use Apache Beam SDK 2.38.0 or newer and + supply your custom image via the `--sdk_container_image` pipeline option. **NOTE**: This feature is available only for the `Dataflow Runner v2`.
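
The pre-building options described above can be collected in one place before the pipeline is launched. The following is a minimal sketch, not part of the Beam SDK: `build_prebuild_args` is a hypothetical helper, and `registry.example.com/beam` is a placeholder registry URL. It only assembles the flags named in this section and checks that a registry URL is given when the `local_docker` engine is selected.

```python
from typing import List, Optional

# Engines named in the documentation above.
SUPPORTED_ENGINES = ("local_docker", "cloud_build")


def build_prebuild_args(
    engine: str,
    requirements_file: str,
    registry_push_url: Optional[str] = None,
    sdk_container_image: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: assemble pre-building pipeline options as CLI flags."""
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"engine must be one of {SUPPORTED_ENGINES}, got {engine!r}")
    # The section above asks for a remote registry when building with local Docker.
    if engine == "local_docker" and not registry_push_url:
        raise ValueError("local_docker requires a remote registry push URL")

    args = [
        f"--prebuild_sdk_container_engine={engine}",
        f"--requirements_file={requirements_file}",
    ]
    # Optional flags are passed through only when a value was supplied.
    if registry_push_url:
        args.append(f"--docker_registry_push_url={registry_push_url}")
    if sdk_container_image:
        args.append(f"--sdk_container_image={sdk_container_image}")
    return args


if __name__ == "__main__":
    # Placeholder values for illustration only.
    flags = build_prebuild_args(
        "local_docker",
        "requirements.txt",
        registry_push_url="registry.example.com/beam",
    )
    print(" ".join(flags))
```

The resulting list can be appended to the usual runner options and passed to `PipelineOptions(flags)` from `apache_beam.options.pipeline_options`. Validating early on the submitting machine surfaces a missing registry URL before any image build starts.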