-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Description
Problem
Currently, the Azure Filesystem for the Python SDK only supports authenticating using the AZURE_STORAGE_CONNECTION_STRING environment variable. That approach has several limitations:
- The
AZURE_STORAGE_CONNECTION_STRINGenvironment variable must be defined on all systems where the pipeline executes. This is difficult to configure when using Beam worker-pool sidecar containers with the FlinkRunner because Flink may be running in session mode with different Beam pipelines needing different connection strings. - The call to
BlobServiceClient.from_connection_string()does not support all of the authentication methods supported by DefaultAzureCredential. For my use case in particular, it does not support Managed Identity credentials.
Solution
I plan to address the above limitations in a PR by adding new Azure-specific pipeline options described below.
--azure_connection_string
Specifies the Azure Storage Connection String.
Can be used instead of the AZURE_STORAGE_CONNECTION_STRING environment variable or the new --blob_service_endpoint pipeline option described below.
Example:
python -m apache_beam.examples.wordcount \
--input azfs://devstoreaccount1/container/* \
--output azfs://devstoreaccount1/container/py-wordcount-integration \
--azure_connection_string "DefaultEndpointsProtocol=https;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=https://azurite:10000/devstoreaccount1;"--blob_service_endpoint
Specifies the Azure Blob Storage Account Endpoint URL.
Can be used instead of the AZURE_STORAGE_CONNECTION_STRING environment variable or the new --azure_connection_string pipeline option described above.
This pipeline option uses DefaultAzureCredential() to authenticate.
Example:
python -m apache_beam.examples.wordcount \
--input azfs://devstoreaccount1/container/* \
--output azfs://devstoreaccount1/container/py-wordcount-integration \
--blob_service_endpoint https://mystorageaccount.blob.core.windows.net/--azure_managed_identity_client_id
Specifies the Managed Identity Client ID. Can only be used with --blob_service_endpoint.
This pipeline option uses DefaultAzureCredential(managed_identity_client_id=client_id) to authenticate.
Example:
python -m apache_beam.examples.wordcount \
--input azfs://devstoreaccount1/container/* \
--output azfs://devstoreaccount1/container/py-wordcount-integration \
--blob_service_endpoint https://devstoreaccount1.blob.core.windows.net/ \
--azure_managed_identity_client_id ca6cc1a3-4b82-48bd-97ca-8e799c0abff6Testing
Per #20511, the Azure Filesystem does not have integration tests against Azure or Azurite. I plan to add integration tests for the new pipeline options to run against Azurite, similar to how HDFS does its integration tests.
Issue Priority
Priority: 2
Issue Component
Component: io-py-ideas