Skip to content

[Feature Request]: Teach Azure Filesystem to authenticate using DefaultAzureCredential in the Python SDK #24210

@creste

Description

@creste

Problem

Currently, the Azure Filesystem for the Python SDK only supports authenticating using the AZURE_STORAGE_CONNECTION_STRING environment variable. That approach has several limitations:

  • The AZURE_STORAGE_CONNECTION_STRING environment variable must be defined on all systems where the pipeline executes. This is difficult to configure when using Beam worker-pool sidecar containers with the FlinkRunner because Flink may be running in session mode with different Beam pipelines needing different connection strings.
  • The call to BlobServiceClient.from_connection_string() does not support all of the authentication methods supported by DefaultAzureCredential. For my use case in particular, it does not support Managed Identity credentials.

Solution

I plan to address the above limitations in a PR by adding new Azure-specific pipeline options described below.

--azure_connection_string

Specifies the Azure Storage Connection String.

Can be used instead of the AZURE_STORAGE_CONNECTION_STRING environment variable or the new --blob_service_endpoint pipeline option described below.

Example:

python -m apache_beam.examples.wordcount \
  --input azfs://devstoreaccount1/container/* \
  --output azfs://devstoreaccount1/container/py-wordcount-integration \
  --azure_connection_string "DefaultEndpointsProtocol=https;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=https://azurite:10000/devstoreaccount1;"

--blob_service_endpoint

Specifies the Azure Blob Storage Account Endpoint URL.

Can be used instead of the AZURE_STORAGE_CONNECTION_STRING environment variable or the new --azure_connection_string pipeline option described above.

This pipeline option uses DefaultAzureCredential() to authenticate.

Example:

python -m apache_beam.examples.wordcount \
  --input azfs://devstoreaccount1/container/* \
  --output azfs://devstoreaccount1/container/py-wordcount-integration \
  --blob_service_endpoint https://mystorageaccount.blob.core.windows.net/

--azure_managed_identity_client_id

Specifies the Managed Identity Client ID. Can only be used with --blob_service_endpoint.

This pipeline option uses DefaultAzureCredential(managed_identity_client_id=client_id) to authenticate.

Example:

python -m apache_beam.examples.wordcount \
  --input azfs://devstoreaccount1/container/* \
  --output azfs://devstoreaccount1/container/py-wordcount-integration \
  --blob_service_endpoint https://devstoreaccount1.blob.core.windows.net/ \
  --azure_managed_identity_client_id ca6cc1a3-4b82-48bd-97ca-8e799c0abff6

Testing

Per #20511, the Azure Filesystem does not have integration tests against Azure or Azurite. I plan to add integration tests for the new pipeline options to run against Azurite, similar to how HDFS does its integration tests.

Issue Priority

Priority: 2

Issue Component

Component: io-py-ideas

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions