Issues connecting to S3 on EKS #11303

@EwanValentine

Description

I'm attempting to use S3 deep storage on EKS, but ingestion fails with a 403 error. I'm not in a position to use an access key/secret pair from our AWS account directly, but the nodes within our K8s cluster have service accounts, and attached to the service account in my Druid cluster's namespace is a role with full permissions on a specific bucket. However, when I attempt to load the sample dataset into Druid, I get an AWS 403 error in the logs.

There's a web identity token file set in the environment variables (the standard IRSA setup), which the AWS SDK normally picks up automatically. I'm also explicitly passing in the region, etc.
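For reference, a quick way to confirm that those IRSA variables are actually reaching the Druid processes (pod and namespace names below are placeholders, substitute your own):

```shell
# Check that EKS injected the web identity variables into the pod.
kubectl exec -n druid ewanstenant-druid-middlemanagers-0 -- \
  env | grep -E 'AWS_(WEB_IDENTITY_TOKEN_FILE|ROLE_ARN|REGION)'

# If the AWS CLI is present in the image, verify the role actually assumes.
# The ARN returned should be the IRSA role, not the node instance role.
kubectl exec -n druid ewanstenant-druid-middlemanagers-0 -- \
  aws sts get-caller-identity
```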

Affected Version

0.20, 0.21, 0.21.1-rc

Description

  • Cluster size: two to three m5.large instances

  • Configurations in use

apiVersion: druid.apache.org/v1alpha1
kind: Druid
metadata:
  name: ewanstenant
spec:
  commonConfigMountPath: /opt/druid/conf/druid/cluster/_common
  serviceAccount: "druid-scaling-spike"
  nodeSelector:
    service: ewanstenant-druid
  tolerations:
    - key: 'dedicated'
      operator: 'Equal'
      value: 'ewanstenant-druid'
      effect: 'NoSchedule'
  securityContext:
    fsGroup: 0
    runAsUser: 0
    runAsGroup: 0
  image: "apache/druid:0.21.1-rc1"
  startScript: /druid.sh
  jvm.options: |-
    -server
    -XX:+UseG1GC
    -Xloggc:gc-%t-%p.log
    -XX:+UseGCLogFileRotation
    -XX:GCLogFileSize=100M
    -XX:NumberOfGCLogFiles=10
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/druid/data/logs
    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+PrintGCApplicationConcurrentTime
    -XX:+PrintAdaptiveSizePolicy
    -XX:+PrintReferenceGC
    -XX:+PrintFlagsFinal
    -Duser.timezone=UTC
    -Dfile.encoding=UTF-8
    -Djava.io.tmpdir=/druid/data
    -Daws.region=eu-west-1
    -Dorg.jboss.logging.provider=slf4j
    -Dlog4j.shutdownCallbackRegistry=org.apache.druid.common.config.Log4jShutdown
    -Dlog4j.shutdownHookEnabled=true
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
  common.runtime.properties: |
    ###############################################
    # service names for coordinator and overlord
    ###############################################
    druid.selectors.indexing.serviceName=druid/overlord
    druid.selectors.coordinator.serviceName=druid/coordinator
    ##################################################
    # Request logging, monitoring, and segment
    ##################################################
    druid.request.logging.type=slf4j
    druid.request.logging.feed=requests
    ##################################################
    # Monitoring ( enable when using prometheus )
    #################################################
    
    ################################################
    # Extensions
    ################################################
    druid.extensions.directory=/opt/druid/extensions
    druid.extensions.loadList=["druid-s3-extensions","postgresql-metadata-storage"]
    ####################################################
    # Enable sql
    ####################################################
    druid.sql.enable=true

    druid.storage.type=s3
    druid.storage.bucket=druid-scaling-spike-deepstore
    druid.storage.baseKey=druid/segments
    druid.indexer.logs.directory=data/logs/
    druid.storage.sse.type=s3
    druid.storage.disableAcl=false


    # druid.storage.type=local
    # druid.storage.storageDirectory=/druid/deepstorage

    druid.metadata.storage.type=derby
    druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/druid/data/derbydb/metadata.db;create=true
    druid.metadata.storage.connector.host=localhost
    druid.metadata.storage.connector.port=1527
    druid.metadata.storage.connector.createTables=true

    druid.zk.service.host=tiny-cluster-zk-0.tiny-cluster-zk
    druid.zk.paths.base=/druid
    druid.zk.service.compress=false

    druid.indexer.logs.type=file
    druid.indexer.logs.directory=/druid/data/indexing-logs
    druid.lookup.enableLookupSyncOnStartup=false
  volumeClaimTemplates:
    - 
      metadata:
        name: deepstorage-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: gp2
  volumeMounts:
    - mountPath: /druid/data
      name: data-volume
    - mountPath: /druid/deepstorage
      name: deepstorage-volume
  volumes:
    - name: data-volume
      emptyDir: {}
    - name: deepstorage-volume
      hostPath:
        path: /tmp/druid/deepstorage
        type: DirectoryOrCreate

  nodes:
    brokers: 
      kind: Deployment
      druid.port: 8080
      nodeType: broker
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/broker"
      env:
        - name: DRUID_XMS
          value: 12000m
        - name: DRUID_XMX
          value: 12000m
        - name: DRUID_MAXDIRECTMEMORYSIZE
          value: 8g
        - name: AWS_REGION
          value: eu-west-1
      replicas: 1
      resources:
        limits:
          cpu: 1
          memory: 8Gi
        requests:
          cpu: 1
          memory: 8Gi
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 10
        failureThreshold: 30
        httpGet:
          path: /druid/broker/v1/readiness
          port: 8080
      runtime.properties: |
         druid.service=druid/broker
         druid.log4j2.sourceCategory=druid/broker
         druid.broker.http.numConnections=5
         # Processing threads and buffers
         druid.processing.buffer.sizeBytes=268435456
         druid.processing.numMergeBuffers=1
         druid.processing.numThreads=4

    coordinators:
      druid.port: 8080
      kind: Deployment
      maxSurge: 2
      maxUnavailable: 0
      nodeType: coordinator
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/master/coordinator-overlord"
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 1Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      env:
        - name: DRUID_XMS
          value: 1g 
        - name: DRUID_XMX
          value: 1g
        - name: AWS_REGION
          value: eu-west-1
      runtime.properties: |
          druid.service=druid/coordinator
          druid.log4j2.sourceCategory=druid/coordinator
          druid.indexer.runner.type=httpRemote
          druid.indexer.queue.startDelay=PT5S
          druid.coordinator.balancer.strategy=cachingCost
          druid.serverview.type=http
          druid.indexer.storage.type=metadata
          druid.coordinator.startDelay=PT10S
          druid.coordinator.period=PT5S
          druid.server.http.numThreads=5000
          druid.coordinator.asOverlord.enabled=true
          druid.coordinator.asOverlord.overlordService=druid/overlord

    historical:
      druid.port: 8080
      nodeType: historical
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/data/historical"
      replicas: 1
      livenessProbe:
        initialDelaySeconds: 1800
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        httpGet:
          path: /druid/historical/v1/readiness
          port: 8080
        periodSeconds: 10
        failureThreshold: 18
      resources:
        limits:
          cpu: 1000m
          memory: 12Gi
        requests:
          cpu: 1000m
          memory: 12Gi
      env:
        - name: DRUID_XMS
          value: 1500m
        - name: DRUID_XMX
          value: 1500m 
        - name: DRUID_MAXDIRECTMEMORYSIZE
          value: 12g
        - name: AWS_REGION
          value: eu-west-1
      runtime.properties: |
        druid.service=druid/historical
        druid.log4j2.sourceCategory=druid/historical
        # HTTP server threads
        druid.server.http.numThreads=10
        # Processing threads and buffers
        druid.processing.buffer.sizeBytes=536870912
        druid.processing.numMergeBuffers=1
        druid.processing.numThreads=2
        # Segment storage 
        druid.segmentCache.locations=[{\"path\":\"/opt/druid/data/historical/segments\",\"maxSize\": 10737418240}]
        druid.server.maxSize=10737418240
        # Query cache
        druid.historical.cache.useCache=true
        druid.historical.cache.populateCache=true
        druid.cache.type=caffeine
        druid.cache.sizeInBytes=256000000
      volumeClaimTemplates:
        -
          metadata:
            name: historical-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: gp2
      volumeMounts:
        -
          mountPath: /opt/druid/data/historical
          name: historical-volume

    middlemanagers:
      druid.port: 8080
      nodeType: middleManager
      nodeConfigMountPath: /opt/druid/conf/druid/cluster/data/middleManager
      env:
        - name: DRUID_XMX
          value: 4096m
        - name: DRUID_XMS
          value: 4096m
        - name: AWS_REGION
          value: eu-west-1
        - name: AWS_DEFAULT_REGION
          value: eu-west-1
      replicas: 1
      resources:
        limits:
          cpu: 1000m
          memory: 6Gi
        requests:
          cpu: 1000m
          memory: 6Gi
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      runtime.properties: |
        druid.service=druid/middleManager
        druid.worker.capacity=3
        druid.indexer.task.baseTaskDir=/opt/druid/data/middlemanager/task
        druid.indexer.runner.javaOpts=-server -XX:MaxDirectMemorySize=10240g -Duser.timezone=UTC -Daws.region=eu-west-1 -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/opt/druid/data/tmp -Dlog4j.debug -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=50 -XX:GCLogFileSize=10m -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:+UseG1GC -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager -XX:HeapDumpPath=/opt/druid/data/logs/peon.%t.%p.hprof -Xms10G -Xmx10G

        # HTTP server threads
        druid.server.http.numThreads=25
        # Processing threads and buffers on Peons
        druid.indexer.fork.property.druid.processing.numMergeBuffers=2
        druid.indexer.fork.property.druid.processing.buffer.sizeBytes=32000000
        druid.indexer.fork.property.druid.processing.numThreads=2
      volumeClaimTemplates:
        -
          metadata:
            name: middlemanagers-volume
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 50Gi
            storageClassName: gp2
      volumeMounts:
        -
          mountPath: /opt/druid/data/historical
          name: middlemanagers-volume

    routers:
      kind: Deployment
      nodeConfigMountPath: "/opt/druid/conf/druid/cluster/query/router"
      livenessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      readinessProbe:
        initialDelaySeconds: 60
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /status/health
          port: 8080
      druid.port: 8080
      env:
        - name: AWS_REGION
          value: eu-west-1
        - name: AWS_DEFAULT_REGION
          value: eu-west-1
        - name: DRUID_XMX
          value: 1024m
        - name: DRUID_XMS
          value: 1024m
      resources:
        limits:
          cpu: 500m
          memory: 2Gi
        requests:
          cpu: 500m
          memory: 2Gi
      nodeType: router
      podDisruptionBudgetSpec:
        maxUnavailable: 1
      replicas: 1
      runtime.properties: |
          druid.service=druid/router
          druid.log4j2.sourceCategory=druid/router
          # HTTP proxy
          druid.router.http.numConnections=5000
          druid.router.http.readTimeout=PT5M
          druid.router.http.numMaxThreads=1000
          druid.server.http.numThreads=1000
          # Service discovery
          druid.router.defaultBrokerServiceName=druid/broker
          druid.router.coordinatorServiceName=druid/coordinator
          druid.router.managementProxy.enabled=true
      services:
        -
          metadata:
            name: router-%s-service
          spec:
            ports:
              -
                name: router-port
                port: 8080
            type: NodePort
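Worth noting for anyone comparing against this spec: `serviceAccount: "druid-scaling-spike"` only grants IAM access if that ServiceAccount carries the IRSA annotation and the IAM role's trust policy names the namespace/SA pair. A rough sanity check (role name is a placeholder):

```shell
# The ServiceAccount should be annotated with the IAM role ARN (IRSA).
kubectl get sa druid-scaling-spike -n <yournamespace> \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# The role's trust policy should allow sts:AssumeRoleWithWebIdentity
# for system:serviceaccount:<yournamespace>:druid-scaling-spike.
aws iam get-role --role-name <your-druid-role> \
  --query 'Role.AssumeRolePolicyDocument'
```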

  • Steps to reproduce the problem
  • Deploy the above using the latest operator version, to an EKS cluster
  • Expose the router port using kubectl port-forward:
$ kubectl port-forward service/router-druid-ewanstenant-routers-service 12345:8080 -n <yournamespace>
  • Load the sample dataset, using the default settings
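The last step can also be reproduced without the web console by posting an ingestion spec straight through the forwarded router port. The spec path below is the wikipedia tutorial file shipped with the Druid distribution; the pod name is a placeholder:

```shell
# Copy the bundled wikipedia ingestion spec out of a Druid pod
# (it ships under quickstart/tutorial/ in the distribution) ...
kubectl exec -n <yournamespace> <a-druid-pod> -- \
  cat /opt/druid/quickstart/tutorial/wikipedia-index.json > wikipedia-index.json

# ... and submit it as a task via the port-forwarded router.
curl -X POST -H 'Content-Type: application/json' \
  --data @wikipedia-index.json \
  http://localhost:12345/druid/indexer/v1/task
```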

  • The error message and stack trace encountered:

{"ingestionStatsAndErrors":{"taskId":"index_parallel_wikipedia_pedgollm_2021-05-25T23:51:09.811Z","payload":{"ingestionState":"BUILD_SEGMENTS","unparseableEvents":{},"rowStats":{"determinePartitions":{"processed":24433,"processedWithError":0,"thrownAway":0,"unparseable":0},"buildSegments":{"processed":24433,"processedWithError":0,"thrownAway":0,"unparseable":0}},"errorMsg":"java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: DJQGKG8Z57V4R2MP; S3 Extended Request ID: IXmXtwpGLsf1mWTrU7sJLx/cM2Cg72GarKfbsAtpt763Wi62fft6odbo/jmQ2nZOJbS6hro0/QY=), S3 Extended Request ID: IXmXtwpGLsf1mWTrU7sJLx/cM2Cg72GarKfbsAtpt763Wi62fft6odbo/jmQ2nZOJbS6hro0/QY=\n\tat org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:938)\n\tat org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:494)\n\tat org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152)\n\tat org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runSequential(ParallelIndexSupervisorTask.java:964)\n\tat org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runTask(ParallelIndexSupervisorTask.java:445)\n\tat org.apache.druid.indexing.common.task.AbstractBatchIndexTask.run(AbstractBatchIndexTask.java:152)\n\tat org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:451)\n\tat org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:423)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)
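One possible cause of exactly this AccessDenied with IRSA: the AWS SDK v1 that Druid bundles only honours the web identity token when `aws-java-sdk-sts` is on the classpath; without it, the credentials chain falls through to the node's instance profile, which may not have the bucket permissions. A quick check against the running image (pod name is a placeholder):

```shell
# WebIdentityTokenCredentialsProvider in AWS SDK v1 needs the STS module.
# If no aws-java-sdk-sts jar shows up, the pod is likely using the node's
# instance profile credentials instead of the IRSA role.
kubectl exec -n druid ewanstenant-druid-middlemanagers-0 -- \
  sh -c 'ls /opt/druid/lib | grep -i aws-java-sdk'
```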