
AWS Keys in hadoop conf required when using IAM instance profiles. #276

@a-morales

Description

So I've been trying to set up an EMR cluster with this library and have run into problems along the way. Everything runs locally with no issues, but once I deploy to EMR it fails. I first tried setting the keys in the Hadoop configuration via

  // Resolve credentials via the default chain and copy them into the Hadoop configuration.
  import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

  val credentials = new DefaultAWSCredentialsProviderChain().getCredentials
  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", credentials.getAWSAccessKeyId)
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", credentials.getAWSSecretKey)

This works locally, but on EMR I was getting the following error:

16/09/27 23:32:41 WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: 9E47D0A340D2E04A)
    at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
    at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
    at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
    at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3778)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketLifecycleConfiguration(AmazonS3Client.java:1925)
    at com.amazonaws.services.s3.AmazonS3Client.getBucketLifecycleConfiguration(AmazonS3Client.java:1912)
    at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:127)
    at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:90)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)

After trying to figure this out with no luck, I decided to provide the AWS credentials via IAM instance profiles as described in the README, using https://github.com/databricks/spark-redshift/blob/master/src/it/scala/com/databricks/spark/redshift/STSIntegrationSuite.scala as a reference (roughly the setup in the sketch after the stack trace below). With that in place, I got the following error when running locally:

Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:80)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at org.apache.hadoop.fs.s3native.$Proxy11.initialize(Unknown Source)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:326)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:156)
        at com.databricks.spark.redshift.RedshiftRelation.<init>(RedshiftRelation.scala:52)
        at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:49)
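
For reference, the IAM/temporary-credentials setup looked roughly like the sketch below. It is modeled on STSIntegrationSuite, which passes session credentials through the temporary_aws_* options; the Redshift URL, table, and tempdir are placeholders for my actual values.

  import com.amazonaws.auth.{AWSSessionCredentials, DefaultAWSCredentialsProviderChain}

  // On EMR the default chain resolves to the instance profile, which hands back
  // temporary (session) credentials that include a session token.
  val creds = new DefaultAWSCredentialsProviderChain().getCredentials
    .asInstanceOf[AWSSessionCredentials]

  val df = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
    .option("dbtable", "my_table")
    .option("tempdir", "s3n://my-bucket/temp/")
    .option("temporary_aws_access_key_id", creds.getAWSAccessKeyId)
    .option("temporary_aws_secret_access_key", creds.getAWSSecretKey)
    .option("temporary_aws_session_token", creds.getSessionToken)
    .load()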

After also setting the keys in the Hadoop configuration, I got this to work locally, but when running on EMR I get the same error as when I just set the keys through the Hadoop configuration.

So it seems the keys need to be set in the Hadoop configuration regardless of whether the AWS credentials are also provided through one of the other methods.
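
To be explicit, the combination I currently have working locally is the temporary-credentials read from the sketch above plus the same session credentials copied into the Hadoop configuration, e.g.:

  // In addition to the temporary_aws_* options above, the fs.s3n keys
  // still have to be set for this to work locally.
  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", creds.getAWSAccessKeyId)
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", creds.getAWSSecretKey)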
