Description
I've been trying to run this library on an EMR cluster and have hit problems along the way. Everything works when I run locally, but once I deploy to EMR it fails. I first tried setting the keys in the Hadoop configuration via:
val credentials = new DefaultAWSCredentialsProviderChain().getCredentials
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", credentials.getAWSSecretKey)This works locally, but on emr I was getting the following error:
16/09/27 23:32:41 WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: 9E47D0A340D2E04A)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3778)
at com.amazonaws.services.s3.AmazonS3Client.getBucketLifecycleConfiguration(AmazonS3Client.java:1925)
at com.amazonaws.services.s3.AmazonS3Client.getBucketLifecycleConfiguration(AmazonS3Client.java:1912)
at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:127)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:90)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
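For context, the read that triggers the error above looks roughly like this (the JDBC URL, table name, and tempdir bucket are placeholders, and I'm assuming a SQLContext named sqlContext):

```scala
// Rough shape of the read; url/dbtable/tempdir stand in for my real values.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/")
  .load()
```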
After trying to figure this out with no luck, I decided to try providing the AWS credentials through IAM instance profiles as described in the README. After setting up my code that way, using https://github.com/databricks/spark-redshift/blob/master/src/it/scala/com/databricks/spark/redshift/STSIntegrationSuite.scala as a reference (roughly the setup sketched after the stack trace below), I got the following error when running locally:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy11.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:326)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:156)
at com.databricks.spark.redshift.RedshiftRelation.<init>(RedshiftRelation.scala:52)
at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:49)
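For reference, this is roughly how I wired up the temporary-credentials path, adapted from STSIntegrationSuite (the JDBC URL, table name, and tempdir bucket are placeholders, I'm assuming a SQLContext named sqlContext, and the exact plumbing in my job may differ slightly):

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClient
import com.amazonaws.services.securitytoken.model.GetSessionTokenRequest

// Fetch temporary STS credentials, following the pattern in STSIntegrationSuite.
val stsClient = new AWSSecurityTokenServiceClient(new DefaultAWSCredentialsProviderChain())
val stsCredentials = stsClient
  .getSessionToken(new GetSessionTokenRequest().withDurationSeconds(900))
  .getCredentials

// Hand the temporary credentials to the data source via the temporary_* options.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "my_table")                            // placeholder table
  .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/") // placeholder tempdir
  .option("temporary_aws_access_key_id", stsCredentials.getAccessKeyId)
  .option("temporary_aws_secret_access_key", stsCredentials.getSecretAccessKey)
  .option("temporary_aws_session_token", stsCredentials.getSessionToken)
  .load()
```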
After also setting the keys in the Hadoop configuration I got this working locally, but on EMR I still get the same error as when I only set the keys through the Hadoop configuration.
So it seems that the keys need to be set in the Hadoop configuration regardless of which other method of providing the AWS credentials is used.
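For completeness, the Hadoop configuration I end up with locally looks roughly like this. The fs.s3.* property names are the ones the second exception asks for; setting both prefixes is just my workaround, not something the README prescribes:

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

val credentials = new DefaultAWSCredentialsProviderChain().getCredentials

// Keys under the fs.s3n.* prefix (what I started with)...
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", credentials.getAWSSecretKey)

// ...plus the fs.s3.* variants named in the IllegalArgumentException above.
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", credentials.getAWSSecretKey)
```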