Description
I've been trying to run this library on an EMR cluster and have hit problems along the way. Everything works when I run locally, but once I deploy to EMR it fails. I first tried setting the keys in the Hadoop configuration via:
val credentials = new DefaultAWSCredentialsProviderChain().getCredentials
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", credentials.getAWSSecretKey)This works locally, but on emr I was getting the following error:
16/09/27 23:32:41 WARN Utils$: An error occurred while trying to read the S3 bucket lifecycle configuration
com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: 9E47D0A340D2E04A)
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1389)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:902)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3778)
at com.amazonaws.services.s3.AmazonS3Client.getBucketLifecycleConfiguration(AmazonS3Client.java:1925)
at com.amazonaws.services.s3.AmazonS3Client.getBucketLifecycleConfiguration(AmazonS3Client.java:1912)
at com.databricks.spark.redshift.Utils$.checkThatBucketHasObjectLifecycleConfiguration(Utils.scala:127)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:90)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$8.apply(DataSourceStrategy.scala:260)
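For context, the read that triggers the error above looks roughly like this (the JDBC URL, table name, and tempdir bucket are placeholders, and I'm assuming a SQLContext named sqlContext):

```scala
// Rough shape of the read; url/dbtable/tempdir stand in for my real values.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/")
  .load()
```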
After trying to figure this out with no luck, I decided to try providing the AWS credentials through IAM instance profiles as described in the README. After setting up my code that way, using https://github.com/databricks/spark-redshift/blob/master/src/it/scala/com/databricks/spark/redshift/STSIntegrationSuite.scala as a reference (roughly the setup sketched after the stack trace below), I got the following error when running locally:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:80)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy11.initialize(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:326)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at com.databricks.spark.redshift.Utils$.assertThatFileSystemIsNotS3BlockFileSystem(Utils.scala:156)
at com.databricks.spark.redshift.RedshiftRelation.<init>(RedshiftRelation.scala:52)
at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:49)
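For reference, this is roughly how I wired up the temporary-credentials path, adapted from STSIntegrationSuite (the JDBC URL, table name, and tempdir bucket are placeholders, I'm assuming a SQLContext named sqlContext, and the exact plumbing in my job may differ slightly):

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClient
import com.amazonaws.services.securitytoken.model.GetSessionTokenRequest

// Fetch temporary STS credentials, following the pattern in STSIntegrationSuite.
val stsClient = new AWSSecurityTokenServiceClient(new DefaultAWSCredentialsProviderChain())
val stsCredentials = stsClient
  .getSessionToken(new GetSessionTokenRequest().withDurationSeconds(900))
  .getCredentials

// Hand the temporary credentials to the data source via the temporary_* options.
val df = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "my_table")                            // placeholder table
  .option("tempdir", "s3n://my-bucket/spark-redshift-tmp/") // placeholder tempdir
  .option("temporary_aws_access_key_id", stsCredentials.getAccessKeyId)
  .option("temporary_aws_secret_access_key", stsCredentials.getSecretAccessKey)
  .option("temporary_aws_session_token", stsCredentials.getSessionToken)
  .load()
```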
After also setting the keys in the Hadoop configuration I got this working locally, but on EMR I still get the same error as when I only set the keys through the Hadoop configuration.
So it seems that the keys need to be set in the Hadoop configuration regardless of which other method of providing the AWS credentials is used.
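For completeness, the Hadoop configuration I end up with locally looks roughly like this. The fs.s3.* property names are the ones the second exception asks for; setting both prefixes is just my workaround, not something the README prescribes:

```scala
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

val credentials = new DefaultAWSCredentialsProviderChain().getCredentials

// Keys under the fs.s3n.* prefix (what I started with)...
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", credentials.getAWSSecretKey)

// ...plus the fs.s3.* variants named in the IllegalArgumentException above.
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", credentials.getAWSAccessKeyId)
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", credentials.getAWSSecretKey)
```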