I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Spark Connect server on the EMR primary node, making sure the JARs required for S3 authentication are on the classpath:
/usr/lib/spark/sbin/start-connect-server.sh \
--conf spark.jars=/usr/lib/hadoop/hadoop-aws.jar,/usr/share/aws/aws-java-sdk/*,/usr/share/aws/aws-java-sdk-v2/* \
--packages org.apache.spark:spark-connect_2.12:3.5.1
In our EMR setup, the instance profile role has only limited access. Instead, we require our users to assume a use-case-specific role, which grants access to that use case's resources, before they can access any data in their Spark jobs. This is also why we don't use EMRFS: if we configured EMRFS authorization to automatically assume a role based on S3 prefixes, developers could unknowingly combine data from two different use cases, which would violate our security principles.
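For context, the per-use-case STS credentials referenced in the snippets below are obtained roughly like this (the helper name and role ARN are illustrative, not part of our actual code):

```python
def assume_use_case_role(role_arn: str, session_name: str = "spark-connect") -> dict:
    """Assume a use-case-specific role and return its temporary STS credentials.

    Sketch only: role_arn is a placeholder, and boto3 is imported lazily so the
    function can be defined without AWS access.
    """
    import boto3

    sts = boto3.client("sts")
    resp = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)
    # The Credentials dict carries AccessKeyId, SecretAccessKey, SessionToken
    return resp["Credentials"]
```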
Normally, when not using Spark Connect, we set the STS credentials using the SparkContext. Since SparkContext is not available when using Spark Connect, I am setting the credentials like this:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("SparkConnectTest")
    .remote("sc://localhost:21100")
    .config("spark.hadoop.fs.s3a.access.key", sts_creds["AccessKeyId"])
    .config("spark.hadoop.fs.s3a.secret.key", sts_creds["SecretAccessKey"])
    .config("spark.hadoop.fs.s3a.session.token", sts_creds["SessionToken"])
    .config("spark.hadoop.fs.s3a.endpoint", "")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)
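For comparison, here is a minimal sketch of the classic (non-Connect) approach mentioned above, setting the credentials on the JVM-side Hadoop configuration through the SparkContext (the function name is ours; sts_creds is the same STS credentials dict as in the snippet above):

```python
def set_s3a_sts_credentials(spark, sts_creds):
    """Classic (non-Connect) sketch: push temporary STS credentials into the
    Hadoop configuration that the S3A filesystem reads.

    Assumes a regular SparkSession where the SparkContext is available.
    """
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.aws.credentials.provider",
                    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hadoop_conf.set("fs.s3a.access.key", sts_creds["AccessKeyId"])
    hadoop_conf.set("fs.s3a.secret.key", sts_creds["SecretAccessKey"])
    hadoop_conf.set("fs.s3a.session.token", sts_creds["SessionToken"])
```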
However, these config parameters do not seem to be propagated to the server at all, and I get an AccessDeniedException when I try to access S3 resources:
SparkConnectGrpcException: (java.nio.file.AccessDeniedException) s3a://<s3-path>: getFileStatus on <s3-path>: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 403, Request ID: XXXXX, Extended Request ID: XXXXXXXXXXXX):null
I have verified that the credentials themselves are fine. In fact, if I set these same credentials on the server via --conf arguments to start-connect-server.sh, everything works as expected and I can access the S3 resources from my client. Obviously that is not a solution, since the STS credentials are temporary and we want developers to provide use-case-specific credentials from the client.
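Concretely, the (unacceptable) workaround that does work is baking the temporary credentials into the server start command; the &lt;...&gt; values are placeholders:

```shell
/usr/lib/spark/sbin/start-connect-server.sh \
  --conf spark.hadoop.fs.s3a.access.key=<access-key> \
  --conf spark.hadoop.fs.s3a.secret.key=<secret-key> \
  --conf spark.hadoop.fs.s3a.session.token=<session-token> \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  --packages org.apache.spark:spark-connect_2.12:3.5.1
```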
So the problem appears to be that these config parameters are not propagated when the Spark Connect session is created.