
I am trying to get Spark Connect working on Amazon EMR (Spark v3.5.1). I started the Connect server on the EMR primary node, making sure the JARs required for S3 authentication are present on the classpath:

/usr/lib/spark/sbin/start-connect-server.sh \
  --conf spark.jars=/usr/lib/hadoop/hadoop-aws.jar,/usr/share/aws/aws-java-sdk/*,/usr/share/aws/aws-java-sdk-v2/* \
  --packages org.apache.spark:spark-connect_2.12:3.5.1

In our EMR setup, the instance profile role has limited access. Instead, we require our users to assume a use-case-specific role, which has access to use-case-specific resources, before they can access any data in their Spark jobs. This is also why we don't use EMRFS: if we configured EMRFS authorization to automatically assume a role based on S3 prefixes, developers could unknowingly combine data from two different use cases, which violates our security principles.
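For reference, the client-side role assumption looks roughly like this. The role ARN and helper name are hypothetical, and the STS client is passed in explicitly rather than constructed inside the helper:

```python
def assume_use_case_role(sts_client, role_arn, session_name="spark-connect-session"):
    """Assume a use-case-specific role and return its temporary credentials.

    `sts_client` is expected to behave like boto3.client("sts"); the return
    value is the Credentials dict (AccessKeyId, SecretAccessKey, SessionToken).
    """
    resp = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,  # STS credentials are temporary by design
    )
    return resp["Credentials"]


# Typical usage (hypothetical account ID and role name):
#   import boto3
#   sts_creds = assume_use_case_role(
#       boto3.client("sts"), "arn:aws:iam::123456789012:role/use-case-a")
```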

Normally, when not using Spark Connect, we set the STS credentials using the SparkContext. Since SparkContext is not available when using Spark Connect, I am setting the credentials like this:
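Concretely, the pre-Connect approach can be sketched like this. The helper that maps the STS credentials dict onto the s3a keys is our own convention, not a Spark API; the Hadoop-configuration calls in the comment are the standard PySpark route:

```python
def s3a_conf_from_sts(sts_creds):
    """Map an STS Credentials dict onto the Hadoop s3a configuration keys."""
    return {
        "fs.s3a.access.key": sts_creds["AccessKeyId"],
        "fs.s3a.secret.key": sts_creds["SecretAccessKey"],
        "fs.s3a.session.token": sts_creds["SessionToken"],
        "fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
    }


# With a classic (non-Connect) SparkSession, these keys are applied through
# the SparkContext's Hadoop configuration:
#   hconf = spark.sparkContext._jsc.hadoopConfiguration()
#   for key, value in s3a_conf_from_sts(sts_creds).items():
#       hconf.set(key, value)
```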

spark = (
    SparkSession.builder.appName("SparkConnectTest").remote("sc://localhost:21100")
    .config("spark.hadoop.fs.s3a.access.key", sts_creds["AccessKeyId"])
    .config("spark.hadoop.fs.s3a.secret.key", sts_creds["SecretAccessKey"])
    .config("spark.hadoop.fs.s3a.session.token", sts_creds["SessionToken"])
    .config("spark.hadoop.fs.s3a.endpoint", "")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

However, it seems these config params are not propagated to the server at all, and I get an AccessDeniedException when I try to access S3 resources:

SparkConnectGrpcException: (java.nio.file.AccessDeniedException) s3a://<s3-path>: getFileStatus on <s3-path>: software.amazon.awssdk.services.s3.model.S3Exception: null (Service: S3, Status Code: 403, Request ID: XXXXX, Extended Request ID: XXXXXXXXXXXX):null

I have verified that the credentials are fine. In fact, if I set these same credentials on the server via --conf arguments to start-connect-server.sh, everything works as expected, and I am able to access the S3 resources from my client. Obviously, that's not a solution, since STS credentials are temporary and we want developers to provide use-case-specific credentials from the client.

So the problem seems to be the config parameters not being propagated when creating a Spark Connect session.

  • Did you find a solution for this? I am facing exactly the same issue: the config params are not getting propagated, so I can only set the credentials on the server side, where they would be shared in a multi-tenant environment. Commented Aug 14 at 12:13
