I've been trying to partition and write a Spark DataFrame to S3, but I get an error.

df.write.partitionBy("year","month").mode("append")\
    .parquet('s3a://bucket_name/test_folder/')

The error message is:

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: 
Status Code: 403, AWS Service: Amazon S3, AWS Request ID: xxxxxx, 
AWS Error Code: SignatureDoesNotMatch, 
AWS Error Message: The request signature we calculated does not match the signature you provided. Check your key and signing method.

However, when I write the same DataFrame without partitioning, it works:

df.write.mode("append").parquet('s3a://bucket_name/test_folder/')

What could be causing this problem?

1 Answer

I resolved this problem by upgrading from aws-java-sdk 1.7.4 to 1.11.199 and from hadoop-aws 2.7.7 to 3.0.0 in my spark-submit packages.

I set this in my Python file, before creating the SparkSession, using:

import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:3.0.0 pyspark-shell'

You can also pass the same --packages list to spark-submit directly; see the sketch below.
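
As a rough example (the script name here is just a placeholder):

spark-submit \
    --packages com.amazonaws:aws-java-sdk:1.11.199,org.apache.hadoop:hadoop-aws:3.0.0 \
    your_job.py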

I had to rebuild Spark against my own copy of Hadoop 3.0.0 to avoid dependency conflicts; a sketch of the build command follows.
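
If you need to do the same, Spark's Maven build accepts -Dhadoop.version to select the Hadoop version (add -Pyarn if you run on YARN). Treat this as a sketch, since the exact profiles vary by Spark version:

./build/mvn -Dhadoop.version=3.0.0 -DskipTests clean package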

You can read some of my speculation as to the root cause here: https://stackoverflow.com/a/51917228/10239681
