2

I have to run a python script on EMR instance using pyspark to query dynamoDB. I am able to do that by querying dynamodb on pyspark which is executed by including jars with following command.

`pyspark --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar`

I ran following python3 script to query data using pyspark python module.

import time
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext

start_time = time.time()
SparkContext.setSystemProperty("hive.metastore.uris", "thrift://nn1:9083")
sparkSession = (SparkSession
                .builder
                .appName('example-pyspark-read-and-write-from-hive')
                .enableHiveSupport()
                .getOrCreate())
df_load = sparkSession.sql("SELECT * FROM example")
df_load.show()
print(time.time() - start_time)

Which caused following runtime exception for missing jars.

java.lang.ClassNotFoundException Class org.apache.hadoop.hive.dynamodb.DynamoDBSerDe not found

How do I convert the pyspark --jars.. to a pythonic equivalent.

As of now I tried copying the jars from the location /usr/share/... to $SPARK_HOME/libs/jars and adding that path to spark-defaults.conf external class path that had no effect.

1
  • I think this is what you are looking for. Commented Feb 8, 2019 at 16:03

1 Answer 1

3

Use spark-submit command to execute your python script. Example :

spark-submit --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hive.jar,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar script.py
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.