
To write a standalone script, I would like to start and configure a Spark context directly from Python. Using the pyspark launcher script, I can set the driver's memory size:

$ /opt/spark-1.6.1/bin/pyspark
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...
$ /opt/spark-1.6.1/bin/pyspark --conf spark.driver.memory=10g
... INFO MemoryStore: MemoryStore started with capacity 7.0 GB ...

But when starting the context from the Python module, the driver's memory size cannot be set:

$ export SPARK_HOME=/opt/spark-1.6.1
$ export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
$ python
>>> from pyspark import SparkConf, SparkContext
>>> sc = SparkContext(conf=SparkConf().set('spark.driver.memory', '10g'))
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...

The only solution I know of is to set spark.driver.memory in spark-defaults.conf, which is not satisfactory. As explained in this post, it makes sense that Java/Scala cannot change the driver's memory size once the JVM has started. Is there any way to configure it dynamically from Python, before or when importing the pyspark module?

3 Answers


There is no point in setting spark.driver.memory through the conf as you are doing: by the time the SparkConf is applied, the driver JVM has already been launched. Instead, add this preamble to your code:

import os

memory = '10g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
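For completeness, here is a minimal end-to-end sketch of the same idea. The Runtime.maxMemory() check at the end is my own addition for verification, not part of the answer above; it goes through sc._jvm, which is a private attribute, so treat it as a debugging aid only.

import os

# Must be set before the SparkContext is created, i.e. before
# pyspark launches the driver JVM via spark-submit.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 10g pyspark-shell"

from pyspark import SparkContext

sc = SparkContext()

# Rough sanity check (assumption, uses the private sc._jvm handle):
# ask the driver JVM for its maximum heap size via py4j.
print(sc._jvm.java.lang.Runtime.getRuntime().maxMemory())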

I had this exact same problem and just figured out a hacky way to do it. And it turns out there is an existing answer which takes the same approach. But I'm going to explain why it works.

As you know, the driver memory cannot be set after the JVM starts. But when creating a SparkContext, pyspark starts the JVM by calling spark-submit and passing in pyspark-shell as the command:

# From python/pyspark/java_gateway.py in Spark 1.6:
SPARK_HOME = os.environ["SPARK_HOME"]
# Launch the Py4j gateway using Spark's run command so that we pick up the
# proper classpath and settings from spark-env.sh
on_windows = platform.system() == "Windows"
script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
    submit_args = ' '.join([
        "--conf spark.ui.enabled=false",
        submit_args
    ])
command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)

Notice the PYSPARK_SUBMIT_ARGS environment variable. These are the arguments that the context will send to the spark-submit command.

So as long as you set PYSPARK_SUBMIT_ARGS="--driver-memory=2g pyspark-shell" before you instantiate a new SparkContext, the driver memory setting should take effect. There are multiple ways to set this environment variable, see the answer I linked earlier for one method.
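For instance, you can set it from the shell when launching your script, without touching the Python code at all (myscript.py is just a placeholder name here):

$ PYSPARK_SUBMIT_ARGS="--driver-memory 2g pyspark-shell" python myscript.py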

2 Comments

Thanks a lot for explaining how it works with the actual code! I accepted the other answer as it was working and posted earlier.
@udscbt No worries. I was really excited when I finally figured it out and was going to post my own question/answer (the people gots to know!) when this one popped up... wish I would have found it sooner. All the other questions I found just said "send --driver-memory to spark-submit", but I wasn't using spark-submit (so I thought).

You can pass it through the spark-submit command using the --driver-memory flag.

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 12 --driver-memory 20g \
    --num-executors 52 --executor-cores 6 --executor-memory 30g \
    MySparkApp.py

Put the above command in a shell script and, instead of the hard-coded 20g for the driver memory, use a variable that you can change dynamically, as sketched below.
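A minimal sketch of that idea, assuming the driver memory is passed as the script's first argument (the run_spark.sh name and the 20g default are illustrative):

#!/bin/bash
# Driver memory comes from the first argument; defaults to 20g.
DRIVER_MEMORY=${1:-20g}

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 12 --driver-memory "$DRIVER_MEMORY" \
    --num-executors 52 --executor-cores 6 --executor-memory 30g \
    MySparkApp.py

Run it as, e.g., ./run_spark.sh 10g to override the default.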
