
To write a standalone script, I would like to start and configure a Spark context directly from Python. Using the pyspark launcher script, I can set the driver's memory size:

$ /opt/spark-1.6.1/bin/pyspark
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...
$ /opt/spark-1.6.1/bin/pyspark --conf spark.driver.memory=10g
... INFO MemoryStore: MemoryStore started with capacity 7.0 GB ...

But when starting the context from the Python module, the driver's memory size cannot be set:

$ export SPARK_HOME=/opt/spark-1.6.1
$ export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
$ python
>>> from pyspark import SparkConf, SparkContext
>>> sc = SparkContext(conf=SparkConf().set('spark.driver.memory', '10g'))
... INFO MemoryStore: MemoryStore started with capacity 511.5 MB ...

The only solution I know of is to set spark.driver.memory in spark-defaults.conf, which is not satisfactory. As explained in this post, it makes sense that Java/Scala cannot change the driver's memory size once the JVM has started. Is there any way to configure it dynamically from Python, before or when importing the pyspark module?

3 Answers


There is no point in setting spark.driver.memory through the conf as you are doing: by the time the SparkConf is applied, the driver JVM has already been launched. Instead, add this preamble to your code:

import os

memory = '10g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
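For completeness, here is a minimal end-to-end sketch of the same idea. The Runtime.maxMemory() check at the end is my own addition for verification, not part of the answer above; it goes through sc._jvm, which is a private attribute, so treat it as a debugging aid only.

import os

# Must be set before the SparkContext is created, i.e. before
# pyspark launches the driver JVM via spark-submit.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 10g pyspark-shell"

from pyspark import SparkContext

sc = SparkContext()

# Rough sanity check (assumption, uses the private sc._jvm handle):
# ask the driver JVM for its maximum heap size via py4j.
print(sc._jvm.java.lang.Runtime.getRuntime().maxMemory())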

I had this exact same problem and just figured out a hacky way to do it. And it turns out there is an existing answer which takes the same approach. But I'm going to explain why it works.

As you know, the driver memory cannot be set after the JVM starts. But when creating a SparkContext, pyspark starts the JVM by calling spark-submit and passing in pyspark-shell as the command:

# From python/pyspark/java_gateway.py in Spark 1.6:
SPARK_HOME = os.environ["SPARK_HOME"]
# Launch the Py4j gateway using Spark's run command so that we pick up the
# proper classpath and settings from spark-env.sh
on_windows = platform.system() == "Windows"
script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
    submit_args = ' '.join([
        "--conf spark.ui.enabled=false",
        submit_args
    ])
command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)

Notice the PYSPARK_SUBMIT_ARGS environment variable. These are the arguments that the context will send to the spark-submit command.

So as long as you set PYSPARK_SUBMIT_ARGS="--driver-memory=2g pyspark-shell" before you instantiate a new SparkContext, the driver memory setting should take effect. There are multiple ways to set this environment variable, see the answer I linked earlier for one method.
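For instance, you can set it from the shell when launching your script, without touching the Python code at all (myscript.py is just a placeholder name here):

$ PYSPARK_SUBMIT_ARGS="--driver-memory 2g pyspark-shell" python myscript.py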

2 Comments

Thanks a lot for explaining how it works with the actual code! I accepted the other answer as it was working and posted earlier.
@udscbt No worries. I was really excited when I finally figured it out and was going to post my own question/answer (the people gots to know!) when this one popped up... wish I would have found it sooner. All the other questions I found just said "send --driver-memory to spark-submit", but I wasn't using spark-submit (so I thought).

You can pass it through the spark-submit command using the --driver-memory flag.

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 12 --driver-memory 20g \
    --num-executors 52 --executor-cores 6 --executor-memory 30g \
    MySparkApp.py

Put the above command in a shell script and, instead of the hard-coded 20g for the driver memory, use a variable that you can change dynamically, as sketched below.
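A minimal sketch of that idea, assuming the driver memory is passed as the script's first argument (the run_spark.sh name and the 20g default are illustrative):

#!/bin/bash
# Driver memory comes from the first argument; defaults to 20g.
DRIVER_MEMORY=${1:-20g}

spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 12 --driver-memory "$DRIVER_MEMORY" \
    --num-executors 52 --executor-cores 6 --executor-memory 30g \
    MySparkApp.py

Run it as, e.g., ./run_spark.sh 10g to override the default.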
