
I have set up a Spark on YARN cluster on my laptop and have a problem running multiple concurrent Spark jobs using Python multiprocessing. I am running in yarn-client mode. I tried two approaches:

  • Set up a single SparkContext and create multiple processes to submit jobs. This does not work: the program crashes. I guess a single SparkContext does not support multiple Python processes.
  • For each process, set up a separate SparkContext and submit the job. In this case, the jobs are submitted to YARN successfully, but they run serially: only one job runs at a time while the rest wait in the queue. Is it possible to start multiple jobs concurrently?

    Update on the settings

    YARN:

  • yarn.nodemanager.resource.cpu-vcores 8

  • yarn.nodemanager.resource.memory-mb 11264
  • yarn.scheduler.maximum-allocation-vcores 1

    Spark:

  • SPARK_EXECUTOR_CORES=1

  • SPARK_EXECUTOR_INSTANCES=2
  • SPARK_DRIVER_MEMORY=1G
  • spark.scheduler.mode = FAIR
  • spark.dynamicAllocation.enabled = true
  • spark.shuffle.service.enabled = true

YARN will only run one job at a time, using 3 containers, 3 vcores, and 3 GB of RAM. So there are ample vcores and memory available for the other jobs, but they are not running.

3 Answers


How many CPUs do you have, and how many are required per job? YARN schedules the jobs and assigns what it can on your cluster: if your job requires 8 CPUs and your system has only 8 CPUs, then other jobs will be queued and run serially.

If you requested 4 per job, you would see 2 jobs run in parallel at any one time.
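As a rough sanity check (assuming the settings posted in the question: one vcore for the ApplicationMaster container plus SPARK_EXECUTOR_INSTANCES=2 executors at SPARK_EXECUTOR_CORES=1 each), you can estimate how many jobs would fit on the node by vcores alone:

```python
# Back-of-the-envelope vcore math (assumes the settings posted in the question).
# In yarn-client mode each job needs one vcore for the ApplicationMaster
# container plus executor_instances * executor_cores for the executors.
def vcores_per_job(executor_instances, executor_cores, am_vcores=1):
    return am_vcores + executor_instances * executor_cores

demand = vcores_per_job(executor_instances=2, executor_cores=1)
print(demand)       # 3 vcores per job (matches the "3 containers, 3 vcores" observed)
print(8 // demand)  # 2 such jobs would fit on an 8-vcore node by CPU alone
```

Since two such jobs would fit by vcores alone, CPU is not necessarily what serializes them; the scheduler's other limits also matter.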


2 Comments

How do I request 4 cores per job for YARN? I have posted my settings above; can you have a look and see if they make sense? Thank you!
You can pass --total-executor-cores 4 to the job submission to limit/increase the number of cores to use on the cluster.

I found the solution https://stackoverflow.com/a/33012538/957352

For a single-machine cluster: in the file

/etc/hadoop/conf/capacity-scheduler.xml

I changed the property

yarn.scheduler.capacity.maximum-am-resource-percent

from 0.1 to 0.5.
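For reference, a sketch of the relevant stanza in capacity-scheduler.xml (value shown is the increased one):

```xml
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>
    Maximum fraction of cluster resources that can be used to run
    ApplicationMasters; this effectively caps the number of
    concurrently running applications.
  </description>
</property>
```

This would explain the serial behaviour: with the default 0.1 and the ~11 GB of node memory posted in the question, only about 1.1 GB is available for ApplicationMasters, enough for roughly one AM container, so only one application can be active at a time.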

Comments


I met the same problem as you, and I solved it by setting .config("spark.executor.cores", '1') in PySpark. Here is my code:

from multiprocessing import Pool
from pyspark.sql import SparkSession

def train(db):
    print(db)
    # Each process builds its own SparkSession; spark.executor.cores=1
    # keeps each job's resource request small enough to run concurrently.
    spark = SparkSession \
        .builder \
        .appName("scene_" + str(db)) \
        .config("spark.executor.cores", '1') \
        .getOrCreate()
    print(spark.createDataFrame([[1.0], [2.0]], ['test_column']).collect())

if __name__ == '__main__':
    p = Pool(10)
    for db in range(10):
        p.apply_async(train, args=(db,))
    p.close()
    p.join()
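One caveat with the snippet above (a generic multiprocessing point, not Spark-specific): p.apply_async discards the AsyncResult, so if train() raises, for example when a SparkSession fails to start, the failure is silent. A minimal sketch of keeping the handles and calling .get() so worker errors surface in the parent process (work() is a hypothetical stand-in for train()):

```python
from multiprocessing import Pool

def work(n):
    # Hypothetical stand-in for train(); raises for one input
    # to demonstrate error propagation back to the parent.
    if n == 3:
        raise ValueError("job %d failed" % n)
    return n * n

if __name__ == '__main__':
    with Pool(4) as pool:
        # Keep each AsyncResult; .get() re-raises any worker exception here.
        handles = [(n, pool.apply_async(work, (n,))) for n in range(5)]
        for n, h in handles:
            try:
                print(n, h.get(timeout=60))
            except ValueError as exc:
                print(n, "error:", exc)
```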

Comments
