
I have set up a Spark on YARN cluster on my laptop and have a problem running multiple concurrent Spark jobs using Python multiprocessing. I am running in yarn-client mode. I tried two approaches:

  • Set up a single SparkContext and create multiple processes to submit jobs. This does not work: the program crashes. I guess a single SparkContext does not support multiple Python processes.
  • For each process, set up a separate SparkContext and submit the job. In this case, the jobs are submitted to YARN successfully, but they run serially: only one job runs at a time while the rest wait in the queue. Is it possible to start multiple jobs concurrently?

    Update on the settings

    YARN:

  • yarn.nodemanager.resource.cpu-vcores 8

  • yarn.nodemanager.resource.memory-mb 11264
  • yarn.scheduler.maximum-allocation-vcores 1

    Spark:

  • SPARK_EXECUTOR_CORES=1

  • SPARK_EXECUTOR_INSTANCES=2
  • SPARK_DRIVER_MEMORY=1G
  • spark.scheduler.mode = FAIR
  • spark.dynamicAllocation.enabled = true
  • spark.shuffle.service.enabled = true

YARN will only run one job at a time, using 3 containers, 3 vcores, and 3 GB of RAM. So there are ample vcores and memory available for the other jobs, but they are not running.

3 Answers


How many CPUs do you have, and how many are required per job? YARN schedules the jobs and assigns what it can on your cluster: if your job requires 8 CPUs and your system has only 8 CPUs, then other jobs will be queued and run serially.

If you requested 4 per job, you would see 2 jobs run in parallel at any one time.
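As a rough sanity check (assuming the settings posted in the question: one vcore for the ApplicationMaster container plus SPARK_EXECUTOR_INSTANCES=2 executors at SPARK_EXECUTOR_CORES=1 each), you can estimate how many jobs would fit on the node by vcores alone:

```python
# Back-of-the-envelope vcore math (assumes the settings posted in the question).
# In yarn-client mode each job needs one vcore for the ApplicationMaster
# container plus executor_instances * executor_cores for the executors.
def vcores_per_job(executor_instances, executor_cores, am_vcores=1):
    return am_vcores + executor_instances * executor_cores

demand = vcores_per_job(executor_instances=2, executor_cores=1)
print(demand)       # 3 vcores per job (matches the "3 containers, 3 vcores" observed)
print(8 // demand)  # 2 such jobs would fit on an 8-vcore node by CPU alone
```

Since two such jobs would fit by vcores alone, CPU is not necessarily what serializes them; the scheduler's other limits also matter.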


2 Comments

How do I request 4 cores per job for YARN? I have posted my settings above; can you have a look and see if they make sense? Thank you!
You can pass --total-executor-cores 4 to the job submission to limit/increase the number of cores to use on the cluster.

I found the solution https://stackoverflow.com/a/33012538/957352

For a single-machine cluster: in the file

/etc/hadoop/conf/capacity-scheduler.xml

I changed the property

yarn.scheduler.capacity.maximum-am-resource-percent

from 0.1 to 0.5.
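For reference, a sketch of the relevant stanza in capacity-scheduler.xml (value shown is the increased one):

```xml
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>
    Maximum fraction of cluster resources that can be used to run
    ApplicationMasters; this effectively caps the number of
    concurrently running applications.
  </description>
</property>
```

This would explain the serial behaviour: with the default 0.1 and the ~11 GB of node memory posted in the question, only about 1.1 GB is available for ApplicationMasters, enough for roughly one AM container, so only one application can be active at a time.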

Comments


I met the same problem as you, and I solved it by setting .config("spark.executor.cores", '1') in PySpark. Here is my code:

from multiprocessing import Pool
from pyspark.sql import SparkSession

def train(db):
    print(db)
    # Each process builds its own SparkSession; spark.executor.cores=1
    # keeps each job's resource request small enough to run concurrently.
    spark = SparkSession \
        .builder \
        .appName("scene_" + str(db)) \
        .config("spark.executor.cores", '1') \
        .getOrCreate()
    print(spark.createDataFrame([[1.0], [2.0]], ['test_column']).collect())

if __name__ == '__main__':
    p = Pool(10)
    for db in range(10):
        p.apply_async(train, args=(db,))
    p.close()
    p.join()
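One caveat with the snippet above (a generic multiprocessing point, not Spark-specific): p.apply_async discards the AsyncResult, so if train() raises, for example when a SparkSession fails to start, the failure is silent. A minimal sketch of keeping the handles and calling .get() so worker errors surface in the parent process (work() is a hypothetical stand-in for train()):

```python
from multiprocessing import Pool

def work(n):
    # Hypothetical stand-in for train(); raises for one input
    # to demonstrate error propagation back to the parent.
    if n == 3:
        raise ValueError("job %d failed" % n)
    return n * n

if __name__ == '__main__':
    with Pool(4) as pool:
        # Keep each AsyncResult; .get() re-raises any worker exception here.
        handles = [(n, pool.apply_async(work, (n,))) for n in range(5)]
        for n, h in handles:
            try:
                print(n, h.get(timeout=60))
            except ValueError as exc:
                print(n, "error:", exc)
```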

Comments
