
I want to do parallel processing in a for loop using PySpark.

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('yarn').appName('myAppName').getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

data = ["a", "b", "c"]


for i in data:
    try:
        df = spark.read.parquet('gs://' + i + '-data')
        df.createOrReplaceTempView("people")
        df2 = spark.sql("""select * from people""")
        df2.show()
    except Exception as e:
        print(e)
        continue

The above script works fine, but I want to do the processing in parallel in PySpark, which is possible in Scala.


2 Answers


Spark itself runs jobs in parallel, but if you still want parallel execution in your code, you can use simple Python code for parallel processing to do it (this was tested on Databricks only; link).

data = ["a","b","c"]

from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)


def fun(x):
    try:
        df = sqlContext.createDataFrame([(1,2, x), (2,5, "b"), (5,6, "c"), (8,19, "d")], ("st","end", "ani"))
        df.show()
    except Exception as e:
        print(e)

pool.map( fun,data)

I have changed your code a bit, but this is basically how you can run parallel tasks. If you have some flat files that you want to process in parallel, just make a list of their names and pass it into pool.map(fun, data).

Change the function fun as needed; a sketch adapting this to your original parquet-reading loop follows below.
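For example, a minimal sketch, assuming the spark session and the gs://<name>-data bucket naming from your question:

from multiprocessing.pool import ThreadPool

data = ["a", "b", "c"]

def read_and_show(name):
    try:
        # Each thread submits its own Spark job; Spark schedules their tasks
        # across the executors concurrently.
        df = spark.read.parquet('gs://' + name + '-data')
        # Note: a shared temp-view name like "people" would be overwritten
        # across threads, so operate on the DataFrame directly here.
        df.show()
    except Exception as e:
        print(e)

pool = ThreadPool(10)
pool.map(read_and_show, data)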

For more details on the multiprocessing module, check the documentation.

Similarly, if you want to do it in Scala, you will need the following imports:

import scala.concurrent.{Future, Await}

For a more detailed understanding, check this out. The code there is for Databricks, but with a few changes it will work in your environment.


Comments

Will this bring it to the driver node, or will it execute the parallel processing on the multiple worker nodes?
This is parallel execution in the code, not actual parallel execution across the cluster. It is simple Python parallel processing; it does not interfere with Spark's parallelism.
I also think this simply adds threads to the driver node. It doesn't send stuff to the worker nodes. I think Andy_101 is right.
I actually tried this out, and surprisingly it does run the jobs in parallel on the worker nodes, not just the driver! My experiment setup used 200 executors; running 2 jobs in series took 20 mins, and running them in a ThreadPool took 10 mins in total.
I think this does not work. Here's my sketch of proof:

import socket
from multiprocessing.pool import ThreadPool

pool = ThreadPool(10)

def getsock(i):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))
    return s.getsockname()[0]

list(pool.map(getsock, range(10)))

This always gives the same IP address, namely that of the driver. Hence we are not executing on the workers.

Here's a parallel loop in PySpark, using Azure Databricks.

import socket

def getsock(i):
    # Runs on the executors: returns the IP address of the host executing the task.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 80))
    return s.getsockname()[0]

rdd1 = sc.parallelize(list(range(10)))
parallel = rdd1.map(getsock).collect()

On platforms other than Azure Databricks you may need to create the SparkContext sc yourself; on Azure the variable exists by default.

Coding it up like this only makes sense if the code that is executed in parallel (getsock here) contains no code that is already parallel. For instance, had getsock contained code that goes through a PySpark DataFrame, then that code is already parallel, so it would probably not make sense to also "parallelize" that loop. A sketch of a case where the pattern does fit follows below.
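As an illustration, a minimal sketch (the fingerprint function and its inputs are hypothetical) of distributing a plain-Python task over the workers with the same rdd.map pattern; note the mapped function must not touch spark or sc, since those exist only on the driver:

import hashlib

def fingerprint(name):
    # Plain Python only: this function runs on the executors, where no
    # SparkSession/SparkContext is available.
    return (name, hashlib.md5(name.encode()).hexdigest())

names = ["a", "b", "c"]  # hypothetical work items
results = sc.parallelize(names, numSlices=len(names)).map(fingerprint).collect()
print(results)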

