Optimize Pyspark code to run fast

Question

How can I optimize this code so that it runs faster? Can the subtraction be performed in Spark's distributed space? Here the RDD is a collection of dictionaries.

all_actors =["brad", "tom", "abc", "def"]
init_actors=["tom", "abc"]

for i in all_actors:

        dc={}
        d1=bj.filter(lambda x: x['actor']==i).first()
        for j in init_actors:
            d2=centroids.filter(lambda x: x['actor']==j).first()
            dc={key: (d1[key] - d2[key])**2 for key in d1.keys() if key not in 'actor'}
            val=sum([v for v in dc.values()])
            val=math.sqrt(val)

rdd.take(2)

[{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}]

Each dictionary in this RDD has more than 30,000 keys. This is just a sample.

Expected Output:

Find the Euclidean distance between each row in RDD.

I'd still say the same. Using "in" with a string is probably not what you wanted. Either you check whether something is in a collection (list, set, dictionary) or you check j for equality (==). I don't understand your check as is.
– roganjosh, Commented Apr 20, 2018 at 17:56
Does this code give you the desired output outside of pyspark? Don't optimise what doesn't work.
– roganjosh, Commented Apr 20, 2018 at 18:08
This code gives me the desired output in pyspark. But it's taking a lot of time and memory.
– Jerry George, Commented Apr 20, 2018 at 18:11
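
To illustrate the point about "in" from the comments above, here is a small sketch in plain Python. The key name 'act' is hypothetical and is not in the original data. It only shows what could go wrong if one of the 30,000+ keys happened to be a substring of 'actor'.

```python
# Substring check vs. equality check on dictionary keys.
keys = ['actor', 'act', 'good', 'bad']  # 'act' is a hypothetical feature name

# `k not in 'actor'` asks "is k NOT a substring of the string 'actor'?"
kept_by_substring = [k for k in keys if k not in 'actor']

# `k != 'actor'` drops only the label key itself
kept_by_equality = [k for k in keys if k != 'actor']

print(kept_by_substring)  # ['good', 'bad']
print(kept_by_equality)   # ['act', 'good', 'bad']
```

With the sample keys shown in the question ('good', 'bad', 'average') both checks give the same result, which is probably why the code appears to work.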

Quilir · Accepted Answer · 2018-04-20 19:13:11Z

As I understand it, you need the distance between every element from all_actors and every element from init_actors.

I think you should take the cartesian product and then use map to compute all the distances.

all_actors =["brad", "tom", "abc", "def"]
init_actors=["tom", "abc"]

# Create cartesian product of tables
d1=bj.filter(lambda x: x['actor'] in all_actors)
d2=centroids.filter(lambda x: x['actor'] in init_actors)
combinations = d1.cartesian(d2)

Then you just apply a map function that calculates the distance (I am not sure what layout the cartesian result has, so you have to figure out how calculate_euclidean should look).

combinations.map(calculate_euclidean)

Edit: I googled that cartesian produces rows of pairs (x, y), where x and y are the same type as the rows of d1 and d2, so you can just create a function:

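import math

def calculate_euclidean(pair):
    # map passes each (x, y) pair as a single tuple argument, so unpack it here
    x, y = pair
    # squared difference for every feature key; skip only the 'actor' label
    dc={key: (x[key] - y[key])**2 for key in x.keys() if key != 'actor'}
    val=math.sqrt(sum(dc.values()))

    #returning dict, but you can change result row layout if you want
    return {'value': val,
            'actor1': x['actor'],
            'actor2': y['actor']}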

All distance calculations are now distributed, and the repeated filter(...).first() calls, each of which launches a separate Spark job, are gone, so it should run much faster.
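
To sanity-check the distance function without a cluster, the same logic can be tried locally. This is only a sketch: itertools.product stands in for RDD.cartesian, the row values are made-up samples in the shape shown in the question, and the function takes one (x, y) tuple because map hands each element over as a single argument.

```python
import math
from itertools import product

def calculate_euclidean(pair):
    # Each element of the cartesian product is one (x, y) tuple of row dicts
    x, y = pair
    total = sum((x[k] - y[k]) ** 2 for k in x if k != 'actor')
    return {'value': math.sqrt(total),
            'actor1': x['actor'],
            'actor2': y['actor']}

# Hypothetical sample rows, mirroring the layout from rdd.take(2)
bj_rows = [
    {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0},
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
]
centroid_rows = [
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
]

# product() yields the same (x, y) pairs that d1.cartesian(d2) would
distances = [calculate_euclidean(p) for p in product(bj_rows, centroid_rows)]

for row in distances:
    print(row['actor1'], row['actor2'], round(row['value'], 3))
```

For these sample rows the brad/tom distance is sqrt(3) and the tom/tom distance is 0. On the real RDDs, combinations.map(calculate_euclidean) is lazy, so an action such as collect() or take(n) is still needed to get the results back.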
