
How can I optimize this code and make it faster? Can the subtraction be performed in Spark's distributed space? Here each RDD is a collection of dictionaries.

import math

all_actors = ["brad", "tom", "abc", "def"]
init_actors = ["tom", "abc"]

for i in all_actors:
    # Each first() call launches a separate Spark job
    d1 = bj.filter(lambda x: x['actor'] == i).first()
    for j in init_actors:
        d2 = centroids.filter(lambda x: x['actor'] == j).first()
        # Squared difference per key
        dc = {key: (d1[key] - d2[key])**2 for key in d1.keys() if key not in 'actor'}
        val = sum([v for v in dc.values()])
        val = math.sqrt(val)

rdd.take(2)

[{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}]

Each dictionary in this RDD has more than 30,000 keys. This is just a sample.

Expected Output:

The Euclidean distance between each pair of rows in the RDD.
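For the two sample rows shown above, the expected value can be worked out by hand. A minimal plain-Python sketch (no Spark involved; the helper name euclidean is only illustrative) might look like this:

```python
import math

# The two sample rows from rdd.take(2)
brad = {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0}
tom = {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1}

def euclidean(a, b):
    """Euclidean distance over every key except the 'actor' label."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a if k != 'actor'))

# Squared differences are 1, 1 and 1, so the distance is sqrt(3)
distance = euclidean(brad, tom)
```

The same arithmetic applies unchanged to rows with 30,000+ keys, as long as both rows carry the same keys.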

  • if key not in 'name' I think that might be a logic issue Commented Apr 20, 2018 at 17:48
  • Sorry, It was 'actor' updated it Commented Apr 20, 2018 at 17:53
  • I'd still say the same. in a string is probably not what you wanted. Either you check whether something is in a collection (list, set, dictionary) or you check j for equality (==). I don't understand your check as is. Commented Apr 20, 2018 at 17:56
  • Does this code give you the desired output outside of pyspark? Don't optimise what doesn't work. Commented Apr 20, 2018 at 18:08
  • This code gives me the desired output in pyspark, but it's taking a lot of time and memory. Commented Apr 20, 2018 at 18:11
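The point raised in these comments can be reproduced in plain Python: with a string on the right-hand side, in performs a substring test, so any key that happens to be a substring of 'actor' is skipped too. A small sketch, where the keys 'act' and 'or' are made up purely for illustration:

```python
# A row with two hypothetical feature keys that are substrings of 'actor'
row = {'actor': 'brad', 'good': 1, 'act': 5, 'or': 7}

# Substring test: drops 'actor', but also 'act' and 'or'
kept_substring = [k for k in row if k not in 'actor']

# Equality test: drops only the 'actor' label
kept_equality = [k for k in row if k != 'actor']
```

With the sample keys ('good', 'bad', 'average') both checks happen to agree, which would explain why the original code still produced the desired output.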

1 Answer


As I understand it, you need the distance between every element of all_actors and every element of init_actors.

I think you should take the Cartesian product of the two RDDs and then use map to compute all the distances.

all_actors =["brad", "tom", "abc", "def"]
init_actors=["tom", "abc"]

# Create cartesian product of tables
d1=bj.filter(lambda x: x['actor'] in all_actors)
d2=centroids.filter(lambda x: x['actor'] in init_actors)
combinations = d1.cartesian(d2)

Then you just apply a map function that calculates the distance (I am not sure what layout the cartesian result has, so you will have to figure out what calculate_euclidean should look like).

combinations.map(calculate_euclidean)       

Edit: I looked it up, and cartesian produces rows of pairs (x, y), where x is an element of d1 and y is an element of d2, so you can just create a function:

import math

def calculate_euclidean(pair):
    # map passes each cartesian element as a single (x, y) tuple
    x, y = pair
    # Squared difference per key, skipping the 'actor' label
    dc = {key: (x[key] - y[key])**2 for key in x.keys() if key != 'actor'}
    val = math.sqrt(sum(dc.values()))

    # returning a dict, but you can change the result row layout if you want
    return {'value': val,
            'actor1': x['actor'],
            'actor2': y['actor']}

All of the distance calculations are now distributed across the cluster, so this should run much faster than filtering and calling first() once per actor on the driver.
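The distance function can be sanity-checked locally before running it on the cluster. The sketch below is a stand-in, not Spark code: itertools.product plays the role of cartesian (both yield (x, y) pairs), the lists d1_rows and d2_rows are hypothetical sample data, and the function is written out in full so the block runs on its own. Note that map hands each element to the function as a single (x, y) tuple, so the function unpacks it.

```python
import itertools
import math

def calculate_euclidean(pair):
    # Each element arrives as one (x, y) tuple
    x, y = pair
    total = sum((x[key] - y[key]) ** 2 for key in x if key != 'actor')
    return {'value': math.sqrt(total),
            'actor1': x['actor'],
            'actor2': y['actor']}

# Hypothetical stand-ins for the filtered RDDs d1 and d2
d1_rows = [
    {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0},
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
]
d2_rows = [{'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1}]

# product() yields one (x, y) tuple per combination, like cartesian
results = [calculate_euclidean(pair)
           for pair in itertools.product(d1_rows, d2_rows)]
```

On the cluster, the corresponding call would be combinations.map(calculate_euclidean).collect(), which brings only the small result dictionaries back to the driver.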
