How do I optimize this code and make it fast? Can the subtraction be performed in Spark's distributed space? Here rdd is a collection of dictionaries.
all_actors =["brad", "tom", "abc", "def"]
init_actors=["tom", "abc"]
for i in all_actors:
    dc = {}
    d1 = bj.filter(lambda x: x['actor'] == i).first()
    for j in init_actors:
        d2 = centroids.filter(lambda x: x['actor'] == j).first()
        dc = {key: (d1[key] - d2[key])**2 for key in d1.keys() if key not in 'actor'}
        val = sum([v for v in dc.values()])
        val = math.sqrt(val)
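To make the intended computation concrete, here is a self-contained sketch of the per-pair Euclidean distance. It uses plain Python with itertools.product standing in for a Spark cartesian join, and hypothetical local lists in place of the bj and centroids RDDs; in pyspark the same function could be applied distributedly via bj.cartesian(centroids).map(distance), assuming both RDDs hold dicts with an 'actor' key and numeric features:

```python
import math
from itertools import product  # local stand-in for rdd.cartesian(...)

# Hypothetical local stand-ins for the bj and centroids RDDs
rows = [
    {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0},
    {'actor': 'tom',  'good': 0, 'bad': 1, 'average': 1},
]
centroids = [
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
]

def distance(pair):
    # Euclidean distance over all numeric keys, excluding the 'actor' label
    d1, d2 = pair
    s = sum((d1[k] - d2[k]) ** 2 for k in d1 if k != 'actor')
    return (d1['actor'], d2['actor'], math.sqrt(s))

# In pyspark this would be: bj.cartesian(centroids).map(distance)
dists = [distance(p) for p in product(rows, centroids)]
print(dists)
```

The point of the map-over-cartesian shape is that no driver-side for-loop with .first() calls is needed; each pair's distance is computed where the data lives.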
rdd.take(2)
[{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}]
This RDD has around 30,000+ keys in each dictionary; the above is just a sample.
Expected Output:
The Euclidean distance between each row in the RDD and each centroid.
if key not in 'actor' — I think that might be a logic issue: in on a string is probably not what you wanted. Either you check whether something is in a collection (list, set, dictionary) or you check for equality (==). I don't understand your check as written. Is this pyspark? Don't optimise what doesn't work.
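To illustrate the point about in on a string (a minimal sketch with made-up keys):

```python
# 'in' against a string tests substring membership, not key equality:
print('a' in 'actor')   # True: 'a' is a substring of 'actor'
print('a' == 'actor')   # False: equality is what the key filter needs

# So a key such as 'act' would be silently dropped by `key not in 'actor'`:
d1 = {'actor': 'brad', 'act': 5, 'good': 1}
d2 = {'actor': 'tom', 'act': 2, 'good': 0}
wrong = {k: (d1[k] - d2[k]) ** 2 for k in d1 if k not in 'actor'}
right = {k: (d1[k] - d2[k]) ** 2 for k in d1 if k != 'actor'}
print(wrong)  # {'good': 1} -- 'act' wrongly excluded
print(right)  # {'act': 9, 'good': 1}
```

With 30,000+ keys, any key that happens to be a substring of 'actor' would silently vanish from the distance, so the inequality form is the safe one.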