I have a dataset of large strings (text extracted from ~300 pptx files). Using pandas apply, I run an "average" function on each string: for every word it looks up the corresponding word vector, compares it with a fixed search vector, and returns the average cosine similarity.
However, iterating and applying this function over large strings takes a lot of time, and I was wondering what approaches I could take to speed up the following code:
from scipy import spatial

# retrieve the word vector for a word from the words dataframe
def vec(w):
    return words.at[w]

# cosine similarity (1 - cosine distance) between two vectors
def cosine_dist(a, b):
    codi = 1 - spatial.distance.cosine(a, b)
    return codi

# average cosine similarity between every word in the string and a given search vector
v_search = vec("test")

def average(v_search, tobe_parsed):
    word_total = 0
    mean = 0
    for word in tobe_parsed.split():
        try:  # word exists
            cd = cosine_dist(vec(word), v_search)
            mean += cd
            word_total += 1
        except KeyError:  # word does not exist
            pass
    return mean / word_total

df['average'] = df['text'].apply(lambda x: average(v_search, x))
I've been looking into alternative ways of writing the code (e.g. df.loc -> df.at), Cython and multithreading, but my time is limited, so I don't want to sink too much of it into a less effective approach.
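For reference, the vectorised direction I've been considering looks roughly like the sketch below. It's only a sketch and untested on my data: it assumes words behaves like a Series mapping each word to its vector, and vocab, emb and average_fast are placeholder names I made up.

import numpy as np

# Assumption: words maps each word (index) to its vector, so the whole
# lookup table can be pulled into a single numpy matrix up front.
vocab = {w: i for i, w in enumerate(words.index)}
emb = np.vstack(words.to_numpy())                       # shape (n_words, dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise once

v_search = emb[vocab["test"]]

def average_fast(text):
    idx = [vocab[w] for w in text.split() if w in vocab]
    if not idx:
        return np.nan
    # for unit-length vectors the dot product equals the cosine similarity,
    # so one matrix-vector product replaces the per-word Python loop
    return (emb[idx] @ v_search).mean()

df['average'] = df['text'].apply(average_fast)

The idea is that one row lookup plus one matrix-vector product in numpy should be much cheaper than a Python-level loop with a .at lookup per word, but I'm not sure whether this or Cython/multithreading is the better use of my time.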
Thanks in advance