
I have a dataset consisting of large strings (text extracted from ~300 pptx files). Using pandas apply I run an "average" function on each string: for every word it looks up the corresponding word vector, compares it to a fixed search vector with a cosine measure, and returns the average similarity over the string.

However, iterating over and applying the function to these large strings takes a lot of time, and I was wondering what approaches I could take to speed up the following code:

from scipy import spatial

# retrieve the word vector for a word from the words DataFrame
def vec(w):
    return words.at[w]

# cosine similarity between two vectors (1 - scipy's cosine distance)
def cosine_dist(a, b):
    return 1 - spatial.distance.cosine(a, b)

# average cosine similarity of every word in the string against a given word vector
v_search = vec("test")

def Average(v_search, tobe_parsed):
    word_total = 0
    total = 0
    for word in tobe_parsed.split():
        try:  # word exists in the vocabulary
            total += cosine_dist(vec(word), v_search)
            word_total += 1
        except KeyError:  # word does not exist
            pass
    return total / word_total

df['average'] = df['text'].apply(lambda x: Average(v_search, x))

I've been looking into alternative ways of writing the code (e.g. df.loc -> df.at), Cython, and multithreading, but my time is limited so I don't want to waste too much of it on a less effective approach.

Thanks in advance

2 Answers


You need to leverage vectorization and NumPy broadcasting. Have pandas return the list of word indices, use them to index the vocabulary array and build a matrix of word vectors (one row per word), then use broadcasting to compute the cosine distances and take their mean.
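For illustration, here is a minimal sketch of that idea, assuming (as in the question) that words is a DataFrame of word vectors indexed by word and v_search is the query vector; the name average_similarity is just illustrative:

import numpy as np

def average_similarity(text, words, v_search):
    # keep only words that exist in the vocabulary, then fetch all their vectors at once
    tokens = [w for w in text.split() if w in words.index]
    wordvec_matrix = words.loc[tokens].to_numpy()   # shape: (n_words, dim)
    v = np.asarray(v_search, dtype=float)           # shape: (dim,)

    # broadcasting: dot every row with v, divide by the product of the norms
    sims = wordvec_matrix @ v / (np.linalg.norm(wordvec_matrix, axis=1) * np.linalg.norm(v))
    return sims.mean()   # note: empty texts would need a guard

# df['average'] = df['text'].apply(lambda t: average_similarity(t, words, v_search))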


Thanks a lot vumaasha! That was indeed the way to go (speed increase from ~15 min to ~7 sec! :o)

Basically the code has been rewritten to:

def Average(v_search, text):
    # look up all word vectors at once (one row per word), then average
    # their cosine similarity to v_search
    wordvec_matrix = words.loc[text.split()]
    return np.sum(cos_cdist(wordvec_matrix, v_search)) / wordvec_matrix.shape[0]

df['average'] = df['text'].apply(lambda x: Average(v_search, x))
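(cos_cdist is not defined in the snippet above; a minimal version, assuming it should return the cosine similarity of each matrix row to v_search, as the original loop did, could be:)

def cos_cdist(matrix, vector):
    # cosine similarity of every row of matrix to vector, using scipy's cdist
    return 1 - spatial.distance.cdist(matrix, np.atleast_2d(vector), 'cosine').ravel()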

2 Comments

you could also use np.mean
Obviously.. :s Thanks vumaasha
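For reference, the np.mean suggestion from the comment simply replaces the manual sum divided by the row count:

    return np.mean(cos_cdist(wordvec_matrix, v_search))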
