I have a dataset of large strings (text extracted from ~300 pptx files). Using pandas apply, I run an "average" function on each string: for every word it looks up the corresponding word vector, compares it with a fixed search vector, and returns the average cosine similarity.
However, iterating and applying this function over large strings takes a lot of time, and I was wondering what approaches I could take to speed up the following code:
from scipy import spatial

# retrieve the word vector for a word from the words dataframe
def vec(w):
    return words.at[w]

# cosine similarity (1 - cosine distance) between two vectors
def cosine_dist(a, b):
    codi = 1 - spatial.distance.cosine(a, b)
    return codi

# average cosine similarity between every word in the string and a given search vector
v_search = vec("test")

def average(v_search, tobe_parsed):
    word_total = 0
    mean = 0
    for word in tobe_parsed.split():
        try:  # word exists
            cd = cosine_dist(vec(word), v_search)
            mean += cd
            word_total += 1
        except KeyError:  # word does not exist
            pass
    return mean / word_total

df['average'] = df['text'].apply(lambda x: average(v_search, x))
I've been looking into alternative ways of writing the code (e.g. df.loc -> df.at), Cython and multithreading, but my time is limited, so I don't want to sink too much of it into a less effective approach.
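For reference, the vectorised direction I've been considering looks roughly like the sketch below. It's only a sketch and untested on my data: it assumes words behaves like a Series mapping each word to its vector, and vocab, emb and average_fast are placeholder names I made up.

import numpy as np

# Assumption: words maps each word (index) to its vector, so the whole
# lookup table can be pulled into a single numpy matrix up front.
vocab = {w: i for i, w in enumerate(words.index)}
emb = np.vstack(words.to_numpy())                       # shape (n_words, dim)
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalise once

v_search = emb[vocab["test"]]

def average_fast(text):
    idx = [vocab[w] for w in text.split() if w in vocab]
    if not idx:
        return np.nan
    # for unit-length vectors the dot product equals the cosine similarity,
    # so one matrix-vector product replaces the per-word Python loop
    return (emb[idx] @ v_search).mean()

df['average'] = df['text'].apply(average_fast)

The idea is that one row lookup plus one matrix-vector product in numpy should be much cheaper than a Python-level loop with a .at lookup per word, but I'm not sure whether this or Cython/multithreading is the better use of my time.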
Thanks in advance