0

Here is my code:

import pandas as pd
from nltk.corpus import wordnet

df = pd.DataFrame({'col_1': ['desk', 'apple', 'run']})
df['synset'] = df.col_1.apply(lambda x: wordnet.synsets(x))

The above code runs fairly slow on 4 core pc with 16 GB ram. I was hoping to speed up and run it on Google Cloud instance with 24 cores and 120 GB ram. And still was running slow (maybe twice as fast as before). And Google Console was showing that only 4.1 cores are utilized.

So I am curios: does Pandas runs computations for each row in parallel? If it does, then I am guessing nltk is a bottleneck here. Can anybody confirm or correct my guesses?

P.S. The above code is just a sample, real dataframe has 100k rows.

1 Answer 1

1

pandas does not parallelize apply. You should define a custom function that runs on each row instead of your lambda function, then use multiprocessing to work on that and resync it with your dataframe.

def my_func(i):
    #some work with i as index
    return (i,result)
from multiprocessing import Pool
pool = Pool(24)
res=pool.imap(my_func,df.index)
for t in res:
    df.set_value(t[0],"New column",t[1])
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.