pandas and parallel computations and external library

Question

Here is my code:

import pandas as pd
from nltk.corpus import wordnet

df = pd.DataFrame({'col_1': ['desk', 'apple', 'run']})
df['synset'] = df.col_1.apply(lambda x: wordnet.synsets(x))

The above code runs fairly slowly on a 4-core PC with 16 GB of RAM. I was hoping to speed it up by running it on a Google Cloud instance with 24 cores and 120 GB of RAM, but it still ran slowly (maybe twice as fast as before), and the Google console showed that only 4.1 cores were utilized.

So I am curious: does pandas run the computation for each row in parallel? If it does, then I am guessing nltk is the bottleneck here. Can anybody confirm or correct my guesses?

P.S. The above code is just a sample; the real dataframe has 100k rows.

baloo · Accepted Answer · 2017-07-13 22:23:06Z


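pandas does not parallelize apply: it calls your function on one row at a time in a single process, so extra cores sit idle. Define a named function that does the work for one row in place of your lambda, run it over the rows with multiprocessing, and then write the results back to your dataframe by index.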

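import pandas as pd
from multiprocessing import Pool

def my_func(i):
    # Do the per-row work here, using i as the row index.
    result = None  # placeholder for the real computation
    return (i, result)

if __name__ == '__main__':
    with Pool(24) as pool:  # 24 = number of cores on the instance
        # imap yields the (index, result) pairs in index order
        res = dict(pool.imap(my_func, df.index))
    # Align the results on the index (df.set_value was removed in pandas 1.0)
    df['New column'] = pd.Series(res)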
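
A minimal sketch of how the same idea could look for the wordnet lookup in the question, not a definitive implementation. The names lookup_synsets, parallel_map and add_synsets are hypothetical helpers introduced here. The worker returns synset names (strings) rather than Synset objects, on the assumption that plain strings are cheaper and safer to send back from worker processes, and the chunksize default is a guess that should be tuned.

```python
from multiprocessing import Pool


def lookup_synsets(word):
    """Worker: return the WordNet synset names for one word."""
    # Imported inside the worker so each process loads the corpus itself.
    from nltk.corpus import wordnet
    # Assumption: names (plain strings) are returned instead of Synset objects.
    return [s.name() for s in wordnet.synsets(word)]


def parallel_map(func, items, processes=None, chunksize=1000):
    """Apply func to every item, spreading the work over worker processes.

    Results come back in the same order as the input items.
    """
    items = list(items)
    # With one process (or fewer than two items) a pool is pure overhead,
    # so run the calls inline instead.
    if processes == 1 or len(items) < 2:
        return [func(x) for x in items]
    with Pool(processes) as pool:
        # chunksize batches items so workers are not fed one word at a time.
        return pool.map(func, items, chunksize=chunksize)


def add_synsets(df, column="col_1", processes=None):
    """Add a 'synset' column with the synset names for each word in column."""
    df["synset"] = parallel_map(lookup_synsets, df[column], processes)
    return df
```

On platforms that start workers by launching a fresh interpreter (Windows, macOS), call add_synsets(df) from under an if __name__ == '__main__': guard. Leaving processes as None lets Pool pick a worker count based on the CPUs available, which would be the 24 cores on the cloud instance.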

