I am used to the map and starmap Pool methods to distribute a FUNCTION over any kind of iterable. Here is how I typically extract stem words from the raw-content column of a pandas DataFrame:

import multiprocessing as mp

cpu_nb = mp.cpu_count()
pool = mp.Pool(cpu_nb)
totalvocab_stemmed = pool.map(tokenize_and_stem, site_df["raw_content"])
pool.close()

(There is a good article on function parallelization in Python that covers this pattern.)

So far so good. But is there a nice and easy way to parallelize the execution of sklearn METHODS? Here is an example of what I would like to distribute:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.6, max_features=200000,
                                   min_df=0.2, stop_words=stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

tfidf_matrix = tfidf_vectorizer.fit_transform(site_df["raw_content"])

tfidf_matrix is not an element-by-element list, so splitting site_df["raw_content"] into as many chunks as I have CPU cores, running a good old-fashioned pool, and stacking everything back together afterwards is not an option. I saw some interesting options:

  • the IPython.parallel Client
  • use the parallel_backend function of sklearn.externals.joblib as a context manager (sketched below)

I might be dumb, but I wasn't very successful with either attempt. How would you do this?
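
For reference, here is roughly what the second attempt (the joblib context manager) looked like. This is only a sketch: it reuses tfidf_vectorizer, site_df and cpu_nb from above, and the "multiprocessing" backend name is just one possible choice. In recent sklearn versions parallel_backend is imported from plain joblib rather than sklearn.externals.joblib.

from joblib import parallel_backend  # was sklearn.externals.joblib in older versions

# route any joblib-based parallelism inside the block through the chosen backend
with parallel_backend("multiprocessing", n_jobs=cpu_nb):
    tfidf_matrix = tfidf_vectorizer.fit_transform(site_df["raw_content"])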

  • See stackoverflow.com/questions/28396957/… — you can just parallelize the transform step afterwards, but the fitting needs to happen in a single process, I think. (Commented Feb 17, 2019 at 16:48)
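
To illustrate the suggestion in that comment, a minimal sketch of fitting once in a single process and then parallelizing only the transform step could look like the following. It assumes the fitted vectorizer (including the module-level tokenize_and_stem tokenizer) is picklable so it can be shipped to the worker processes, and it reuses site_df and cpu_nb from the question.

import multiprocessing as mp
import numpy as np
import scipy.sparse as sp

# fit the vocabulary and idf weights once, in a single process
tfidf_vectorizer.fit(site_df["raw_content"])

# split the column into one chunk per core and transform the chunks in parallel
chunks = np.array_split(site_df["raw_content"], cpu_nb)
with mp.Pool(cpu_nb) as pool:
    parts = pool.map(tfidf_vectorizer.transform, chunks)

# stack the per-chunk sparse matrices back into a single matrix
tfidf_matrix = sp.vstack(parts)

Because every chunk is transformed with the same fitted vocabulary and the results are stacked in order, this should give the same matrix as a single-process fit_transform on the full column.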
