I have a large list of HTTP user agent strings (taken from a pandas DataFrame) that I am trying to parse using the Python implementation of ua-parser. I can parse the list fine in a single thread, but based on some preliminary speed testing, it would take well over 10 hours to run the whole dataset.
I am trying to use pool.map() to decrease processing time but can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched SO (likely a duplicate of some sort, as there are a lot of similar questions), but none of the dozens of attempts have worked for one reason or another. I'm assuming/hoping it's an easy fix.
Here is what I have so far:
    import multiprocessing as mp
    from ua_parser import user_agent_parser

    http_str = df['user_agents'].tolist()

    def uaparse(http_str):
        for i, item in enumerate(http_str):
            return user_agent_parser.Parse(http_str[i])

    pool = mp.Pool(processes=10)
    parsed = pool.map(uaparse, range(0, len(http_str)))
Right now I'm seeing the following error message:
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-25-701fbf58d263> in <module>()
          7 
          8 pool = mp.Pool(processes=10)
    ----> 9 results = pool.map(uaparse, range(0,len(http_str)))

    /home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
        249         '''
        250         assert self._state == RUN
    --> 251         return self.map_async(func, iterable, chunksize).get()
        252 
        253     def imap(self, func, iterable, chunksize=1):

    /home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
        565             return self._value
        566         else:
    --> 567             raise self._value
        568 
        569     def _set(self, i, obj):

    TypeError: 'int' object is not iterable
Thanks in advance for any assistance/direction you can provide.
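A note on the traceback itself: `pool.map()` calls the function once per element of the iterable it is given, so with `range(0, len(http_str))` each call to `uaparse` receives a single `int`, and `enumerate()` on that `int` raises the `TypeError`. A minimal sketch of one way around it, not tested against the real dataset, is to pass the strings themselves and parse one string per call. `parse_one` and `parse_all` are hypothetical names, and the body of `parse_one` is a stand-in so the sketch runs without ua-parser installed; in real use it would presumably just return `user_agent_parser.Parse(ua_string)`. The `chunksize` default is an arbitrary guess.

```python
import multiprocessing as mp


def parse_one(ua_string):
    # Hypothetical stand-in for: return user_agent_parser.Parse(ua_string)
    # It only splits off the product token so the sketch is self-contained.
    family = ua_string.split("/", 1)[0]
    return {"string": ua_string, "family": family}


def parse_all(ua_strings, processes=2, chunksize=1000):
    # pool.map hands each worker call ONE element of ua_strings,
    # so the worker takes a single string, not the whole list or an index.
    pool = mp.Pool(processes=processes)
    try:
        return pool.map(parse_one, ua_strings, chunksize=chunksize)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    sample = ["Mozilla/5.0 (X11; Linux x86_64)", "curl/7.68.0"]
    parsed = parse_all(sample)
    print(parsed)
```

The worker is kept as a module-level function so it can be pickled and sent to the worker processes, and `chunksize` sends the strings in batches rather than one task per string, which should cut inter-process overhead on a large list.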