I have a large list of HTTP user agent strings (taken from a pandas DataFrame) that I am trying to parse using the Python implementation of ua-parser. I can parse the list fine when using a single process, but based on some preliminary speed testing, it would take well over 10 hours to run the whole dataset.

I am trying to use pool.map() to decrease processing time but can't quite seem to figure out how to get it to work. I've read about a dozen 'tutorials' that I found online and have searched SO (likely a duplicate of some sort, as there are a lot of similar questions), but none of the dozens of attempts have worked for one reason or another. I'm assuming/hoping it's an easy fix.

Here is what I have so far:

import multiprocessing as mp
from ua_parser import user_agent_parser

http_str = df['user_agents'].tolist()

def uaparse(http_str):
    for i, item in enumerate(http_str):
        return user_agent_parser.Parse(http_str[i])

pool = mp.Pool(processes=10)
parsed = pool.map(uaparse, range(0, len(http_str)))

Right now I'm seeing the following error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-701fbf58d263> in <module>()
      7 
      8 pool = mp.Pool(processes=10)
----> 9 results = pool.map(uaparse, range(0,len(http_str)))

/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize)
    249         '''
    250         assert self._state == RUN
--> 251         return self.map_async(func, iterable, chunksize).get()
    252 
    253     def imap(self, func, iterable, chunksize=1):

/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    565             return self._value
    566         else:
--> 567             raise self._value
    568 
    569     def _set(self, i, obj):

TypeError: 'int' object is not iterable

Thanks in advance for any assistance/direction you can provide.

1 Answer

It seems like all you need is:

import multiprocessing as mp
from ua_parser import user_agent_parser

http_str = df['user_agents'].tolist()

pool = mp.Pool(processes=10)
parsed = pool.map(user_agent_parser.Parse, http_str)
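
To see why the original attempt raised the TypeError: pool.map calls the function once per element of the iterable, so uaparse was handed the integers 0, 1, 2, ... and then tried to enumerate() an int. Here is a minimal sketch of that behaviour, assuming Python 3. parse_one is a hypothetical stand-in for user_agent_parser.Parse so the snippet runs without ua-parser installed, and the three sample strings are made up for illustration.

```python
import multiprocessing as mp


def parse_one(ua_string):
    """Hypothetical stand-in for user_agent_parser.Parse: takes ONE string."""
    return {"string": ua_string, "family": ua_string.split("/")[0]}


def uaparse_original(http_str):
    """The worker from the question: it tries to iterate over its argument."""
    for i, item in enumerate(http_str):
        return parse_one(http_str[i])


# Made-up sample data standing in for df['user_agents'].tolist()
user_agents = ["Mozilla/5.0 (X11; Linux x86_64)", "curl/7.68.0", "Wget/1.20.3"]

# The builtin map() has the same per-element calling convention as pool.map(),
# so the worker receives a single int here and enumerate() fails on it.
try:
    list(map(uaparse_original, range(len(user_agents))))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the strings themselves to a function that handles one string works.
serial = list(map(parse_one, user_agents))

if __name__ == "__main__":
    # Same call shape as the fix above, spread across worker processes.
    with mp.Pool(processes=2) as pool:
        parsed = pool.map(parse_one, user_agents)
    print(error_message)
    print(parsed == serial)
```

The if __name__ == "__main__": guard is needed on platforms that start worker processes by re-importing the script. For a list as large as yours it may also be worth experimenting with the chunksize argument of pool.map, though I have not measured how much it helps for this parser.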

1 Comment

Thank you! Never imagined it would be that simple.
