
I often hit problems where I want to do something simple over a set of many, many objects quickly. My natural choice is IPython Parallel for its simplicity, but often I have to deal with unpicklable objects. After trying for a few hours I usually resign myself to running my tasks overnight on a single computer, or do something clumsy like dividing the work semi-manually to run in multiple Python scripts.

To give a concrete example, suppose I want to delete all keys in a given S3 bucket.

What I'd normally do without thinking is:

import boto
from IPython.parallel import Client

# awskey / awssec are placeholders for real credentials
connection = boto.connect_s3(awskey, awssec)
bucket = connection.get_bucket('mybucket')

client = Client()
loadbalancer = client.load_balanced_view()

keyList = list(bucket.list())
loadbalancer.map(lambda key: key.delete(), keyList)

The problem is that boto's Key object is unpicklable (*). This comes up for me very often, in many different contexts. It's a problem with multiprocessing, execnet, and every other framework and library I've tried too (for obvious reasons: they all use the same pickler to serialize objects).

Do you guys also run into these problems? Is there a way I can serialize these more complex objects? Do I have to write my own pickler for these particular objects? If so, how do I tell IPython Parallel to use it? How do I write a pickler?
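As a sketch of what "writing your own pickler" can look like: Python's copyreg module (copy_reg in Python 2) lets you register a reduction function for a class you don't control, telling pickle how to rebuild an instance from plain data. The Key class below is a hypothetical stand-in, not boto's:

```python
import copyreg
import pickle

class Key(object):
    """Hypothetical stand-in for a third-party class you don't control."""
    def __init__(self, name):
        self.name = name

def reduce_key(key):
    # Tell pickle to rebuild a Key from just its picklable state.
    return (Key, (key.name,))

# Register the reduction for all Key instances.
copyreg.pickle(Key, reduce_key)

restored = pickle.loads(pickle.dumps(Key("myobject")))
print(restored.name)  # myobject
```

Whether this is enough in practice depends on how much live state (connections, file handles) the object carries; anything not captured by the reduction function has to be rebuilt on the receiving side.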

Thanks!


(*) I'm aware that I can simply make a list of the key names and do something like this:

loadbalancer.map(lambda keyname: getKey(keyname).delete(), keyNameList)

and define the getKey function on each engine of the IPython cluster. But this is just a particular instance of a more general problem that I hit often. Maybe it's a bad example, since it can easily be solved another way.
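For completeness, the shape of that workaround, with a hypothetical in-memory dict standing in for the S3 bucket so the sketch is self-contained (on a real cluster, the delete function would open its own boto connection on each engine):

```python
# Hypothetical stand-in for the bucket: maps key names to stored data.
BUCKET = {"a.txt": b"data", "b.txt": b"more"}

def delete_by_name(keyname):
    # Only the plain (picklable) string travels to the worker;
    # the live Key object never has to cross the wire.
    BUCKET.pop(keyname, None)

# Stand-in for loadbalancer.map(delete_by_name, keyNames):
for name in list(BUCKET):
    delete_by_name(name)

print(BUCKET)  # {}
```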

Comment: I'm sure the tasks are embarrassed enough without you making fun of them on SO! – Commented Mar 5, 2013 at 15:23

2 Answers


IPython has a use_dill option: if you have the dill serializer installed, you can serialize most "unpicklable" objects.

How can I use dill instead of pickle with load_balanced_view
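To see why dill helps here: the standard pickler refuses the very lambda passed to map above. A quick self-contained check (dill itself is omitted since it's a third-party install; dill.dumps(f) would succeed where pickle fails):

```python
import pickle

f = lambda key: key  # the kind of callable handed to load_balanced_view.map

try:
    pickle.dumps(f)
    plain_pickle_ok = True
except (pickle.PicklingError, AttributeError):
    # CPython raises PicklingError for module-level lambdas, and
    # AttributeError ("Can't pickle local object") for nested ones.
    plain_pickle_ok = False

print("plain pickle handled the lambda:", plain_pickle_ok)
```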



That IPython sure brings people together ;). From what I've been able to gather, the problem with pickling these objects is their methods. So instead of using the key's own method to delete it, you could write a function that takes a key and deletes it. Maybe first get a list of dicts with the relevant information on each key, and afterwards call a function delete_key(dict), which I leave up to you to write because I've no idea how to handle S3 keys.
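The idea can be demonstrated with a stand-in class (hypothetical, not boto's Key): an object holding a live resource won't pickle, but a dict of its plain fields will:

```python
import pickle
import threading

class LiveKey(object):
    """Hypothetical stand-in for boto's Key."""
    def __init__(self, name):
        self.name = name
        self._conn = threading.Lock()  # like a live connection: unpicklable

key = LiveKey("myobject")
try:
    pickle.dumps(key)
    whole_object_ok = True
except (TypeError, pickle.PicklingError, AttributeError):
    # e.g. TypeError: cannot pickle '_thread.lock' object
    whole_object_ok = False

# A dict of just the relevant fields round-trips fine:
info = {"name": key.name}
assert pickle.loads(pickle.dumps(info)) == info
print("whole object picklable:", whole_object_ok)
```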

Would that work?


Alternatively, this might work: instead of calling the method on the instance, call the method on the class with the instance as an argument. So instead of lambda key: key.delete() you would do lambda key: Key.delete(key). Of course you then have to push the class to the engines, but that shouldn't be a problem. A minimal example:

class stuff(object):
    def __init__(self, a=1):
        self.list = []
    def append(self, a):
        self.list.append(a)

import IPython.parallel as p
c = p.Client()
dview = c[:]

li = map(stuff, [[]] * 10)  # creates 10 stuff instances

dview.map(lambda x: x.append(1), li)  # should append 1 to all lists, but fails

dview.push({'stuff': stuff})  # push the class to the engines
dview.map(lambda x: stuff.append(x, 1), li)  # this works

