
I'm trying to do a MapReduce-like operation using PySpark. Here is what I have, and my problem.

object_list = list(objects)  # this is precomputed earlier in my script

def my_map(obj):
    return [f(obj)]

def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2

What I am trying to do is something like the following:

myrdd = rdd(object_list)  # objects are now spread out
myrdd.map(my_map)
myrdd.reduce(my_reduce)
my_result = myrdd.result()

where my_result should now just be [f(obj1), f(obj2), ..., f(objn)]. I want to use Spark purely for speed; my script has been taking too long doing this in a for loop. Does anyone know how to do the above in Spark?

1 Answer


It would usually look like this:

myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).reduce(lambda a, b: a + b)
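Under the hood, map applies f to every element and reduce combines the results pairwise with +. Assuming f returns values that support + (numbers, say), this is the distributed equivalent of plain Python:

from functools import reduce
my_result = reduce(lambda a, b: a + b, map(f, object_list))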

There is a sum function for RDDs, so this could also be:

myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).sum()

However, this will give you a single number: f(obj1) + f(obj2) + ...
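For instance, assuming f squares its input:

sc.parallelize([1, 2, 3]).map(lambda x: x * x).sum()  # 1 + 4 + 9 = 14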

If you want an array of all the responses [f(obj1), f(obj2), ...], you would not use .reduce() or .sum() but instead use .collect():

myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).collect()
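For completeness, here is a minimal end-to-end sketch; the SparkContext setup and the toy f (squaring) are assumptions for illustration, so substitute your own f:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse an existing context if one is running

def f(obj):
    # stand-in for your real per-object computation
    return obj * obj

object_list = [1, 2, 3, 4]

myrdd = sc.parallelize(object_list)  # distribute the objects across workers
my_result = myrdd.map(f).collect()   # apply f in parallel, gather all results
print(my_result)  # [1, 4, 9, 16]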