
I'm trying to do a MapReduce-like operation using PySpark. Here is what I have, and my problem.

object_list = list(objects)  # this is precomputed earlier in my script

def my_map(obj):
    return [f(obj)]

def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2

What I am trying to do is something like the following:

myrdd = rdd(object_list)  # objects are now spread out
myrdd.map(my_map)
myrdd.reduce(my_reduce)
my_result = myrdd.result()

where my_result should now just be [f(obj1), f(obj2), ..., f(objn)]. I want to use Spark purely for speed; my script has been taking too long doing this in a for loop. Does anyone know how to do the above in Spark?

1 Answer


It would usually look like this:

myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).reduce(lambda a, b: a + b)
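Under the hood, map applies f to every element and reduce combines the results pairwise with +. Assuming f returns values that support + (numbers, say), this is the distributed equivalent of plain Python:

from functools import reduce
my_result = reduce(lambda a, b: a + b, map(f, object_list))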

There is a sum function for RDDs, so this could also be:

myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).sum()

However, this will give you a single number: f(obj1) + f(obj2) + ...
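For instance, assuming f squares its input:

sc.parallelize([1, 2, 3]).map(lambda x: x * x).sum()  # 1 + 4 + 9 = 14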

If you want an array of all the responses [f(obj1), f(obj2), ...], you would not use .reduce() or .sum() but instead use .collect():

myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).collect()
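For completeness, here is a minimal end-to-end sketch; the SparkContext setup and the toy f (squaring) are assumptions for illustration, so substitute your own f:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse an existing context if one is running

def f(obj):
    # stand-in for your real per-object computation
    return obj * obj

object_list = [1, 2, 3, 4]

myrdd = sc.parallelize(object_list)  # distribute the objects across workers
my_result = myrdd.map(f).collect()   # apply f in parallel, gather all results
print(my_result)  # [1, 4, 9, 16]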