Error using reducebykey: int object is unsubscriptable

Question

I'm getting the error "int object is unsubscriptable" when executing the following script:

element.reduceByKey( lambda x , y : x[1]+y[1])

where element is a key-value RDD whose values are tuples. Example input:

(A, (toto , 10))
(A, (titi , 30))
(5, (tata, 10))
(A, (toto, 10))

I understand that reduceByKey takes (K, V) pairs and applies a function to all the values of each key to get the final result of the reduce, as in the reduceByKey example in the Apache Spark documentation.

Any help please?

What output do you want? The problem is that x[1]+y[1] is an int, not a tuple (which is what reduceByKey expects in the next iteration).
– Shaido, Commented Jan 16, 2018 at 6:17
The expected output is (A, 50) (5, 10). But why should reduceByKey expect a tuple in the next iteration? Should it keep the same type as the values being reduced?
– Eliane PDC, Commented Jan 16, 2018 at 16:34

pault · Accepted Answer · 2018-01-17 18:13:11Z

Here is an example that will illustrate what's going on.

Let's consider what happens when you call reduce on a list with some function f:

reduce(f, [a,b,c]) = f(f(a,b),c)

If we take your example, f = lambda u, v: u[1] + v[1], then the above expression breaks down into:

reduce(f, [a,b,c]) = f(f(a,b),c) = f(a[1]+b[1],c)

But a[1] + b[1] is an integer, and integers have no __getitem__ method, so the outer call fails when it tries to index its first argument. Hence your error.
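The same failure can be reproduced without Spark. The following is a minimal sketch that uses Python's built-in functools.reduce on the values for key 'A' from the question; the variable names are only illustrative.

```python
from functools import reduce

# The values that share key 'A' in the question's data.
values = [('toto', 10), ('titi', 30), ('toto', 10)]

# Same shape as the lambda passed to reduceByKey in the question.
f = lambda u, v: u[1] + v[1]

try:
    # First step: f(('toto', 10), ('titi', 30)) returns the int 40.
    # Second step: f(40, ('toto', 10)) tries 40[1] and raises TypeError.
    reduce(f, values)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The first call succeeds because both arguments are tuples. The error only appears on the second call, when the int returned by the first call is passed back in as u.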

In general, the better approach (as shown below) is to use map() to first extract the data in the format that you want, and then apply reduceByKey().

An MCVE with your data:

element = sc.parallelize(
    [
        ('A', ('toto' , 10)),
        ('A', ('titi' , 30)),
        ('5', ('tata', 10)),
        ('A', ('toto', 10))
    ]
)

You can almost get your desired output with a more sophisticated reduce function:

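def add_tuple_values(a, b):
    # Each argument is either an original (name, count) tuple
    # or an int produced by an earlier reduction step.
    try:
        u = a[1]
    except TypeError:  # a is already an int
        u = a
    try:
        v = b[1]
    except TypeError:  # b is already an int
        v = b
    return u + v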

print(element.reduceByKey(add_tuple_values).collect())

Except that this results in:

[('A', 50), ('5', ('tata', 10))]

Why? Because there's only one value for the key '5', so there is nothing to reduce.
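To see that the reduce function is never invoked for a lone value, here is a small sketch in plain Python. It assumes functools.reduce as a stand-in for the per-key reduction, and the calls list is a hypothetical counter added only for illustration.

```python
from functools import reduce

calls = []  # records every invocation of the reduce function

def add_second_items(a, b):
    calls.append((a, b))
    return a[1] + b[1]

# A single value, like the one stored under key '5'.
single = reduce(add_second_items, [('tata', 10)])

print(single)      # ('tata', 10)
print(len(calls))  # 0
```

With one element there is no pair to combine, so the value comes back unchanged, tuple and all.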

For these reasons, it's best to first call map. To get your desired output, you could do:

>>> print(element.map(lambda x: (x[0], x[1][1])).reduceByKey(lambda u, v: u+v).collect())
[('A', 50), ('5', 10)]
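If you want to check this logic without a SparkContext, the map-then-reduce pattern can be emulated on a plain list. This is only a sketch: local_reduce_by_key is a hypothetical helper written for this illustration, not part of PySpark, and it ignores partitioning and ordering.

```python
from collections import defaultdict
from functools import reduce

def local_reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey on a list of (key, value) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce the values of each key independently.
    return {key: reduce(func, values) for key, values in grouped.items()}

element = [
    ('A', ('toto', 10)),
    ('A', ('titi', 30)),
    ('5', ('tata', 10)),
    ('A', ('toto', 10)),
]

# "map" step: keep only the number, so every value is already an int.
projected = [(key, value[1]) for key, value in element]

totals = local_reduce_by_key(projected, lambda u, v: u + v)
print(totals)  # {'A': 50, '5': 10}
```

Because the values are ints before the reduction starts, the reduce function takes ints and returns an int, and a key with a single value is already in the right shape. In PySpark, the projection step could also be written as element.mapValues(lambda v: v[1]).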

Update 1

Here's one more approach:

You could create tuples in your reduce function, and then call map to extract the value you want. (Essentially reverse the order of map and reduce.)

print(
    element.reduceByKey(lambda u, v: (0,u[1]+v[1]))
        .map(lambda x: (x[0], x[1][1]))
        .collect()
)
[('A', 50), ('5', 10)]
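This works because the reduce function now returns the same kind of object it receives. A minimal plain-Python sketch of that idea, again assuming functools.reduce as a stand-in for the per-key reduction (keep_tuple_shape is an illustrative name):

```python
from functools import reduce

def keep_tuple_shape(u, v):
    # Return a tuple so the result can be passed back in as an argument.
    return (0, u[1] + v[1])

many = reduce(keep_tuple_shape, [('toto', 10), ('titi', 30), ('toto', 10)])
single = reduce(keep_tuple_shape, [('tata', 10)])

print(many)                 # (0, 50)
print(single)               # ('tata', 10)
print(many[1], single[1])   # 50 10
```

Both the reduced value and the untouched single value are tuples with the number at index 1, so the final map step can extract it the same way for every key.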

Notes

Had there been at least 2 records for each key, using add_tuple_values() would have given you the correct output.

Bala · Accepted Answer · 2018-01-19 09:07:16Z

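Another approach is to use a DataFrame:

from pyspark.sql import functions as F  # F.sum is Spark's aggregate, not Python's built-in sum

rdd = sc.parallelize([('A', ('toto', 10)),('A', ('titi', 30)),('5', ('tata', 10)),('A', ('toto', 10))])
# Index into the pair instead of unpacking in the lambda signature, which Python 3 no longer allows.
rdd.map(lambda kv: (kv[0], kv[1][0], kv[1][1])).toDF(['a','b','c']).groupBy('a').agg(F.sum('c')).rdd.map(tuple).collect()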

>>>[(u'5', 10), (u'A', 50)]
