
I would like to know why I am getting a type error when trying to calculate the total number of characters across all values for each name (key) in the data below, using the reduceByKey function.

data = [("Cassavetes, Frank", 'Orange'),
("Cassavetes, Frank", 'Pineapple'),
("Knight, Shirley (I)", 'Apple'),
("Knight, Shirley (I)", 'Blueberries'),
("Knight, Shirley (I)", 'Orange'),
("Yip, Françoise", 'Grapes'),
("Yip, Françoise", 'Apple'),
("Yip, Françoise", 'Strawberries'),
("Danner, Blythe", 'Pear'),
("Buck (X)", 'Kiwi')]

In an attempt to do this I tried to execute the code below:

rdd = spark.sparkContext.parallelize(data)
reducedRdd = rdd.reduceByKey( lambda a,b: len(a) + len(b) )
reducedRdd.collect()

The code above gives me the following error:

TypeError: object of type 'int' has no len()

The output I expected is as follows:

[('Yip, Françoise', 14), ('Cassavetes, Frank', 15), ('Knight, Shirley (I)', 8), ('Danner, Blythe', 'Pear'), ('Buck (X)', 'Kiwi')]

I have noticed the code below produces the desired results:

reducedRdd = rdd.reduceByKey( lambda a,b: len(str(a)) + len(str(b)) )

Though I am not sure why I would need to convert the variables a and b into strings if they are strings to begin with. For example, I am not sure how the 'Orange' in ("Cassavetes, Frank", 'Orange') can be considered an int.

P.S. I know I can use a number of other functions to achieve the desired results, but I specifically want to know why I am having issues trying to do this using the reduceByKey function.

1 Answer

The problem in your code is that the function you pass to reduceByKey doesn't return the same data type as the RDD's values. The lambda returns an int, while your values are strings.

To understand this, consider how the reduce works: the function is applied to the first two values, then to that result and the third value, and so on. So on the second application, a is already an int (the previous result), and len(a) fails.
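You can reproduce this outside Spark with plain Python's functools.reduce, which mimics what reduceByKey does with the values of a single key (this is a sketch of the mechanism, not PySpark itself):

```python
from functools import reduce

# Simulate reduceByKey for one key, e.g. "Yip, Françoise":
values = ['Grapes', 'Apple', 'Strawberries']

# First call: len('Grapes') + len('Apple') = 11 (an int).
# Second call: a = 11, and len(11) raises the error you saw.
try:
    reduce(lambda a, b: len(a) + len(b), values)
except TypeError as e:
    print(e)  # object of type 'int' has no len()
```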

Note that even the version that worked for you isn't actually correct. For example, it returns ('Danner, Blythe', 'Pear') instead of ('Danner, Blythe', 4), because keys with a single value are never passed through the function at all.
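The same functools.reduce simulation shows both problems with the str() version: intermediate ints get converted back to short strings, and single values pass through untouched (again a plain-Python sketch, not PySpark):

```python
from functools import reduce

str_len = lambda a, b: len(str(a)) + len(str(b))

# len('Grapes') + len('Apple') = 11, then len(str(11)) + len('Strawberries') = 2 + 12 = 14,
# not the 23 characters actually present in the three values.
print(reduce(str_len, ['Grapes', 'Apple', 'Strawberries']))  # 14

# A key with a single value never invokes the function at all:
print(reduce(str_len, ['Pear']))  # 'Pear'
```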

You should first transform the values into their lengths, then reduce by key:

reducedRdd = rdd.mapValues(lambda x: len(x)).reduceByKey(lambda a, b: a + b)
print(reducedRdd.collect())
# [('Cassavetes, Frank', 15), ('Danner, Blythe', 4), ('Buck (X)', 4), ('Knight, Shirley (I)', 22), ('Yip, Françoise', 23)] 

2 Comments

If the lambda function is applied to the first 2 values, why is it that when I change the lambda to lambda a,b: a.upper() + b.upper(), all 3 values attributed to "Knight, Shirley (I)" have the upper function applied, not just the first two? (If that can work, why isn't my original lambda with the len function applied to the third value too, as it is in this example?)
@KvothesLute the function lambda a,b: a.upper() + b.upper() returns an upper-case string, and you're concatenating strings, so there's no problem chaining it. In your first function, len(a) + len(b) returns an int, not a string.
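To see the difference concretely, the upper-case lambda is type-preserving (str in, str out), so every chained application still receives a string (plain-Python sketch of the per-key reduce, not PySpark):

```python
from functools import reduce

values = ['Apple', 'Blueberries', 'Orange']

# Each application returns a str, so the next call's `a` still supports .upper():
# 'APPLE' + 'BLUEBERRIES' -> 'APPLEBLUEBERRIES', then .upper() + 'ORANGE'.
print(reduce(lambda a, b: a.upper() + b.upper(), values))  # APPLEBLUEBERRIESORANGE
```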
