
I would like to know why I am getting a type error when trying to calculate the total number of characters across all values for each name (key) in the data below, using the reduceByKey function.

data = [("Cassavetes, Frank", 'Orange'),
("Cassavetes, Frank", 'Pineapple'),
("Knight, Shirley (I)", 'Apple'),
("Knight, Shirley (I)", 'Blueberries'),
("Knight, Shirley (I)", 'Orange'),
("Yip, Françoise", 'Grapes'),
("Yip, Françoise", 'Apple'),
("Yip, Françoise", 'Strawberries'),
("Danner, Blythe", 'Pear'),
("Buck (X)", 'Kiwi')]

In an attempt to do this I tried to execute the code below:

rdd = spark.sparkContext.parallelize(data)
reducedRdd = rdd.reduceByKey( lambda a,b: len(a) + len(b) )
reducedRdd.collect()

The code above gives me the following error:

TypeError: object of type 'int' has no len()

The output I expected is as follows:

[('Yip, Françoise', 14), ('Cassavetes, Frank', 15), ('Knight, Shirley (I)', 8), ('Danner, Blythe', 'Pear'), ('Buck (X)', 'Kiwi')]

I have noticed the code below produces the desired results:

reducedRdd = rdd.reduceByKey( lambda a,b: len(str(a)) + len(str(b)) )

Though I am not sure why I would need to convert the variables a and b into strings if they are strings to begin with. For example, I am not sure how the 'Orange' in ("Cassavetes, Frank", 'Orange') can be considered an int.

P.S. I know I can use a number of other functions to achieve the desired results, but I specifically want to know why I am having issues trying to do this using the reduceByKey function.

1 Answer

The problem in your code is that the function you pass to reduceByKey doesn't return the same data type as the RDD's values. The lambda returns an int, while your values are strings.

To understand this, consider how the reduce works: the function is applied to the first two values, then to that result and the third value, and so on. So on the second application, a is already an int (the previous result), and len(a) fails.
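You can reproduce this outside Spark with plain Python's functools.reduce, which mimics what reduceByKey does with the values of a single key (this is a sketch of the mechanism, not PySpark itself):

```python
from functools import reduce

# Simulate reduceByKey for one key, e.g. "Yip, Françoise":
values = ['Grapes', 'Apple', 'Strawberries']

# First call: len('Grapes') + len('Apple') = 11 (an int).
# Second call: a = 11, and len(11) raises the error you saw.
try:
    reduce(lambda a, b: len(a) + len(b), values)
except TypeError as e:
    print(e)  # object of type 'int' has no len()
```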

Note that even the version that worked for you isn't actually correct. For example, it returns ('Danner, Blythe', 'Pear') instead of ('Danner, Blythe', 4), because keys with a single value are never passed through the function at all.
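The same functools.reduce simulation shows both problems with the str() version: intermediate ints get converted back to short strings, and single values pass through untouched (again a plain-Python sketch, not PySpark):

```python
from functools import reduce

str_len = lambda a, b: len(str(a)) + len(str(b))

# len('Grapes') + len('Apple') = 11, then len(str(11)) + len('Strawberries') = 2 + 12 = 14,
# not the 23 characters actually present in the three values.
print(reduce(str_len, ['Grapes', 'Apple', 'Strawberries']))  # 14

# A key with a single value never invokes the function at all:
print(reduce(str_len, ['Pear']))  # 'Pear'
```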

You should first transform the values into their lengths, then reduce by key:

reducedRdd = rdd.mapValues(lambda x: len(x)).reduceByKey(lambda a, b: a + b)
print(reducedRdd.collect())
# [('Cassavetes, Frank', 15), ('Danner, Blythe', 4), ('Buck (X)', 4), ('Knight, Shirley (I)', 22), ('Yip, Françoise', 23)] 

2 Comments

If the lambda function is applied to the first 2 values, why is it that when I change the lambda to lambda a,b: a.upper() + b.upper(), all 3 values attributed to "Knight, Shirley (I)" have the upper function applied, not just the first two? (If that can work, why isn't my original lambda with the len function applied to the third value too, as it is in this example?)
@KvothesLute the function lambda a,b: a.upper() + b.upper() returns an upper-case string, and you're concatenating strings, so there's no problem chaining it. In your first function, len(a) + len(b) returns an int, not a string.
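To see the difference concretely, the upper-case lambda is type-preserving (str in, str out), so every chained application still receives a string (plain-Python sketch of the per-key reduce, not PySpark):

```python
from functools import reduce

values = ['Apple', 'Blueberries', 'Orange']

# Each application returns a str, so the next call's `a` still supports .upper():
# 'APPLE' + 'BLUEBERRIES' -> 'APPLEBLUEBERRIES', then .upper() + 'ORANGE'.
print(reduce(lambda a, b: a.upper() + b.upper(), values))  # APPLEBLUEBERRIESORANGE
```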
