I would like to know why I am getting a TypeError when trying to calculate the total number of characters across all values for each name (key) in the data below, using the reduceByKey function.
data = [("Cassavetes, Frank", 'Orange'),
("Cassavetes, Frank", 'Pineapple'),
("Knight, Shirley (I)", 'Apple'),
("Knight, Shirley (I)", 'Blueberries'),
("Knight, Shirley (I)", 'Orange'),
("Yip, Françoise", 'Grapes'),
("Yip, Françoise", 'Apple'),
("Yip, Françoise", 'Strawberries'),
("Danner, Blythe", 'Pear'),
("Buck (X)", 'Kiwi')]
In an attempt to do this, I tried to execute the code below:
rdd = spark.sparkContext.parallelize(data)
reducedRdd = rdd.reduceByKey( lambda a,b: len(a) + len(b) )
reducedRdd.collect()
The code above gives me the following error:
TypeError: object of type 'int' has no len()
The output I expected is as follows:
[('Yip, Françoise', 14), ('Cassavetes, Frank', 15), ('Knight, Shirley (I)', 8), ('Danner, Blythe', 'Pear'), ('Buck (X)', 'Kiwi')]
I have noticed that the code below produces the desired results:
reducedRdd = rdd.reduceByKey( lambda a,b: len(str(a)) + len(str(b)) )
Though I am not sure why I would need to convert the variables a and b into strings if they are strings to begin with. For example, I don't see how the 'Orange' in ("Cassavetes, Frank", 'Orange') could be considered an int.
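My current guess is that reduceByKey combines the values pairwise, so once the lambda returns an int, that int gets fed back in as one of the arguments on the next call. Here is a minimal sketch of what I suspect happens for "Knight, Shirley (I)", simulated locally with functools.reduce (the pairing order is just my assumption; Spark may combine values in a different order across partitions):

from functools import reduce

values = ['Apple', 'Blueberries', 'Orange']  # values for "Knight, Shirley (I)"

# First call:  len('Apple') + len('Blueberries') -> 16 (an int)
# Second call: len(16) + len('Orange')           -> TypeError: object of type 'int' has no len()
reduce(lambda a, b: len(a) + len(b), values)

If that guess is right, it would also explain why wrapping the arguments in str() masks the error: str(16) is '16', which does have a length. Is this actually how reduceByKey behaves?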
PS: I know I can use a number of other functions to achieve the desired results, but I specifically want to know why I am having issues when doing this with the reduceByKey function.
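For reference, one of the alternatives I had in mind maps each value to its length first, so the reduce step only ever sees ints (just a sketch of the workaround I would otherwise use):

reducedRdd = rdd.mapValues(len).reduceByKey(lambda a, b: a + b)
reducedRdd.collect()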