
So I have a dataset, and what I am doing is taking a column out of the dataset, then mapping it to a key-value pair. The problem is I can't sum my value:

position = 1
myData = dataSplit.map(lambda arr: (arr[position]))
print myData.take(10)
myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0])).groupByKey().map(lambda (Column, values): (Column, sum(float(values))))
print myData2.take(10)

This prints out the following:

[u'18964', u'18951', u'18950', u'18949', u'18960', u'18958', u'18956', u'19056', u'18948', u'18969']
TypeError: float() argument must be a string or a number

So when I changed it to:

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0])).groupByKey().map(lambda (Column, values): (values))

I see the following:

[<pyspark.resultiterable.ResultIterable object at 0x7fab6c43f1d0>]

If I do just:

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0]))

I get this:

[('Column', u'18964'), ('Column', u'18951'), ('Column', u'18950'), ('Column', u'18949'), ('Column', u'18960'), ('Column', u'18958'), ('Column', u'18956'), ('Column', u'19056'), ('Column', u'18948'), ('Column', u'18969')]

Any suggestions?

1 Answer


Solved. The fix is to convert each field to float before grouping, so that sum receives an iterable of numbers; the original code called float() on the entire ResultIterable that groupByKey produces, which is what raised the TypeError:

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", float(fields[0]))).groupByKey().map(lambda (Column, values): (Column, sum(values)))
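
As a side note, the same per-key sum can be computed with reduceByKey instead of groupByKey, which combines values map-side rather than shuffling every value for the key. A minimal sketch, assuming the same myData RDD of comma-separated strings as above:

# Minimal sketch (assumes myData is an RDD of comma-separated strings, as above).
# Cast each value to float up front, then sum per key with reduceByKey.
myData2 = (myData
           .map(lambda line: line.split(','))
           .map(lambda fields: ("Column", float(fields[0])))
           .reduceByKey(lambda a, b: a + b))
print(myData2.take(10))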