I need to collapse each column of a DataFrame into a single value: the columns stay as they are, but each one should hold the sum of all its values. For this purpose I intend to use this function:

import pyspark.sql.functions as f

def sum_col(data, col):
    return data.select(f.sum(col)).collect()[0][0]

I was thinking of doing something like this:

data = data.map(lambda current_col: sum_col(data, current_col))

Is this doable, or do I need another way to sum the values of each column?

  • You could maybe do it with a UDF: a user-defined function that generates another column from the result of applying a function to the df. Commented Jul 1, 2020 at 5:54

2 Answers

You can achieve this with the sum function, applied to every column in a single select:

import pyspark.sql.functions as f
df.select(*[f.sum(cols).alias(cols) for cols in df.columns]).show()

+----+---+---+
|val1|  x|  y|
+----+---+---+
|  36| 29|159|
+----+---+---+
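
The same aggregation could also be written as SQL expression strings. This is only a sketch, not part of the answer above: `build_sum_exprs` is a hypothetical helper name, and it assumes the column names contain no backticks.

```python
def build_sum_exprs(columns):
    """Return one Spark SQL aggregate expression per column name."""
    # Backticks quote each name so that names with spaces or dots still parse.
    return [f"sum(`{name}`) AS `{name}`" for name in columns]


# Column names taken from the example table above.
exprs = build_sum_exprs(["val1", "x", "y"])
print(exprs)
```

With a DataFrame `df`, these strings would be passed to `df.selectExpr(*exprs)`, which should yield the same single-row result as the select above.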

To sum all your columns into a new column, you can use a list comprehension with Python's built-in sum function:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tst = spark.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)], schema=['val1','x','y'])
# Python's built-in sum adds the Column objects together, giving one total per row
tst_sum = tst.withColumn("sum_col", sum([tst[coln] for coln in tst.columns]))

Result:

tst_sum.show()
+----+---+---+-------+
|val1|  x|  y|sum_col|
+----+---+---+-------+
|  10|  7| 14|     31|
|   5|  1|  4|     10|
|   9|  8| 10|     27|
|   2|  6| 90|     98|
|   7|  2| 30|     39|
|   3|  5| 11|     19|
+----+---+---+-------+

Note: if you have imported the sum function from PySpark with from pyspark.sql.functions import sum, it shadows Python's built-in sum and the code above will fail. Import it under a different name instead, for example from pyspark.sql.functions import sum as sum_pyspark.

1 Comment

Thanks for the answer. However, I need to get a single value for each column. In this case, val1 should have 1 value, namely 36 (the sum of all the values of val1 is 36). So, I need to replace the values of val1 with a single value which is the sum of all the original values.
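
The difference this comment points out can be checked in plain Python, without a Spark session. The sketch below reuses the sample rows from the answer; the variable names are illustrative only.

```python
# Sample rows and column names from the answer above.
rows = [(10, 7, 14), (5, 1, 4), (9, 8, 10), (2, 6, 90), (7, 2, 30), (3, 5, 11)]
names = ["val1", "x", "y"]

# Column-wise totals: one value per column, which is what the question asks for.
column_totals = {name: sum(values) for name, values in zip(names, zip(*rows))}

# Row-wise totals: one value per row, which is what sum_col in this answer holds.
row_totals = [sum(row) for row in rows]

print(column_totals)  # {'val1': 36, 'x': 29, 'y': 159}
print(row_totals)     # [31, 10, 27, 98, 39, 19]
```

The column-wise totals match the single-row output of the first answer, while the row-wise totals match the sum_col column shown here.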
