1

Suppose I have a DataFrame with multiple columns. I want to iterate over each column, do some calculation, and update that column. Is there a good way to do that?

2 Answers

4

@rogue-one has already answered your question; you just need to adapt that answer to your requirements.

The following solution does not use a Window function.

import spark.implicits._ // needed for toDF; already in scope inside spark-shell

val df = List(
  (2, 28),
  (1, 21),
  (7, 42)
).toDF("col1", "col2")

Your input DataFrame should look like this:

+----+----+
|col1|col2|
+----+----+
|2   |28  |
|1   |21  |
|7   |42  |
+----+----+

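Now, to compute columnValue / sumOfColumnValues for every column, do the following:

import org.apache.spark.sql.functions.{col, sum}

val columnsModify = df.columns.map { name =>
  // Total of this column; note that first() runs one Spark job per column
  val total = df.select(sum(col(name))).first().get(0)
  // Divide every value by the column total and keep the original column name
  (col(name) / total).as(name)
}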

df.select(columnsModify: _*).show(false)

You should get the following output:

+----+-------------------+
|col1|col2               |
+----+-------------------+
|0.2 |0.3076923076923077 |
|0.1 |0.23076923076923078|
|0.7 |0.46153846153846156|
+----+-------------------+
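
The snippet above calls first() once per column, which becomes slow for a DataFrame with many columns. A minimal sketch of a variant that collects all column sums in a single aggregation is shown below. It assumes every column is numeric; the local SparkSession settings and the names totals, normalized and result are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("normalize-columns").getOrCreate()
import spark.implicits._

val df = List((2, 28), (1, 21), (7, 42)).toDF("col1", "col2")

// One job: a single row holding the sum of every column, in df.columns order.
val totals = df.select(df.columns.map(c => sum(col(c)).as(c)): _*).first()

// Divide each column by its own total, keeping the original column name.
val normalized = df.columns.zipWithIndex.map { case (name, i) =>
  (col(name) / totals.get(i)).as(name)
}

val result = df.select(normalized: _*)
result.show(false)
```

Because the totals come back as one Row, the number of Spark jobs no longer grows with the number of columns.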

2 Comments

Hi Ramesh, could you explain a little what .map(col) does here? Thanks!
.map(col) turns each column name (a String) into a Column object. Operations such as / are defined on Column, not on String, so the division won't compile without it.
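
To make the difference concrete, here is a small sketch of what df.columns and .map(col) each return. The session setup and the names names, columns and halved are assumptions made for illustration only.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-col-demo").getOrCreate()
import spark.implicits._

val df = List((2, 28), (1, 21), (7, 42)).toDF("col1", "col2")

// df.columns yields plain strings: Array("col1", "col2")
val names: Array[String] = df.columns

// col turns each name into a Column expression that Spark can evaluate
val columns: Array[Column] = names.map(col)

// Arithmetic such as `/` is defined on Column, so this compiles;
// calling `/` on the String names would not.
val halved = df.select(names.zip(columns).map { case (n, c) => (c / 2).as(n) }: _*)
halved.show(false)
```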
3

Update: In the example below I have a DataFrame with two integer columns, c1 and c2. Each column's values are divided by the sum of that column.

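import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
val df = Seq((1,15), (2,20), (3,30)).toDF("c1","c2")
// Divide each value by its column total, computed as a window sum over the whole DataFrame
val result = df.columns.foldLeft(df)((acc, colname) => acc.withColumn(colname, acc(colname)/sum(acc(colname)).over(Window.orderBy(lit(1)))))

Output:

scala> result.show()
+-------------------+-------------------+
|                 c1|                 c2|
+-------------------+-------------------+
|0.16666666666666666|0.23076923076923078|
| 0.3333333333333333| 0.3076923076923077|
|                0.5|0.46153846153846156|
+-------------------+-------------------+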
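
The foldLeft above starts from the original DataFrame and replaces one column per step, so the same code handles any number of columns. As a rough sketch of that pattern without Spark, the example below uses a plain Map as a hypothetical stand-in for a DataFrame (table and normalized are illustrative names) and divides each value by its column sum, as the question asks.

```scala
// Hypothetical stand-in for a DataFrame: column name -> column values.
val table: Map[String, Vector[Double]] = Map(
  "c1" -> Vector(1.0, 2.0, 3.0),
  "c2" -> Vector(15.0, 20.0, 30.0)
)

// foldLeft starts from the original table and replaces one column per step,
// much as acc.withColumn(colname, ...) replaces one DataFrame column per step.
val normalized = table.keys.foldLeft(table) { (acc, name) =>
  val values = acc(name)
  val total = values.sum
  acc.updated(name, values.map(_ / total))
}

println(normalized("c1"))
println(normalized("c2"))
```

Each step only touches the column named in that step, so the accumulator ends up with every column normalized.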

3 Comments

If I have a df with 1000 columns, I cannot hand-write all the match functions. Is there a better way for this situation? Thanks!
It depends on what you are doing with each column. If you have to do the same operation on all columns, then it's simple. If you have to do something unique for each column, then you will have to handle each column separately.
What I need to do is calculate the sum of each column and replace each data point in the column with (original number/sum). Basically, the operation is the same for every column.
