I have a scenario that requires finding the row-wise sum of several columns in a DataFrame, as follows:

ID  DEPT  [..]  SUB1  SUB2  SUB3  SUB4  SUM1
1   PHY         50    20    30    30    130
2   COY         52    62    63    34    211
3   DOY         53    52    53    84
4   ROY         56    52    53    74
5   SZY         57    62    73    54

I need to find the row-wise sum of SUB1, SUB2, SUB3, and SUB4 for each row and add it as a new column, SUM1. The ordinal position of the column SUB1 in the DataFrame is 16.

1 Answer

You can use the Python built-in sum to add up the columns:

import pyspark.sql.functions as F

col_list = ['SUB1', 'SUB2', 'SUB3', 'SUB4']
# or, since SUB1 is at ordinal (1-based) position 16, i.e. 0-based index 15:
# col_list = df.columns[15:19]

# the built-in sum folds the columns together with +
df2 = df.withColumn(
    'SUM1',
    sum([F.col(c) for c in col_list])
)
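
Note that the built-in sum starts from 0 and folds the columns together with +, so summing N columns builds an expression nested N levels deep. That is fine for a handful of columns, but, as the comments below show, very wide sums can hit the analyzer's iteration limit.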

3 Comments

Thank you. There are 106 columns to be summed. It works well with fewer than 100 columns, but for more than 100 columns it shows the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value., tree:
Have you tried raising spark.sql.optimizer.maxIterations above its default of 100?
Maybe set it to a larger value, e.g. 200 or 1000, using spark.sql("set spark.sql.analyzer.maxIterations = 200").
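
If raising the limit is not desirable, another option is to sidestep the deeply nested + chain entirely. Below is a minimal sketch of both approaches; it assumes a SparkSession named spark, the df and col_list from the answer above, and Spark 3.1+ for pyspark.sql.functions.aggregate:

import pyspark.sql.functions as F

# Option 1 (from the comments above): raise the analyzer's iteration limit
spark.sql("set spark.sql.analyzer.maxIterations = 200")

# Option 2: fold an array of the columns with a higher-order function
# (Spark 3.1+). This keeps the expression tree shallow instead of building
# a 100-plus-level chain of nested + operators.
df2 = df.withColumn(
    'SUM1',
    F.aggregate(
        F.array(*[F.col(c) for c in col_list]),
        F.lit(0),               # initial accumulator; cast it to match the
                                # column type if needed, e.g. F.lit(0.0)
        lambda acc, x: acc + x  # add each element to the accumulator
    )
)

As with the + chain, a null in any SUB column makes SUM1 null for that row; wrap each column in F.coalesce(F.col(c), F.lit(0)) if nulls should count as zero.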
