I have a scenario that requires finding the row-wise sum of several columns in a DataFrame, as follows:

ID  DEPT  [..]  SUB1  SUB2  SUB3  SUB4  SUM1
1   PHY         50    20    30    30    130
2   COY         52    62    63    34    211
3   DOY         53    52    53    84
4   ROY         56    52    53    74
5   SZY         57    62    73    54

I need to find the row-wise sum of SUB1, SUB2, SUB3, and SUB4 for each row and add it as a new column, SUM1. The ordinal position of the column SUB1 in the DataFrame is 16.

1 Answer

You can use the Python built-in sum to add up the columns:

import pyspark.sql.functions as F

col_list = ['SUB1', 'SUB2', 'SUB3', 'SUB4']
# or, since SUB1 is at ordinal (1-based) position 16, i.e. 0-based index 15:
# col_list = df.columns[15:19]

# the built-in sum folds the columns together with +
df2 = df.withColumn(
    'SUM1',
    sum([F.col(c) for c in col_list])
)
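
Note that the built-in sum starts from 0 and folds the columns together with +, so summing N columns builds an expression nested N levels deep. That is fine for a handful of columns, but, as the comments below show, very wide sums can hit the analyzer's iteration limit.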

3 Comments

Thank you. There are 106 columns to be summed. It works well with fewer than 100 columns, but for more than 100 columns it shows the following error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value., tree:
Have you tried raising spark.sql.optimizer.maxIterations above its default of 100?
Maybe set it to a larger value, e.g. 200 or 1000, using spark.sql("set spark.sql.analyzer.maxIterations = 200").
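
If raising the limit is not desirable, another option is to sidestep the deeply nested + chain entirely. Below is a minimal sketch of both approaches; it assumes a SparkSession named spark, the df and col_list from the answer above, and Spark 3.1+ for pyspark.sql.functions.aggregate:

import pyspark.sql.functions as F

# Option 1 (from the comments above): raise the analyzer's iteration limit
spark.sql("set spark.sql.analyzer.maxIterations = 200")

# Option 2: fold an array of the columns with a higher-order function
# (Spark 3.1+). This keeps the expression tree shallow instead of building
# a 100-plus-level chain of nested + operators.
df2 = df.withColumn(
    'SUM1',
    F.aggregate(
        F.array(*[F.col(c) for c in col_list]),
        F.lit(0),               # initial accumulator; cast it to match the
                                # column type if needed, e.g. F.lit(0.0)
        lambda acc, x: acc + x  # add each element to the accumulator
    )
)

As with the + chain, a null in any SUB column makes SUM1 null for that row; wrap each column in F.coalesce(F.col(c), F.lit(0)) if nulls should count as zero.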
