Assuming that I have the following Spark DataFrame df:
+-----+------+------+------+
| id  | col1 | col2 | col3 |
+-----+------+------+------+
| "a" |  10  |   5  |  75  |
| "b" |  20  |   3  |   3  |
| "c" |  30  |   2  |  65  |
+-----+------+------+------+
I want to create a new dataframe new_df that contains:
1) the id of each row
2) the value of the division between col1 / col2 and
3) the value of the division between col3 / col1
The desired output for new_df is:
+-----+--------+--------+
| id  | col1_2 | col3_1 |
+-----+--------+--------+
| "a" |  2     |  7.5   |
| "b" |  6.67  |  0.15  |
| "c" | 15     |  2.17  |
+-----+--------+--------+
I have already tried
new_df = df.select("id").withColumn("col1_2", df["col1"] / df["col2"])
without any luck.
The `select` you are doing returns a DataFrame containing only the `id` column, so the subsequent `withColumn` operation fails because `col1` and `col2` are no longer available. You can switch the order: add the derived column first, then select the columns you want to keep: `df.withColumn("col1_2", df["col1"] / df["col2"]).select("id", "col1_2")`