Add columns on a Pyspark Dataframe

Question

I have a PySpark DataFrame with this structure:

+----+----+----+----+----+
|user| A/B|   C| A/B|   C|
+----+----+----+----+----+
|   1|   0|   1|   1|   2|
|   2|   0|   2|   4|   0|
+----+----+----+----+----+

I originally had two dataframes, which I outer-joined using user as the key, so there may also be null values. I can't find a way to sum the columns that share a name in order to get a dataframe like this:

+----+----+----+
|user| A/B|   C| 
+----+----+----+
|  1 |   1|   3| 
|  2 |   4|   2| 
+----+----+----+

Also note that there could be many columns with the same name, so selecting each column explicitly is not an option. In pandas this was possible by using "user" as the index and then adding both dataframes. How can I do this in Spark?
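
For reference, the pandas approach described above might look like the following sketch. The names df1 and df2 are assumed names for the two original dataframes, and the values are the ones from the tables in this question.

```python
import pandas as pd

# Sample data taken from the question; df1 and df2 are assumed names.
df1 = pd.DataFrame({"user": [1, 2], "A/B": [0, 0], "C": [1, 2]})
df2 = pd.DataFrame({"user": [1, 2], "A/B": [1, 4], "C": [2, 0]})

# Index both frames by user, then add them element-wise.
# fill_value=0 treats a value missing from only one frame as 0.
result = df1.set_index("user").add(df2.set_index("user"), fill_value=0)

print(result.reset_index())
```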

Shivansh · Accepted Answer · 2016-10-20 20:25:55Z

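I have a workaround for this: rename the non-key columns of the first dataframe before the join, so that the joined output has no duplicate column names.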

// Append "_1" to every column name of df1 except the join key "user"
val dataFrameOneColumns = df1.columns.map(a => if (a == "user") a else a + "_1")
val updatedDF = df1.toDF(dataFrameOneColumns: _*)

Now join updatedDF with the second dataframe; the output will contain the values under different column names.

Then build a list of tuples pairing each original column name with its renamed counterpart:

val newlist = df1.columns.filter(_ != "user").zip(dataFrameOneColumns.filter(_ != "user"))

Then combine the values of the two columns within each tuple to get the desired output.

PS: I am guessing you can write the combining logic yourself, so I am not spoon-feeding it!
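
Since the question is about PySpark, the steps above could be sketched in Python roughly as follows. This is not the answerer's code: the helper names pair_columns and sum_matching_columns are hypothetical, df1 and df2 are assumed to share the same non-key columns, and nulls are assumed to count as 0 (so a value that is null on both sides becomes 0).

```python
def pair_columns(columns, key="user", suffix="_1"):
    """Return (original, renamed) name pairs for every non-key column."""
    return [(c, c + suffix) for c in columns if c != key]


def sum_matching_columns(df1, df2, key="user", suffix="_1"):
    """Outer-join df1 and df2 on `key` and add the columns that share a name.

    Assumes df1 and df2 have the same non-key columns (hypothetical helper).
    """
    # Imported here so pair_columns can be used without a Spark installation.
    from pyspark.sql import functions as F

    pairs = pair_columns(df1.columns, key, suffix)

    # Step 1: rename df1's non-key columns so the join output is unambiguous.
    renamed = df1.toDF(*[c if c == key else c + suffix for c in df1.columns])

    # Step 2: outer join on the key; users missing from one side get nulls.
    joined = renamed.join(df2, on=key, how="outer")

    # Step 3: add each pair of columns, treating null as 0.
    # Backticks guard against names Spark would otherwise try to parse.
    summed = [
        (
            F.coalesce(F.col(f"`{new}`"), F.lit(0))
            + F.coalesce(F.col(f"`{old}`"), F.lit(0))
        ).alias(old)
        for old, new in pairs
    ]
    return joined.select(key, *summed)
```

With the two original dataframes, a call such as sum_matching_columns(df1, df2) would return one row per user with a single summed column per shared name. Building the expressions in a loop avoids listing each column by hand, which matters when there are many matching columns.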


2 Comments

marlanbar Over a year ago

I got the first step, renaming the first dataframe's columns with _1 as a suffix, but you lost me a little in the second step. Could you please rewrite it in Python (PySpark)? You wrote it in Scala.

Shivansh Over a year ago

Sorry, I am not very familiar with Python, but I can explain the concept: I am making tuples of the column names that match, and then applying a sum to the two values of each tuple to get the output.
