
I have two dataframes, listed below, along with the expected output. The dataframes differ in the 'college' column, and the second dataframe is shorter by one row. I want to replace the 'college' column in df2 with the 'college' column from df1 wherever student_ID and student_NAME match. Does anyone know how to get the expected output?

import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "Amit", "DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
  
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
df1 = spark.createDataFrame(data, columns)

data2 = [["1", "Amit", "jewf"],
         ["4", "sridevi", "wfv"],
        ["2", "Mohit", "efgew"],
        ["3", "rohith", "vwefv"],
         ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "wfvw"]]
  
# specify column names
columns2 = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
df2 = spark.createDataFrame(data2, columns2)

# expected output:
#  [["1", "Amit", "DU"],
#  ["4", "sridevi", "LPU"],
#  ["2", "Mohit", "DU"],
#  ["3", "rohith", "BHU"],
#  ["5", "sravan", "IIT"]]
2
  • Have you tried join()? Commented Oct 4, 2022 at 8:57
  • I think this post might help with your problem? Commented Oct 4, 2022 at 9:02

2 Answers


This can be achieved with a workaround: using .withColumnRenamed(), rename df1.college to some other column name, say school_1. Join df1 and df2 on the student_ID and student_NAME columns and store the result in df3, then select the required columns from df3.


You should rename the columns first. The two dataframes shouldn't share column names: if both dataframes have a column with the same name, referencing that column after the join is ambiguous and raises an error.

df1 = df1.withColumnRenamed("student_ID", "df1_ID")\
    .withColumnRenamed("student_NAME", "df1_NAME")\
    .withColumnRenamed("college", "df1_college")
df1.show()

output:

+------+--------+-----------+
|df1_ID|df1_NAME|df1_college|
+------+--------+-----------+
|     1|    Amit|         DU|
|     2|   Mohit|         DU|
|     3|  rohith|        BHU|
|     4| sridevi|        LPU|
|     1|  sravan|       KLMP|
|     5| gnanesh|        IIT|
+------+--------+-----------+

Then we should use a left or right join. I use a left join here, since df2's rows drive the output.

from pyspark.sql.functions import col
final_df = df2.join(df1, (df1["df1_ID"]==df2["student_ID"])  & (df1["df1_NAME"]==df2["student_NAME"]) ,"left")
final_df = final_df.select("student_ID","student_NAME",col("df1_college").alias("college"))
final_df.show()

output:

+----------+------------+-------+
|student_ID|student_NAME|college|
+----------+------------+-------+
|         1|        Amit|     DU|
|         4|     sridevi|    LPU|
|         2|       Mohit|     DU|
|         3|      rohith|    BHU|
|         1|      sravan|   KLMP|
|         5|     gnanesh|    IIT|
+----------+------------+-------+
