
I have two dataframes, listed below, along with the expected output. The dataframes differ in the 'college' column, and the second dataframe is shorter by one row. I want to replace the 'college' column in df2 with the 'college' column from df1 wherever student_ID and student_NAME match. Does anyone know how to get the expected output?

import pyspark
from pyspark.sql import SparkSession
  
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of students  data
data = [["1", "Amit", "DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
  
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
df1 = spark.createDataFrame(data, columns)

data2 = [["1", "Amit", "jewf"],
         ["4", "sridevi", "wfv"],
        ["2", "Mohit", "efgew"],
        ["3", "rohith", "vwefv"],
         ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "wfvw"]]
  
# specify column names
columns2 = ['student_ID', 'student_NAME', 'college']
  
# creating a dataframe from the lists of data
df2 = spark.createDataFrame(data2, columns2)

# expected output:
#  [["1", "Amit", "DU"],
#  ["4", "sridevi", "LPU"],
#  ["2", "Mohit", "DU"],
#  ["3", "rohith", "BHU"],
#  ["5", "sravan", "IIT"]]
2
  • Have you tried join()? Commented Oct 4, 2022 at 8:57
  • I think this post might help with your problem? Commented Oct 4, 2022 at 9:02

2 Answers


This can be achieved with a workaround: using .withColumnRenamed(), rename df1.college to some other column name, say school_1. Join df1 and df2 on the student_ID and student_NAME columns and store the result in df3, then select the required columns from df3.


You should rename the columns first. The two dataframes shouldn't share column names: if both dataframes have a column with the same name, referencing that column after the join is ambiguous and raises an error.

df1 = df1.withColumnRenamed("student_ID", "df1_ID")\
    .withColumnRenamed("student_NAME", "df1_NAME")\
    .withColumnRenamed("college", "df1_college")
df1.show()

output:

+------+--------+-----------+
|df1_ID|df1_NAME|df1_college|
+------+--------+-----------+
|     1|    Amit|         DU|
|     2|   Mohit|         DU|
|     3|  rohith|        BHU|
|     4| sridevi|        LPU|
|     1|  sravan|       KLMP|
|     5| gnanesh|        IIT|
+------+--------+-----------+

Then we should use a left or right join. I use a left join here, since df2's rows drive the output.

from pyspark.sql.functions import col
final_df = df2.join(df1, (df1["df1_ID"]==df2["student_ID"])  & (df1["df1_NAME"]==df2["student_NAME"]) ,"left")
final_df = final_df.select("student_ID","student_NAME",col("df1_college").alias("college"))
final_df.show()

output:

+----------+------------+-------+
|student_ID|student_NAME|college|
+----------+------------+-------+
|         1|        Amit|     DU|
|         4|     sridevi|    LPU|
|         2|       Mohit|     DU|
|         3|      rohith|    BHU|
|         1|      sravan|   KLMP|
|         5|     gnanesh|    IIT|
+----------+------------+-------+
