I have two dataframes, shown below along with the expected output. The dataframes differ in the 'college' column, and the second dataframe is shorter by one row. I want to replace the 'college' column in df2 with the 'college' column from df1 wherever student_ID and student_NAME match. Does anyone know how to get the expected output?
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data = [["1", "Amit", "DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]
# specify column names
columns = ['student_ID', 'student_NAME', 'college']
# creating a dataframe from the lists of data
df1 = spark.createDataFrame(data, columns)
data2 = [["1", "Amit", "jewf"],
         ["4", "sridevi", "wfv"],
         ["2", "Mohit", "efgew"],
         ["3", "rohith", "vwefv"],
         ["1", "sravan", "KLMP"],
         ["5", "gnanesh", "wfvw"]]
# specify column names
columns2 = ['student_ID', 'student_NAME', 'college']
# creating a dataframe from the lists of data
df2 = spark.createDataFrame(data2, columns2)
# expected output:
# [["1", "Amit", "DU"],
# ["4", "sridevi", "LPU"],
# ["2", "Mohit", "DU"],
# ["3", "rohith", "BHU"],
# ["5", "sravan", "IIT"]]
join()