Thanks for your help.
I followed your directives but the outcome was not as expected:
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df1 = spark.createDataFrame(d1, ['NAME', 'ID', 'DOB' , 'Height' , 'ShoeSize'])
df_dedupe = df1.dropDuplicates(['NAME', 'ID', 'DOB'])
df_reverse = df1.sort((["NAME", "ID", "DOB"]), ascending= False)
df_dedupe.join(df_reverse,['NAME','ID','DOB'],'inner')
df_dedupe.show(100, False)
The outcome was:
+-----+---+----------+------+--------+
|NAME |ID |DOB |Height|ShoeSize|
+-----+---+----------+------+--------+
|Bob |10 |1542189668|0 |0 |
|Alice|10 |1425298030|154 |39 |
+-----+---+----------+------+--------+
This shows the "Bob" row with the corrupted data (Height 0, ShoeSize 0).
In the end I changed my approach: I converted the DataFrame to pandas, dropped the duplicates there, and converted it back to Spark:
from pyspark.sql.types import StructType, StructField, StringType
p_schema = StructType([StructField('NAME',StringType(),True),StructField('ID',StringType(),True),StructField('DOB',StringType(),True),StructField('Height',StringType(),True),StructField('ShoeSize',StringType(),True)])
d1 = [('Bob', '10', '1542189668', '0', '0'), ('Alice', '10', '1425298030', '154', '39'), ('Bob', '10', '1542189668', '178', '42')]
df = spark.createDataFrame(d1, p_schema)
pdf = df.toPandas()
df_dedupe = pdf.drop_duplicates(subset=['NAME','ID','DOB'], keep='last', inplace=False)
df_spark = spark.createDataFrame(df_dedupe, p_schema)
df_spark.show(100, False)
This finally returned the correct "Bob" row:
+-----+---+----------+------+--------+
|NAME |ID |DOB |Height|ShoeSize|
+-----+---+----------+------+--------+
|Alice|10 |1425298030|154 |39 |
|Bob |10 |1542189668|178 |42 |
+-----+---+----------+------+--------+
Of course, I'd still prefer a purely Spark solution, but Spark DataFrames have no row index or guaranteed row order, which seems to make "keep the last duplicate" hard to express.
Thanks!
dropDuplicates does not have a keep parameter in PySpark.