I need help filtering one dataframe, using data from another dataframe as the criterion, without using SQL commands.

df1

id ; create ; change ; name
1  ;2020-12-01;;Paul
2  ;2020-12-02;;Mary
3  ;2020-12-03;;David
4  ;2020-12-04;;Marley


df2

id ; create ; change ; name
1  ;2020-12-01;2020-12-30;Paul
2  ;2020-12-02;;Mary
3  ;2020-12-03;;David
4  ;2020-12-04;2020-12-30;Marley
5  ;2020-12-30;;Ted

df3

I want to create the df3 dataframe from df2 with the following rule: a row whose id exists in df1 and whose change column is filled in with the date 2020-12-30 must not be inserted into df3.

id ; create ; change ; name
2  ;2020-12-02;;Mary
3  ;2020-12-03;;David

1 Answer

You can first do a semi-join of df2 with df1, and then filter the change column.

df3 = df2.join(df1, ['id', 'create', 'name'], 'semi') \
         .filter("change is null or change != '2020-12-30'") \
         .select('id', 'create', 'change', 'name')

df3.show()
+---+----------+------+-----+
| id|    create|change| name|
+---+----------+------+-----+
|  2|2020-12-02|  null| Mary|
|  3|2020-12-03|  null|David|
+---+----------+------+-----+
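
To see what the semi-join and the filter each do, the same logic can be traced in plain Python without Spark. This is only an illustrative sketch: the tuples mirror the sample rows from the question, `None` stands for an empty change value, and the helper name `semi_join_then_filter` is made up for this example and is not part of PySpark.

```python
# Plain-Python sketch of the semi-join + filter; no Spark required.
# Each row is (id, create, change, name); None stands for an empty change.
df1_rows = [
    (1, "2020-12-01", None, "Paul"),
    (2, "2020-12-02", None, "Mary"),
    (3, "2020-12-03", None, "David"),
    (4, "2020-12-04", None, "Marley"),
]
df2_rows = [
    (1, "2020-12-01", "2020-12-30", "Paul"),
    (2, "2020-12-02", None, "Mary"),
    (3, "2020-12-03", None, "David"),
    (4, "2020-12-04", "2020-12-30", "Marley"),
    (5, "2020-12-30", None, "Ted"),
]


def semi_join_then_filter(left, right, excluded_change="2020-12-30"):
    """Hypothetical helper mirroring the join and filter steps."""
    # Semi-join: keep left rows whose (id, create, name) key appears in right.
    # Only columns from the left side are returned, as in a Spark semi-join.
    right_keys = {(r[0], r[1], r[3]) for r in right}
    matched = [row for row in left if (row[0], row[1], row[3]) in right_keys]
    # Filter: keep rows whose change is empty or differs from the excluded date.
    return [
        row for row in matched
        if row[2] is None or row[2] != excluded_change
    ]


df3_rows = semi_join_then_filter(df2_rows, df1_rows)
for row in df3_rows:
    print(row)
```

In this sketch the semi-join drops Ted (id 5 is not in df1), and the filter then drops Paul and Marley, whose change is 2020-12-30. The explicit `change is null` test in the Spark filter matters because comparing a null value with `!=` does not evaluate to true in SQL, so rows with an empty change would otherwise be filtered out.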