There are two pandas DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'name': ['ann', 'maxim', 'ann', 'maxim'],
    'surname': ['smith', 'shwarz', 'smith', 'shwarz'],
    'date': ['2020.01.01', '2020.01.01', '2020.03.05', '2020.03.05'],
    'mark_1': [None, 'B', 'A', None],
    'mark_2': [None, 'B', None, 'A'],
    'mark_3': [None, None, 'A', 'C']
})

name   surname  date        mark_1  mark_2  mark_3
ann    smith    2020.01.01  None    None    None
maxim  shwarz   2020.01.01  B       B       None
ann    smith    2020.03.05  A       None    A
maxim  shwarz   2020.03.05  None    A       C

df2 = pd.DataFrame({
    'name': ['ann', 'maxim'],
    'surname': ['smith', 'shwarz'],
    'mark_1': ['Z', 'X'],
    'mark_2': ['H', 'F'],
    'mark_3': ['P', 'Y']
})

name   surname  mark_1  mark_2  mark_3
ann    smith    Z       H       P
maxim  shwarz   X       F       Y

I need:

name   surname  date        mark_1  mark_2  mark_3
ann    smith    2020.01.01  Z       H       P
maxim  shwarz   2020.01.01  B       B       Y
ann    smith    2020.03.05  A       H       A
maxim  shwarz   2020.03.05  X       A       C

But the function df1.isnull(df2) replaces values only in the first rows with matching names and surnames:

name   surname  date        mark_1  mark_2  mark_3
ann    smith    2020.01.01  Z       H       P
maxim  shwarz   2020.01.01  B       B       Y
ann    smith    2020.03.05  A       None    A
maxim  shwarz   2020.03.05  None    A       C

As I understand it, this should work something like a CASE statement in SQL, but I can't find the answer.

Special respect if you can explain the same function for two PySpark DataFrames!

2 Answers

Try with set_index + combine_first:

new_df = (
    df1.set_index(['name', 'surname'])
        .combine_first(df2.set_index(['name', 'surname']))
        .reset_index()
)

new_df:

    name surname        date mark_1 mark_2 mark_3
0    ann   smith  2020.01.01      Z      H      P
1    ann   smith  2020.03.05      A      H      A
2  maxim  shwarz  2020.01.01      B      B      Y
3  maxim  shwarz  2020.03.05      X      A      C

Optional sort_values:

new_df = (
    df1.set_index(['name', 'surname'])
        .combine_first(df2.set_index(['name', 'surname']))
        .reset_index()
        .sort_values('date')
)

new_df:

    name surname        date mark_1 mark_2 mark_3
0    ann   smith  2020.01.01      Z      H      P
2  maxim  shwarz  2020.01.01      B      B      Y
1    ann   smith  2020.03.05      A      H      A
3  maxim  shwarz  2020.03.05      X      A      C
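
If the original row order of df1 matters, one alternative is a left merge followed by a per-column fillna, which is roughly the pandas counterpart of a SQL join plus COALESCE. The following is a minimal sketch rather than part of the approach above; the `_fallback` suffix and the `keys` and `mark_cols` names are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['ann', 'maxim', 'ann', 'maxim'],
    'surname': ['smith', 'shwarz', 'smith', 'shwarz'],
    'date': ['2020.01.01', '2020.01.01', '2020.03.05', '2020.03.05'],
    'mark_1': [None, 'B', 'A', None],
    'mark_2': [None, 'B', None, 'A'],
    'mark_3': [None, None, 'A', 'C'],
})
df2 = pd.DataFrame({
    'name': ['ann', 'maxim'],
    'surname': ['smith', 'shwarz'],
    'mark_1': ['Z', 'X'],
    'mark_2': ['H', 'F'],
    'mark_3': ['P', 'Y'],
})

keys = ['name', 'surname']                 # columns that identify a person
mark_cols = ['mark_1', 'mark_2', 'mark_3']  # columns to fill from df2

# A left merge keeps every df1 row in its original order; df2's marks
# arrive in extra columns carrying the (illustrative) '_fallback' suffix.
merged = df1.merge(df2, on=keys, how='left', suffixes=('', '_fallback'))

# Fill each null mark from the matching fallback column.
for col in mark_cols:
    merged[col] = merged[col].fillna(merged[col + '_fallback'])

# Drop the helper columns by selecting df1's original column layout.
result = merged[list(df1.columns)]
print(result)
```

This assumes each name/surname pair appears at most once in df2; duplicate keys there would multiply the matching rows of df1.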

2 Comments

I tried combine_first without set_index. Was that the problem? Is that why Python behaves like something of a non-greedy search here?
pandas DataFrame alignment is based on the index. Without set_index, combine_first tries to pair row 0 in df1 with row 0 in df2, row 1 with row 1, and so on. Once the index is set, combine_first aligns on name and surname and combines values wherever they match.
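
The alignment behaviour described in this comment can be checked with a small sketch that reuses the question's data. The names `by_position`, `by_key`, `keys`, and `mark_cols` are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['ann', 'maxim', 'ann', 'maxim'],
    'surname': ['smith', 'shwarz', 'smith', 'shwarz'],
    'date': ['2020.01.01', '2020.01.01', '2020.03.05', '2020.03.05'],
    'mark_1': [None, 'B', 'A', None],
    'mark_2': [None, 'B', None, 'A'],
    'mark_3': [None, None, 'A', 'C'],
})
df2 = pd.DataFrame({
    'name': ['ann', 'maxim'],
    'surname': ['smith', 'shwarz'],
    'mark_1': ['Z', 'X'],
    'mark_2': ['H', 'F'],
    'mark_3': ['P', 'Y'],
})

keys = ['name', 'surname']
mark_cols = ['mark_1', 'mark_2', 'mark_3']

# Default RangeIndex: rows are paired by position (0 with 0, 1 with 1).
# df2 has no rows 2 and 3, so the nulls in df1's rows 2 and 3 stay null.
by_position = df1.combine_first(df2)
print(by_position[mark_cols].isna().sum().sum())  # 2 nulls remain

# Index on name and surname: every df1 row finds its df2 counterpart.
by_key = (
    df1.set_index(keys)
       .combine_first(df2.set_index(keys))
       .reset_index()
       .sort_values(['date', 'name'])
       .reset_index(drop=True)
)
print(by_key[mark_cols].isna().sum().sum())  # 0 nulls remain
```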

With Spark, you need to join the DataFrames and use the coalesce function to replace null values:

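import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Reuses the active session if one exists (e.g. in a notebook), otherwise creates one.
spark = SparkSession.builder.getOrCreate()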


df1 = pd.DataFrame({
    'name': ['ann', 'maxim', 'ann', 'maxim'],
    'surname': [ 'smith', 'shwarz','smith', 'shwarz'],
    'date': ['2020.01.01',  '2020.01.01', '2020.03.05','2020.03.05'],
    'mark_1': [None,'B', 'A', None],
    'mark_2': [None,'B', None,'A'],
    'mark_3': [None,None, 'A', 'C']
})
df1 = spark.createDataFrame(df1)

df2 = pd.DataFrame({
    'name': ['ann', 'maxim'],
    'surname': [ 'smith', 'shwarz'],
    'mark_1': ['Z','X'],
    'mark_2': ['H','F'],
    'mark_3': ['P','Y']
})
df2 = spark.createDataFrame(df2)

df3 = df1.alias('l').join(df2.alias('r'), on=['name', 'surname'], how='left')
df3 = (df3
       .select('name', 
               'surname', 
               'date', 
               f.coalesce('l.mark_1', 'r.mark_1').alias('mark_1'), 
               f.coalesce('l.mark_2', 'r.mark_2').alias('mark_2'), 
               f.coalesce('l.mark_3', 'r.mark_3').alias('mark_3')))

(df3
 .sort('date')
 .show(truncate=False))
# +-----+-------+----------+------+------+------+
# |name |surname|date      |mark_1|mark_2|mark_3|
# +-----+-------+----------+------+------+------+
# |ann  |smith  |2020.01.01|Z     |H     |P     |
# |maxim|shwarz |2020.01.01|B     |B     |Y     |
# |ann  |smith  |2020.03.05|A     |H     |A     |
# |maxim|shwarz |2020.03.05|X     |A     |C     |
# +-----+-------+----------+------+------+------+

1 Comment

Thank you so much!!! I thought about coalesce, but couldn't understand how to use it...
