Comparing multiple rows in dataframe to single row by column

Question

Suppose I have a dataframe like below:

user       email                  day_diff  

tom        [email protected]         -10
tom        [email protected]        -2
tom        [email protected]            3
bob        [email protected]        -11
bob        [email protected]         1
bob        [email protected]          2
alice      [email protected]          4
Mary       [email protected]          -5

What I am looking to do is, for each user, take every email where day_diff is positive and the first record where day_diff is negative but closest to 0. Then compare those values: if any of them are different, the value in a new column would be 'yes', and if they are all the same, the value would be 'no'.

So for tom I would take the email where day_diff is 3, [email protected], since it's the only positive day_diff, and compare it to [email protected] (the email at day_diff -2, the negative value closest to 0). Since it is different, the new column for every row for tom would be 'yes'.

For bob I would take the emails where day_diff is 1 and 2 and compare them to the email at -11. Since the emails at 2 and -11 are different, the new column value would be 'yes'.

If a user only has one row and the day_diff is positive, the new column value is 'yes'. If the user only has emails where day_diff is negative, the new column value is 'no'.

Any help would be appreciated. I've been spinning in circles trying to figure this out.

The output would look like:

user       email                  day_diff    email_change

tom        [email protected]         -10        yes
tom        [email protected]        -2         yes
tom        [email protected]            3         yes
bob        [email protected]        -11        yes
bob        [email protected]         1         yes
bob        [email protected]          2         yes
alice      [email protected]          4         yes
Mary       [email protected]          -5         no

You should show the desired output and all code.

– David Smolinski, Commented Apr 17, 2020 at 20:19

Raphaele Adjerad · Accepted Answer · 2020-04-18 08:42:45Z

Here is what I suggest:

import pandas as pd
import numpy as np

df = pd.DataFrame({"user": ["tom", "tom", "tom", "bob", "bob", "bob", "alice", "mary"],
                   "email": ["[email protected]", "[email protected]", "[email protected]", "[email protected]",
                             "[email protected]", " [email protected]", "[email protected]", "[email protected]"],
                   "day_dif": [-10, -2, 3, -11, 1, 2, 4, -5]})

# Treat case where no duplicates
df["dup"] = df["user"].duplicated(keep=False)
df["output"] = np.select([(df["dup"] == False) & (df["day_dif"] > 0),
                          (df["dup"] == False) & (df["day_dif"] < 0)],
                         ["yes", "no"],
                         # None instead of NaN: recent NumPy versions refuse to
                         # mix string choices with a float default
                         default=None)

# Treat duplicates
temp = df.loc[df["dup"], :]
temp = temp.copy()
temp["neg"] = np.where(temp["day_dif"] < 0, temp["day_dif"], np.nan)
idx = temp.groupby("user")["neg"].nlargest(1).reset_index().level_1
# Create grouping variable that will help us make comparison
temp["pos"] = np.where(temp.index.isin(idx), 1,(temp["day_dif"] > 0) * 1)

groups = (temp.groupby(['user', "pos"])["email"].apply(list).reset_index()
              .sort_values(["user", "pos"]))
# compare all email in list by user and group pos
groups["output"] = groups["email"].apply(lambda x: all(w == x[0] for w in x))
# put on same line value for pos = 0 and pos = 1 for each user
groups["temp"] = groups["output"].shift(periods=-1)

# Apply your rules
groups["output"] = np.select([(groups.pos == 1) & (groups["output"] == False),
                              (groups.pos == 0) & (groups["temp"] == False)],
    ["yes", "yes"], default="no")
# reunite duplicates and non duplicates in one dataframe
new_df = pd.merge(df.loc[:, ["user", "email", "day_dif", "output"]],
                  groups[["user", "email", "output"]].explode(column="email"), 
                  on=["user", "email"], how="outer")
new_df["output"] = np.where(new_df["output_y"].isnull(), 
                            new_df["output_x"], new_df["output_y"])
new_df = new_df.drop(columns=["output_x", "output_y"]).drop_duplicates()

And the output is:

   user             email  day_dif output
0    tom   [email protected]      -10    yes
1    tom  [email protected]       -2    yes
2    tom     [email protected]        3    yes
3    bob  [email protected]      -11    yes
5    bob  [email protected]        1    yes
7    bob   [email protected]        2    yes
8  alice   [email protected]        4    yes
9   mary    [email protected]       -5     no
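
For comparison, here is a shorter sketch of the rules from the question, written per user rather than with the merge approach above. It is not the code from this answer, and it rests on a few assumptions: the email addresses are hypothetical placeholders (the originals are not visible here), the column is named day_diff as in the question, rows with day_diff equal to 0 are ignored, and a user with positive rows but no negative row is flagged 'yes', which generalizes the single-row rule. The helper name email_change is made up for this example.

```python
import pandas as pd


def email_change(group: pd.DataFrame) -> str:
    """Return 'yes' or 'no' for the rows of a single user."""
    pos = group.loc[group["day_diff"] > 0, "email"]
    neg = group.loc[group["day_diff"] < 0]
    if pos.empty:
        # Only negative (or zero) day_diff rows: nothing to compare.
        return "no"
    if neg.empty:
        # Assumption: positive rows with no negative baseline count as a change.
        return "yes"
    # Email of the negative row closest to 0 (largest negative day_diff).
    baseline = neg["email"].iloc[neg["day_diff"].to_numpy().argmax()]
    return "yes" if (pos != baseline).any() else "no"


# Hypothetical placeholder data shaped like the table in the question.
df = pd.DataFrame({
    "user": ["tom", "tom", "tom", "bob", "bob", "bob", "alice", "mary"],
    "email": ["tom@old.example", "tom@work.example", "tom@home.example",
              "bob@work.example", "bob@work.example", "bob@home.example",
              "alice@home.example", "mary@home.example"],
    "day_diff": [-10, -2, 3, -11, 1, 2, 4, -5],
})

# One flag per user, then broadcast it back to every row of that user.
flags = {user: email_change(g) for user, g in df.groupby("user")}
df["email_change"] = df["user"].map(flags)
print(df)
```

Building the flags in a dictionary and mapping them back keeps the original row order and avoids a merge, so no duplicate rows need to be dropped afterwards.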

edited Apr 18, 2020 at 8:42

answered Apr 18, 2020 at 7:21


2 Comments

sjkluend Over a year ago

I'm good until I get to the merge statement with explode. When I run it on my dataset I get the following error:

AttributeError: 'DataFrame' object has no attribute 'explode'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-3326578153059991681.py", line 333, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-3326578153059991681.py", line 331, in <module>
    exec(code)

Raphaele Adjerad Over a year ago

Perhaps I missed something from your actual data. Your groups DataFrame should have an email column where each row holds a list of emails, and this is what you explode. The user and output columns should be plain columns of dtype str. Is that what you have?
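
One possible cause of the AttributeError in the first comment is the pandas version: DataFrame.explode was added in pandas 0.25.0, so older installations do not have it. If upgrading is not an option, the following is a minimal sketch of an equivalent step for the groups frame, assuming it has the user, email (lists) and output columns described above. The sample values and the name long_groups are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the `groups` frame: one list of emails per row.
groups = pd.DataFrame({
    "user": ["tom", "tom"],
    "email": [["a@example.com", "b@example.com"], ["c@example.com"]],
    "output": ["yes", "yes"],
})

# Emit one (user, email, output) row per list element, which is what
# groups.explode("email") would produce apart from the index values.
rows = [
    (user, email, output)
    for user, emails, output in zip(groups["user"], groups["email"], groups["output"])
    for email in emails
]
long_groups = pd.DataFrame(rows, columns=["user", "email", "output"])
print(long_groups)
```

Unlike explode, this gives the result a fresh index instead of repeating the original one. That should not matter for the merge in the answer, which joins on the user and email columns.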
