
Suppose I have a dataframe like below:

user       email                  day_diff  

tom        [email protected]         -10
tom        [email protected]        -2
tom        [email protected]            3
bob        [email protected]        -11
bob        [email protected]         1
bob        [email protected]          2
alice      [email protected]          4
Mary       [email protected]          -5

For each user, I want to take every email where day_diff is positive, plus the one record where day_diff is negative and closest to 0, and compare those values. If any of them differ, a new column should hold 'yes'; if they are all the same, it should hold 'no'.

So for tom I would take the email where day_diff is 3, [email protected], since it's the only positive day_diff, and compare it to [email protected]. Since they are different, the new column would be 'yes' for every one of tom's rows.

For bob I would take the emails where day_diff is 1 and 2 and compare them to the email at -11. Since the emails at 2 and -11 are different, the new column value would be 'yes'.

If a user has only one row and its day_diff is positive, the new column value is 'yes'. If a user only has emails where day_diff is negative, the new column value is 'no'.

Any help would be appreciated. I've been spinning in circles trying to figure this out.

The output would look like:

user       email                  day_diff    email_change

tom        [email protected]         -10        yes
tom        [email protected]        -2         yes
tom        [email protected]            3         yes
bob        [email protected]        -11        yes
bob        [email protected]         1         yes
bob        [email protected]          2         yes
alice      [email protected]          4         yes
Mary       [email protected]          -5         no
  • You should show the desired output and all code. Commented Apr 17, 2020 at 20:19

1 Answer

Here is what I suggest:

import pandas as pd
import numpy as np

df = pd.DataFrame({"user": ["tom", "tom", "tom", "bob", "bob", "bob", "alice", "mary"],
                   "email": ["[email protected]", "[email protected]", "[email protected]", "[email protected]",
                             "[email protected]", " [email protected]", "[email protected]", "[email protected]"],
                   "day_dif": [-10, -2, 3, -11, 1, 2, 4, -5]})

# Users with a single row: the sign of day_dif decides the result directly
df["dup"] = df["user"].duplicated(keep=False)
# default=None leaves multi-row users unset (np.NaN no longer exists in NumPy 2.0)
df["output"] = np.select([(~df["dup"]) & (df["day_dif"] > 0),
                          (~df["dup"]) & (df["day_dif"] < 0)],
    ["yes", "no"], default=None)

# Treat duplicates
temp = df.loc[df["dup"], :]
temp = temp.copy()
temp["neg"] = np.where(temp["day_dif"] < 0, temp["day_dif"], np.nan)
idx = temp.groupby("user")["neg"].nlargest(1).reset_index().level_1
# Create grouping variable that will help us make comparison
temp["pos"] = np.where(temp.index.isin(idx), 1,(temp["day_dif"] > 0) * 1)

groups = (temp.groupby(['user', "pos"])["email"].apply(list).reset_index()
              .sort_values(["user", "pos"]))
# compare all email in list by user and group pos
groups["output"] = groups["email"].apply(lambda x: all(w == x[0] for w in x))
# put on same line value for pos = 0 and pos = 1 for each user
groups["temp"] = groups["output"].shift(periods=-1)

# Apply your rules
groups["output"] = np.select([(groups.pos == 1) & (groups["output"] == False),
                              (groups.pos == 0) & (groups["temp"] == False)],
    ["yes", "yes"], default="no")
# reunite duplicates and non duplicates in one dataframe
new_df = pd.merge(df.loc[:, ["user", "email", "day_dif", "output"]],
                  groups[["user", "email", "output"]].explode(column="email"), 
                  on=["user", "email"], how="outer")
new_df["output"] = np.where(new_df["output_y"].isnull(), 
                            new_df["output_x"], new_df["output_y"])
new_df = new_df.drop(columns=["output_x", "output_y"]).drop_duplicates()

And the output is:

   user             email  day_dif output
0    tom   [email protected]      -10    yes
1    tom  [email protected]       -2    yes
2    tom     [email protected]        3    yes
3    bob  [email protected]      -11    yes
5    bob  [email protected]        1    yes
7    bob   [email protected]        2    yes
8  alice   [email protected]        4    yes
9   mary    [email protected]       -5     no
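
A more compact variant is sketched below, as an illustration rather than a drop-in replacement. It uses the question's column name day_diff and decides the flag once per user. The addresses are placeholders (the real ones are not visible in the post), email_changed is a hypothetical helper, the extra user carl is made up to show a 'no' case with both signs, and rows with day_diff equal to 0 are ignored. One assumption differs from the code above: any user with a positive row but no negative row is flagged 'yes', which generalizes the single-row rule from the question.

```python
import pandas as pd

# Placeholder data: the addresses are invented for illustration only.
df = pd.DataFrame({
    "user": ["tom", "tom", "tom", "bob", "bob", "bob",
             "alice", "mary", "carl", "carl"],
    "email": ["tom@a.example", "tom@b.example", "tom@c.example",
              "bob@a.example", "bob@a.example", "bob@b.example",
              "alice@a.example", "mary@a.example",
              "carl@a.example", "carl@a.example"],
    "day_diff": [-10, -2, 3, -11, 1, 2, 4, -5, -3, 2],
})


def email_changed(group: pd.DataFrame) -> str:
    """Return 'yes' or 'no' for one user's rows (hypothetical helper)."""
    positive_emails = group.loc[group["day_diff"] > 0, "email"]
    negatives = group.loc[group["day_diff"] < 0]
    if positive_emails.empty:
        # Only negative rows: nothing to compare, so 'no'.
        return "no"
    if negatives.empty:
        # Assumption: positive rows without a negative baseline count as 'yes'.
        return "yes"
    # Negative row closest to 0 is the one with the largest day_diff.
    # Assumes the index is unique, so .loc returns a single value.
    baseline = negatives.loc[negatives["day_diff"].idxmax(), "email"]
    return "yes" if (positive_emails != baseline).any() else "no"


# Compute one flag per user, then broadcast it to every row of that user.
flags = {user: email_changed(group) for user, group in df.groupby("user")}
df["email_change"] = df["user"].map(flags)
print(df)
```

Because the flag is computed per user and mapped back, no merge or explode step is needed, which also keeps the original row order and index.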

2 Comments

I'm good until I get to the merge statement with explode. When I run it on my dataset I get the following error:
AttributeError: 'DataFrame' object has no attribute 'explode'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-3326578153059991681.py", line 333, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-3326578153059991681.py", line 331, in <module>
    exec(code)

Perhaps I missed something about your actual data. Your groups DataFrame should have an email column in which each row holds a list of emails, and that column is what gets exploded. The user and output columns should be plain str columns. Is that what you have?
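
One possible cause, offered as a guess since the environment is not shown: DataFrame.explode was added in pandas 0.25, so an older pandas raises exactly this AttributeError. If upgrading is not an option, the list column can be expanded by hand. The sketch below assumes groups is a pandas DataFrame whose email column holds Python lists; explode_list_column is a hypothetical helper and the sample frame is made up. If groups is actually a Spark DataFrame (the traceback mentions zeppelin_pyspark), this does not apply.

```python
import pandas as pd


def explode_list_column(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Emit one output row per list element in `column` (hypothetical helper).

    Unlike DataFrame.explode, the result gets a fresh 0..n-1 index.
    """
    rows = []
    for _, row in frame.iterrows():
        other_values = row.drop(labels=[column]).to_dict()
        for item in row[column]:
            # Copy the non-list columns and attach a single list element.
            new_row = dict(other_values)
            new_row[column] = item
            rows.append(new_row)
    return pd.DataFrame(rows, columns=frame.columns)


# Made-up stand-in for the `groups` frame built in the answer.
groups_demo = pd.DataFrame({
    "user": ["bob", "tom"],
    "email": [["bob@a.example", "bob@b.example"], ["tom@c.example"]],
    "output": ["yes", "no"],
})
flat = explode_list_column(groups_demo, "email")
print(flat)
```

The result could then be passed to pd.merge in place of groups[["user", "email", "output"]].explode(column="email").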
