How to use multiple conditions in different columns to update the new rows values in python?

Question

This is the current dataframe:

    import pandas as pd

    id = ['793601486525702000','793601486525702000','793601710614802000','793601355214561000','793601355214561000','793601355214561000','793601355214561000','788130215436230000','788130215436230000','788130215436230000','788130215436230000','788130215436230000']
    time = ['11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:52','11/1/2016 16:55','11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:51','11/1/2016 3:09','11/1/2016 3:04','11/1/2016 2:36','11/1/2016 2:08','11/1/2016 0:28'] 
    rank = ['2','1','1','4','3','2','1','5','4','3','2','1'] 
    flag =['c_reply','c_start','u_start','u_reply','c_reply','c_reply','u_start','c_reply','c_reply','u_reply','u_reply','u_start']
    df = pd.DataFrame({"id": id, "time": time, "rank": rank, "flag": flag})

            id                time        rank     flag
                              .
                              .
    793601486525702000  11/1/2016 16:53    2      c_reply
    793601486525702000  11/1/2016 16:53    1      c_start
    793601710614802000  11/1/2016 16:52    1      u_start
    793601355214561000  11/1/2016 16:55    4      u_reply
    793601355214561000  11/1/2016 16:53    3      c_reply
    793601355214561000  11/1/2016 16:53    2      c_reply
    793601355214561000  11/1/2016 16:51    1      u_start
    788130215436230000  11/1/2016 3:09     5      c_reply
    788130215436230000  11/1/2016 3:04     4      c_reply
    788130215436230000  11/1/2016 2:36     3      u_reply
    788130215436230000  11/1/2016 2:08     2      u_reply
    788130215436230000  11/1/2016 0:28     1      u_start
                              .
                              .

My dataset has thousands of rows.
Column 'id': one id may have multiple rows/records. Rows with the same id belong to the same group.
Column 'rank': gives the chronological order of the rows within each id group.

I would like to use a loop or function to create two new columns, 'reply' and 'reply_time', based on the columns 'id', 'rank', 'time', and 'flag' in my dataframe.
Step 1: Select the rows in the same id group (group by the id column).
Step 2: Update the 'reply' column value. The conditions I would like to set are as follows:

value '0' : rank = '1' and flag = 'u_start' and the group has no 'c_reply' in the flag column
value '1' : rank = '1' and flag = 'u_start' and the group has a 'c_reply' in the flag column
value '2' : the first/earliest c_reply in the flag column (if there are multiple c_reply rows, the one with the smallest value in the rank column)
value '3' : every row that meets none of the above conditions, including:
  (1) rank = '1' and flag = 'c_start'
  (2) rank >= '2' and flag = 'u_reply'
  (3) rank >= '2' and flag = 'c_reply' and not the first c_reply in the flag column
  (4) rank >= '2' and flag = 'c_reply' and no 'u_start' in the flag column

Step 3: Update the 'reply_time' column value. The conditions I would like to set are as follows:
value 'time': rank = '1' and flag = 'u_start' and the group has a 'c_reply' in the flag column; use the time of the first/earliest 'c_reply'.
value 'na': every row that does not meet the above condition.

The target output would look something like this:

            id                 time       rank      flag   reply   reply_time
    793601486525702000  11/1/2016 16:53     2     c_reply    3      na
    793601486525702000  11/1/2016 16:53     1     c_start    3      na
    793601710614802000  11/1/2016 16:52     1     u_start    0      na
    793601355214561000  11/1/2016 16:55     4     u_reply    3      na
    793601355214561000  11/1/2016 16:53     3     c_reply    3      na
    793601355214561000  11/1/2016 16:53     2     c_reply    2      na
    793601355214561000  11/1/2016 16:51     1     u_start    1      11/1/2016 16:53
    788130215436230000  11/1/2016 3:09      5     c_reply    3      na
    788130215436230000  11/1/2016 3:04      4     c_reply    2      na
    788130215436230000  11/1/2016 2:36      3     u_reply    3      na
    788130215436230000  11/1/2016 2:08      2     u_reply    3      na
    788130215436230000  11/1/2016 0:28      1     u_start    1      11/1/2016 3:04

It seems like a simple question, but I couldn't find an answer anywhere.
For now I am doing the coding manually in Excel, but I think there should be a faster way to do this in Python.
Any help is much appreciated. Thanks a lot!

This seems to be doable with np.select() and groupby(). I'll try to reply in a while if I find some time.
– Juan C, Commented Aug 5, 2019 at 19:33

Juan C · Accepted Answer · 2019-08-06 13:58:32Z

This took a bit longer than expected. I don't have enough time for your second question (you should ask only one question per post on SO anyway), so I'll help you through step 2:

    import pandas as pd
    import numpy as np

    id = ['793601486525702000','793601486525702000','793601710614802000','793601355214561000','793601355214561000','793601355214561000','793601355214561000','788130215436230000','788130215436230000','788130215436230000','788130215436230000','788130215436230000']
    time = ['11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:52','11/1/2016 16:55','11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:51','11/1/2016 3:09','11/1/2016 3:04','11/1/2016 2:36','11/1/2016 2:08','11/1/2016 0:28']
    rank = ['2','1','1','4','3','2','1','5','4','3','2','1']
    flag = ['c_reply','c_start','u_start','u_reply','c_reply','c_reply','u_start','c_reply','c_reply','u_reply','u_reply','u_start']
    df = pd.DataFrame({"id": id, "time": time, "rank": rank, "flag": flag})

Let's start with the hardest condition:

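    has_u_start = df.groupby('id').flag.transform(lambda x: x.eq('u_start').any()).astype(bool)
    c_replies = df[(df.flag == 'c_reply') & has_u_start]  # case (4): skip groups without u_start
    # Compare ranks as integers (as strings, '10' < '2'), then convert back for the merge.
    ids_c3 = c_replies['rank'].astype(int).groupby(c_replies['id']).min().astype(str).reset_index()
    ids_c3['reply'] = 2
    df = df.merge(ids_c3, on=['id', 'rank'], how='left')

First, we kept only the c_reply rows of ids that also contain a u_start (case 4 of value 3 rules out the rest) and took the minimum rank per id, comparing ranks as integers because the column holds strings. Then we marked those rows with 2 and merged the result back into the original dataframe to create the reply column. Now we're missing numbers 0, 1 and 3.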
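A side note on the rank column: it holds strings, so min() on it compares lexicographically rather than numerically. The sketch below uses a hypothetical toy group (not taken from the question's data) to show the difference, which only matters once an id has ten or more rows.

```python
import pandas as pd

# Hypothetical toy group; ranks are stored as strings, as in the question.
toy = pd.DataFrame({
    "id": ["a"] * 3,
    "rank": ["2", "10", "3"],
    "flag": ["c_reply", "c_reply", "c_reply"],
})

# String comparison is lexicographic, so '10' sorts before '2'.
min_as_str = toy.groupby("id")["rank"].min()

# Converting to integers first gives the chronological minimum.
min_as_int = toy["rank"].astype(int).groupby(toy["id"]).min()

print(min_as_str["a"])  # 10 (the string '10')
print(min_as_int["a"])  # 2
```

If the real data never has more than nine rows per id, both versions give the same result.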

For numbers 1 and 0:

    df['is_c_reply'] = df.groupby('id').flag.transform(lambda x: x.eq('c_reply').any())
    c1 = (df['rank'] == '1') & (df.flag == 'u_start') & (df.is_c_reply == 0)
    c2 = (df['rank'] == '1') & (df.flag == 'u_start') & (df.is_c_reply == 1)
    df['reply'] = np.select([c1, c2], [0, 1], default=df.reply)

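We wrote the conditions you specified: c1 for 0 and c2 for 1, where is_c_reply marks every row whose id group contains a c_reply. Then we used np.select() to fill the reply column, keeping the existing value wherever neither condition holds.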
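To make the two building blocks concrete, here is a minimal sketch on a hypothetical three-row frame (the toy data and the -1 default are illustration only). It shows how transform() repeats a per-group result on every row of the group, and how np.select() assigns the first matching choice and falls back to the default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 'a' has a c_reply, group 'b' does not.
toy = pd.DataFrame({
    "id":   ["a", "a", "b"],
    "rank": ["1", "2", "1"],
    "flag": ["u_start", "c_reply", "u_start"],
})

# transform() broadcasts the per-group answer back to every row of the group.
has_c_reply = (
    toy.groupby("id")["flag"]
    .transform(lambda x: x.eq("c_reply").any())
    .astype(bool)
)

is_start = (toy["rank"] == "1") & (toy["flag"] == "u_start")

# np.select() checks the conditions in order; rows matching none get the default.
reply = np.select(
    [is_start & ~has_c_reply, is_start & has_c_reply],
    [0, 1],
    default=-1,
)
print(reply.tolist())  # [1, -1, 0]
```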

Now we're only missing 3. As stated, everything else is a 3, so you just use fillna():

    df['reply'] = df['reply'].fillna(3)

We're done!

There may well be faster ways to do this, though.
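For reference, here is one possible variant that avoids the merge and also fills in reply_time from step 3 of the question, which the walkthrough above does not cover. It is only a sketch under some assumptions: the function name add_reply_columns is made up, ranks are taken to be unique within an id, a c_reply only counts as value 2 when its group contains a u_start (case 4 of value 3), and missing reply times are written as the string 'na' as in the target output.

```python
import numpy as np
import pandas as pd


def add_reply_columns(df):
    """Return a copy of df with 'reply' and 'reply_time' columns added."""
    out = df.copy()
    rank = out["rank"].astype(int)  # compare ranks numerically, not as strings

    by_id = out.groupby("id")["flag"]
    has_c_reply = by_id.transform(lambda x: x.eq("c_reply").any()).astype(bool)
    has_u_start = by_id.transform(lambda x: x.eq("u_start").any()).astype(bool)

    # Lowest rank among the c_reply rows of each id (NaN where there are none).
    is_c_reply = out["flag"] == "c_reply"
    first_c_rank = rank.where(is_c_reply).groupby(out["id"]).transform("min")
    is_first_c = is_c_reply & (rank == first_c_rank)

    is_start = (rank == 1) & (out["flag"] == "u_start")
    out["reply"] = np.select(
        [is_start & ~has_c_reply, is_start & has_c_reply, is_first_c & has_u_start],
        [0, 1, 2],
        default=3,
    )

    # Time of the first c_reply per id, kept only on the rows with reply == 1.
    first_c_time = out["time"].where(is_first_c).groupby(out["id"]).transform("first")
    out["reply_time"] = first_c_time.where(out["reply"] == 1, "na")
    return out


# The question's sample data, written compactly.
ids = (["793601486525702000"] * 2 + ["793601710614802000"]
       + ["793601355214561000"] * 4 + ["788130215436230000"] * 5)
times = ["11/1/2016 16:53", "11/1/2016 16:53", "11/1/2016 16:52",
         "11/1/2016 16:55", "11/1/2016 16:53", "11/1/2016 16:53",
         "11/1/2016 16:51", "11/1/2016 3:09", "11/1/2016 3:04",
         "11/1/2016 2:36", "11/1/2016 2:08", "11/1/2016 0:28"]
ranks = ["2", "1", "1", "4", "3", "2", "1", "5", "4", "3", "2", "1"]
flags = ["c_reply", "c_start", "u_start", "u_reply", "c_reply", "c_reply",
         "u_start", "c_reply", "c_reply", "u_reply", "u_reply", "u_start"]
sample = pd.DataFrame({"id": ids, "time": times, "rank": ranks, "flag": flags})

result = add_reply_columns(sample)
print(result[["rank", "flag", "reply", "reply_time"]])
```

On the sample data this reproduces the reply and reply_time columns of the target output in the question. Whether it is actually faster than the merge-based steps would need to be measured on the full dataset.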
