This is the current dataframe:

    id = ['793601486525702000','793601486525702000','793601710614802000','793601355214561000','793601355214561000','793601355214561000','793601355214561000','788130215436230000','788130215436230000','788130215436230000','788130215436230000','788130215436230000'] 
    time = ['11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:52','11/1/2016 16:55','11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:51','11/1/2016 3:09','11/1/2016 3:04','11/1/2016 2:36','11/1/2016 2:08','11/1/2016 0:28'] 
    rank = ['2','1','1','4','3','2','1','5','4','3','2','1'] 
    flag =['c_reply','c_start','u_start','u_reply','c_reply','c_reply','u_start','c_reply','c_reply','u_reply','u_reply','u_start']
    df = pd.DataFrame({"id": id, "time": time, "rank": rank, "flag": flag})

            id                time        rank     flag
                              .
                              .
    793601486525702000  11/1/2016 16:53    2      c_reply
    793601486525702000  11/1/2016 16:53    1      c_start
    793601710614802000  11/1/2016 16:52    1      u_start
    793601355214561000  11/1/2016 16:55    4      u_reply
    793601355214561000  11/1/2016 16:53    3      c_reply
    793601355214561000  11/1/2016 16:53    2      c_reply
    793601355214561000  11/1/2016 16:51    1      u_start
    788130215436230000  11/1/2016 3:09     5      c_reply
    788130215436230000  11/1/2016 3:04     4      c_reply
    788130215436230000  11/1/2016 2:36     3      u_reply
    788130215436230000  11/1/2016 2:08     2      u_reply
    788130215436230000  11/1/2016 0:28     1      u_start
                              .
                              .

My dataset has thousands of rows.
The column 'id': one id may have multiple rows/records. Rows with the same id belong to the same group.
The column 'rank' gives the chronological order of the rows within each id group.

I would like to use a loop or function to create two new columns, 'reply' and 'reply_time', based on multiple columns in my dataframe: 'id', 'rank', 'time', and 'flag'.
Step 1: Select the rows in the same id group (group by the id column).
Step 2: Update the 'reply' column value. The conditions I would like to set are as follows:

value '0' : rank = '1' and flag = 'u_start' and the group has no 'c_reply' in the flag column
value '1' : rank = '1' and flag = 'u_start' and the group has a 'c_reply' in the flag column
value '2' : the first/earliest c_reply in the flag column (if there are multiple c_reply rows, the one with the smallest value in the rank column)
value '3' : any row that meets none of the above conditions, including:
  (1) rank = '1' and flag = 'c_start', OR
  (2) rank >= '2' and flag = 'u_reply', OR
  (3) rank >= '2' and flag = 'c_reply' and not the first c_reply in the flag column, OR
  (4) rank >= '2' and flag = 'c_reply' and no 'u_start' in the flag column

Step 3: Update the 'reply_time' column value. The conditions I would like to set are as follows:
value 'time': rank = '1' and flag = 'u_start' and the group has a 'c_reply' in the flag column; use the time of the first/earliest 'c_reply'.
value 'na': any row that does not meet the above condition.

The target output would look something like this:

            id                 time       rank      flag   reply   reply_time
    793601486525702000  11/1/2016 16:53     2     c_reply    3      na
    793601486525702000  11/1/2016 16:53     1     c_start    3      na
    793601710614802000  11/1/2016 16:52     1     u_start    0      na
    793601355214561000  11/1/2016 16:55     4     u_reply    3      na
    793601355214561000  11/1/2016 16:53     3     c_reply    3      na
    793601355214561000  11/1/2016 16:53     2     c_reply    2      na
    793601355214561000  11/1/2016 16:51     1     u_start    1      11/1/2016 16:53
    788130215436230000  11/1/2016 3:09      5     c_reply    3      na
    788130215436230000  11/1/2016 3:04      4     c_reply    2      na
    788130215436230000  11/1/2016 2:36      3     u_reply    3      na
    788130215436230000  11/1/2016 2:08      2     u_reply    3      na
    788130215436230000  11/1/2016 0:28      1     u_start    1      11/1/2016 3:04

It seems like a simple question, but I couldn't find an answer anywhere.
I am currently doing the coding manually in Excel, but I think there should be a faster way to do this in Python.
Any help is much appreciated. Thanks a lot!

  • Seems to be doable with np.select() and groupby(). I'll try to reply in a while if I find some time. (Commented Aug 5, 2019 at 19:33)

1 Answer

Took a bit longer than expected. I don't have enough time for your second question (you should ask only one question per post on SO anyway), so I'll only help you up to step 2:

    import pandas as pd
    import numpy as np

    id = ['793601486525702000','793601486525702000','793601710614802000','793601355214561000','793601355214561000','793601355214561000','793601355214561000','788130215436230000','788130215436230000','788130215436230000','788130215436230000','788130215436230000']
    time = ['11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:52','11/1/2016 16:55','11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:51','11/1/2016 3:09','11/1/2016 3:04','11/1/2016 2:36','11/1/2016 2:08','11/1/2016 0:28']
    rank = ['2','1','1','4','3','2','1','5','4','3','2','1']
    flag = ['c_reply','c_start','u_start','u_reply','c_reply','c_reply','u_start','c_reply','c_reply','u_reply','u_reply','u_start']
    df = pd.DataFrame({"id": id, "time": time, "rank": rank, "flag": flag})

Let's start with the hardest condition:

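    # Only ids that contain a 'u_start' row can get a 2 (rule (4) sends the rest to 3)
    has_u_start = df.groupby('id').flag.transform(lambda x: x.eq('u_start').any()).astype(bool)
    ids_c3 = df[(df.flag=='c_reply') & has_u_start].groupby('id', as_index=False)['rank'].min()
    ids_c3['reply'] = 2
    df = df.merge(ids_c3, on=['id','rank'], how='left')

First, we take the c_reply rows of the ids that also contain a u_start row and get the minimum rank for each of those ids. The u_start filter is needed because, per rule (4) and the target output, a c_reply in a group with no u_start belongs to 3. The result is a dataframe with one row per id, which we mark with 2 and then merge with the original dataframe to create the reply column. Now we're missing numbers 0, 1 and 3.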
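One thing worth checking on the real data: in the sample, `rank` is stored as strings, so `min()` compares the values character by character. That is harmless while every rank is a single digit, but in a longer thread '10' would sort before '2'. A small sketch on toy data (the frame `toy` and the column name `rank_num` are made up for illustration):

```python
import pandas as pd

# Toy frame with a two-digit rank stored as a string, as in the question's data.
toy = pd.DataFrame({"id": ["a", "a", "a"], "rank": ["2", "10", "3"]})

# String comparison is lexicographic, so '10' is the "smallest" value.
string_min = toy.groupby("id")["rank"].min()

# Casting to int first gives the chronologically earliest rank.
toy["rank_num"] = toy["rank"].astype(int)
numeric_min = toy.groupby("id")["rank_num"].min()

print(string_min["a"])   # 10
print(numeric_min["a"])  # 2
```

If your ranks can reach 10 or more, converting the column with `astype(int)` before taking the minimum is probably the safer choice.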

For numbers 1 and 0:

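    # True for every row whose id group contains at least one 'c_reply'
    df['is_c_reply'] = df.groupby('id').flag.transform(lambda x: x.eq('c_reply').any())
    # c1: group opener that never got a c_reply; c2: group opener that did
    c1 = (df['rank']=='1') & (df.flag=='u_start') & (df.is_c_reply==0)
    c2 = (df['rank']=='1') & (df.flag=='u_start') & (df.is_c_reply==1)
    df['reply'] = np.select([c1,c2], [0,1], default=df.reply)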

We wrote the conditions you specified: c1 for 0 and c2 for 1. Then we used np.select() to fill the reply column, keeping the existing values (2 or NaN) wherever neither condition holds.
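In case `np.select()` is unfamiliar, here is a minimal sketch of how it picks values, on a toy array unrelated to your data (the names `values`, `conditions` and `choices` are only for illustration):

```python
import numpy as np

values = np.array([1, 5, 10, 20])

# Conditions are checked in order; the first one that is True wins.
conditions = [values < 5, values < 15]
choices = [0, 1]

# Elements that match no condition receive the default.
labels = np.select(conditions, choices, default=3)
print(labels.tolist())  # [0, 1, 1, 3]
```

The first element satisfies both conditions but gets 0 because the first matching condition takes priority.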

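Now we're only missing 3. As stated, everything else is a 3, so you can just use fillna(). The column holds floats at this point because of the NaN values, so we also cast it to int:

    df['reply'] = df['reply'].fillna(3).astype(int)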

We're done!

There are possibly faster ways to do this, though.
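For reference, step 3 could probably be handled along similar lines. The following is only a rough sketch under a few assumptions: the column names match the question, `rank` can be cast to int, and the literal string 'na' is the wanted filler. The names `first_c_time`, `is_opening` and `rank_num` are made up here.

```python
import pandas as pd

ids = ['793601486525702000', '793601486525702000', '793601710614802000',
       '793601355214561000', '793601355214561000', '793601355214561000',
       '793601355214561000', '788130215436230000', '788130215436230000',
       '788130215436230000', '788130215436230000', '788130215436230000']
times = ['11/1/2016 16:53', '11/1/2016 16:53', '11/1/2016 16:52',
         '11/1/2016 16:55', '11/1/2016 16:53', '11/1/2016 16:53',
         '11/1/2016 16:51', '11/1/2016 3:09', '11/1/2016 3:04',
         '11/1/2016 2:36', '11/1/2016 2:08', '11/1/2016 0:28']
ranks = ['2', '1', '1', '4', '3', '2', '1', '5', '4', '3', '2', '1']
flags = ['c_reply', 'c_start', 'u_start', 'u_reply', 'c_reply', 'c_reply',
         'u_start', 'c_reply', 'c_reply', 'u_reply', 'u_reply', 'u_start']
df = pd.DataFrame({"id": ids, "time": times, "rank": ranks, "flag": flags})

# Time of the earliest c_reply per id: sort the c_reply rows by numeric rank
# and keep the first row of each id.
first_c_time = (
    df[df["flag"] == "c_reply"]
    .assign(rank_num=lambda d: d["rank"].astype(int))
    .sort_values("rank_num")
    .drop_duplicates("id")
    .set_index("id")["time"]
)

# Only the opening u_start row of a group receives a reply_time.
is_opening = (df["rank"] == "1") & (df["flag"] == "u_start")

# Look up each id's first c_reply time, blank it outside opening rows,
# and fill everything that is left (including ids without c_reply) with 'na'.
df["reply_time"] = df["id"].map(first_c_time).where(is_opening).fillna("na")

print(df[["id", "rank", "flag", "reply_time"]])
```

On the sample data this fills '11/1/2016 16:53' and '11/1/2016 3:04' on the two opening rows that received a c_reply, and 'na' everywhere else, which agrees with the target output in the question.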
