This is my current dataframe:
import pandas as pd
id = ['793601486525702000','793601486525702000','793601710614802000','793601355214561000','793601355214561000','793601355214561000','793601355214561000','788130215436230000','788130215436230000','788130215436230000','788130215436230000','788130215436230000']
time = ['11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:52','11/1/2016 16:55','11/1/2016 16:53','11/1/2016 16:53','11/1/2016 16:51','11/1/2016 3:09','11/1/2016 3:04','11/1/2016 2:36','11/1/2016 2:08','11/1/2016 0:28']
rank = ['2','1','1','4','3','2','1','5','4','3','2','1']
flag = ['c_reply','c_start','u_start','u_reply','c_reply','c_reply','u_start','c_reply','c_reply','u_reply','u_reply','u_start']
df = pd.DataFrame({"id": id, "time": time, "rank": rank, "flag": flag})
id time rank flag
.
.
793601486525702000 11/1/2016 16:53 2 c_reply
793601486525702000 11/1/2016 16:53 1 c_start
793601710614802000 11/1/2016 16:52 1 u_start
793601355214561000 11/1/2016 16:55 4 u_reply
793601355214561000 11/1/2016 16:53 3 c_reply
793601355214561000 11/1/2016 16:53 2 c_reply
793601355214561000 11/1/2016 16:51 1 u_start
788130215436230000 11/1/2016 3:09 5 c_reply
788130215436230000 11/1/2016 3:04 4 c_reply
788130215436230000 11/1/2016 2:36 3 u_reply
788130215436230000 11/1/2016 2:08 2 u_reply
788130215436230000 11/1/2016 0:28 1 u_start
.
.
My dataset has thousands of rows.
The column 'id': one id may have multiple rows/records. Rows with the same id belong to the same group.
The column 'rank' gives the chronological order of the rows within each id group (rank 1 is the earliest).
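Note that 'rank' and 'time' are built as strings above, so a comparison such as rank >= '2' would be lexicographic rather than numeric. A minimal sketch of converting both columns and checking that rank follows time within each id, assuming the M/D/YYYY H:MM format shown in the sample (the names `sample` and `is_chronological` are only for illustration):

```python
import pandas as pd

# Hypothetical four-row sample taken from one id group of the dataframe above.
sample = pd.DataFrame({
    "id": ["793601355214561000"] * 4,
    "time": ["11/1/2016 16:55", "11/1/2016 16:53",
             "11/1/2016 16:53", "11/1/2016 16:51"],
    "rank": ["4", "3", "2", "1"],
})

# Convert the string columns to proper numeric / datetime types.
sample["rank"] = sample["rank"].astype(int)
sample["time"] = pd.to_datetime(sample["time"], format="%m/%d/%Y %H:%M")

# Within each id, sort by rank and check that time never goes backwards
# (ties are allowed, since two rows can share the same minute).
is_chronological = (
    sample.sort_values(["id", "rank"])
          .groupby("id")["time"]
          .apply(lambda s: s.is_monotonic_increasing)
)
print(is_chronological)
```

If any group comes back False, the rank column does not match the timestamps for that id and the rules below would need a tie-breaking decision.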
I would like to use a loop or a function to create two new columns, 'reply' and 'reply_time', based on four existing columns in my dataframe: 'id', 'rank', 'time', and 'flag'.
Step 1: Select the rows in the same id group (group by the 'id' column).
Step 2: Set the 'reply' column value. The conditions are as follows:
value '0' : rank = '1' and flag = 'u_start', and there is no 'c_reply' in the group's flag column
value '1' : rank = '1' and flag = 'u_start', and there is a 'c_reply' in the group's flag column
value '2' : the first/earliest 'c_reply' in the group's flag column (if there are multiple 'c_reply' rows, the one with the smallest value in the rank column)
value '3' : every row that meets none of the conditions above, including:
(1) rank = '1' and flag = 'c_start'
(2) rank >= '2' and flag = 'u_reply'
(3) rank >= '2' and flag = 'c_reply', but not the first 'c_reply' in the group's flag column
(4) rank >= '2' and flag = 'c_reply', and there is no 'u_start' in the group's flag column
Step 3: Set the 'reply_time' column value. The conditions are as follows:
value 'time': rank = '1' and flag = 'u_start', and there is a 'c_reply' in the group's flag column; use the time of the first/earliest 'c_reply'.
value 'na': every row that does not meet the condition above.
The target output would look something like this:
id time rank flag reply reply_time
793601486525702000 11/1/2016 16:53 2 c_reply 3 na
793601486525702000 11/1/2016 16:53 1 c_start 3 na
793601710614802000 11/1/2016 16:52 1 u_start 0 na
793601355214561000 11/1/2016 16:55 4 u_reply 3 na
793601355214561000 11/1/2016 16:53 3 c_reply 3 na
793601355214561000 11/1/2016 16:53 2 c_reply 2 na
793601355214561000 11/1/2016 16:51 1 u_start 1 11/1/2016 16:53
788130215436230000 11/1/2016 3:09 5 c_reply 3 na
788130215436230000 11/1/2016 3:04 4 c_reply 2 na
788130215436230000 11/1/2016 2:36 3 u_reply 3 na
788130215436230000 11/1/2016 2:08 2 u_reply 3 na
788130215436230000 11/1/2016 0:28 1 u_start 1 11/1/2016 3:04
It seems like a simple question, but I couldn't find an answer anywhere.
I am currently doing this coding manually in Excel, but I think there should be a faster way to do it in Python.
Any help is much appreciated. Thanks a lot!
np.select() and groupby() should help here. I'll try to reply in a while if I find some time.
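For what it's worth, here is a rough sketch of how np.select() and groupby() might be combined for this, not a definitive answer. It assumes rank is unique within each id and that a 'c_reply' only counts as "first" when the group also has a rank-1 'u_start' (per condition (4) in the question). The helper names (`is_u_start`, `has_c_reply`, `first_c_rank`, and so on) are made up for illustration.

```python
import numpy as np
import pandas as pd

ids = (["793601486525702000"] * 2 + ["793601710614802000"]
       + ["793601355214561000"] * 4 + ["788130215436230000"] * 5)
times = ["11/1/2016 16:53", "11/1/2016 16:53", "11/1/2016 16:52",
         "11/1/2016 16:55", "11/1/2016 16:53", "11/1/2016 16:53",
         "11/1/2016 16:51", "11/1/2016 3:09", "11/1/2016 3:04",
         "11/1/2016 2:36", "11/1/2016 2:08", "11/1/2016 0:28"]
ranks = ["2", "1", "1", "4", "3", "2", "1", "5", "4", "3", "2", "1"]
flags = ["c_reply", "c_start", "u_start", "u_reply", "c_reply", "c_reply",
         "u_start", "c_reply", "c_reply", "u_reply", "u_reply", "u_start"]
df = pd.DataFrame({"id": ids, "time": times, "rank": ranks, "flag": flags})

# Row-level conditions; rank is compared as an integer, not as a string.
rank = df["rank"].astype(int)
is_c_reply = df["flag"].eq("c_reply")
is_u_start = df["flag"].eq("u_start") & rank.eq(1)

# Group-level facts, broadcast back to every row of the same id.
has_c_reply = is_c_reply.groupby(df["id"]).transform("any")
has_u_start = is_u_start.groupby(df["id"]).transform("any")

# Smallest rank among the c_reply rows of each id (NaN if the id has none).
first_c_rank = rank.where(is_c_reply).groupby(df["id"]).transform("min")
is_first_c_reply = is_c_reply & rank.eq(first_c_rank) & has_u_start

# Conditions are checked in order; anything left over falls into '3'.
df["reply"] = np.select(
    [is_u_start & ~has_c_reply, is_u_start & has_c_reply, is_first_c_reply],
    ["0", "1", "2"],
    default="3",
)

# Look up the time of the first c_reply per id and keep it only on the
# u_start row of groups that have a c_reply; every other row gets 'na'.
first_c_time = df.loc[is_first_c_reply].set_index("id")["time"]
df["reply_time"] = df["id"].map(first_c_time).where(is_u_start & has_c_reply, "na")

print(df)
```

On the sample data this should reproduce the target table from the question. If rank can repeat within an id, the `first_c_time` lookup would need deduplicating first, because `map` expects a unique index.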