
I have a dataframe with the following structure:

event_timestamp      message_number  an_robot     check
2015-04-15 12:09:39  10125            robot_7     False
2015-04-15 12:09:41  10053            robot_4     True
2015-04-15 12:09:44  10156_ad         robot_7     True
2015-04-15 12:09:47  20205            robot_108   False
2015-04-15 12:09:51  10010            robot_38    True
2015-04-15 12:09:54  10012            robot_65    True
2015-04-15 12:09:59  10011            robot_39    True
2015-04-15 12:10:01  87954            robot_2     False
......etc

The check column indicates whether or not the row should be merged, in this manner:

event timestamp: first
 message number: combine (e.g., 10053,10156)
       an_robot: combine (e.g., robot_4, robot_7)
          check: can be removed after the operation.

So far, I have succeeded in using groupby to get the correct values for the True and False groups in the check column:

df.groupby(by='check').agg({'event_timestamp':'first',
                            'message_number':lambda x: ','.join(x),
                            'an_robot':lambda x: ','.join(x)}).reset_index()

which outputs:

     check    event_timestamp        message_number         an_robot
0    False    2015-04-15 12:09:39    10125,10053,..,87954   robot_7,robot_4, ... etc
1    True     2015-04-15 12:09:51    10010,10012            robot_38,robot_65

However, the end result would ideally be the following: the 10053 and 10156_ad rows are combined, and the 10010, 10012 and 10011 rows are combined. In the full dataframe, the maximum length a sequence can be is 5. I have a separate dataframe with those rules (like the 10010,10012,10011 rule).

event_timestamp      message_number        an_robot
2015-04-15 12:09:39  10125                 robot_7
2015-04-15 12:09:41  10053,10156_ad        robot_4,robot_7
2015-04-15 12:09:47  20205                 robot_108
2015-04-15 12:09:51  10010,10012,10011     robot_38,robot_65,robot_39
2015-04-15 12:10:01  87954                 robot_2

How could I achieve this?

--EDIT--

The dataset with the separate rules looks as follows:

sequence             support
10053,10156,20205    0.94783
10010,10012          0.93322
10010,10033          0.93211
10053,10032          0.92222
etc....
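
For illustration, one way this rules frame could be built and split into the list-of-lists form that the matching code below expects (the names rules and patterns are my own, not from the question):

import pandas as pd

# illustrative reconstruction of the rules dataframe shown above
rules = pd.DataFrame({'sequence': ['10053,10156,20205', '10010,10012',
                                   '10010,10033', '10053,10032'],
                      'support': [0.94783, 0.93322, 0.93211, 0.92222]})

# each rule becomes a list of message numbers, e.g. ['10010', '10012']
patterns = rules['sequence'].str.split(',')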

The code that determines whether a row in check will be True or False:

import numpy as np
import pandas as pd

def find_drops(seq, df):
    if seq:
        # True at the row where the full sequence starts
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            # extend each match forward so every row of the sequence is flagged
            return pd.Series(m, index=df.index).replace({False: np.nan}).ffill(limit=len(seq)-1).fillna(False)
    else:
        return pd.Series(False, index=df.index)

If I then run df['check'] = find_drops(['10010', '10012', '10011'], df), I get a check column with True for those rows. It would be great if it were possible to run this for each rule in the rules dataframe and then merge the rows with the code provided.
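
A minimal sketch of that idea, assuming the rules frame above is loaded as rules: run find_drops once per rule and OR the results into a single check column. Note this only flags the rows; telling adjacent sequences apart from each other still needs a helper group column, which the answers below add.

# sketch only: rules is assumed to hold the sequence/support frame shown above
patterns = rules['sequence'].str.split(',')

# a row is True if it belongs to any rule's sequence
df['check'] = np.logical_or.reduce([find_drops(seq, df) for seq in patterns])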

--new code 4-17-2019--

df = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156_ad|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:48|45689|robot_23
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:58|98765|robot_99
2015-04-15 12:09:59|10011|robot_39
2015-04-15 12:10:01|87954|robot_2"""

df = pd.read_csv(io.StringIO(df), sep='|')

df1 = """sequence|support
10053,10156_ad,20205|0.94783
10010,10012|0.93322
10011,87954|0.92222
"""

df1 = pd.read_csv(io.StringIO(df1), sep='|')
patterns = df1['sequence'].str.split(',')

used_idx = []
c = ['event_timestamp','message_number','an_robot']
def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df.loc[m,  c].assign(g = df.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df.index).replace({False: np.nan})
                                               .ffill(limit=len(seq)-1)
                                               .fillna(False))
            df2 = df.loc[m1,  c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df.index[m]), df2.index, np.nan)
            return df2


out = (pd.concat([find_drops(x) for x in patterns])
        .assign(g = lambda x: x['g'].ffill())
        .groupby(by=['g']).agg({'event_timestamp':'first',
                                 'message_number':','.join, 
                                 'an_robot':','.join})
        .reset_index(drop=True))

c = ['event_timestamp','message_number','an_robot']
df2 = df[~df.index.isin(used_idx)]
df2 = pd.DataFrame([[df2['event_timestamp'].iat[0], 
                    ','.join(df2['message_number']),
                    ','.join(df2['an_robot'])]], columns=c)

fin = pd.concat([out, df2], ignore_index=True)
fin.event_timestamp = pd.to_datetime(fin.event_timestamp)
fin = fin.sort_values('event_timestamp')
fin

output is:

event_timestamp      message_number           an_robot
2015-04-15 12:09:39  10125,45689,98765,12345  robot_7,robot_23,robot_99
2015-04-15 12:09:41  10053,10156_ad,20205     robot_4,robot_7,robot_108
2015-04-15 12:09:51  10010,10012              robot_38,robot_65
2015-04-15 12:09:59  10011,87954              robot_39,robot_2

should be:

event_timestamp      message_number        an_robot
2015-04-15 12:09:39  10125                 robot_7
2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
2015-04-15 12:09:48  45689                 robot_23
2015-04-15 12:09:51  10010,10012           robot_38,robot_65
2015-04-15 12:09:58  98765                 robot_99
2015-04-15 12:09:59  10011,87954           robot_39,robot_2
2015-04-15 12:10:03  12345                 robot_1

2 Answers


The question is more complex than it first appeared, so the answer has been completely changed.

The first step is preprocessing - keep only the values that exist in the sequences, using Series.isin and boolean indexing:

patterns = df1['sequence'].str.split(',')
print (patterns)

# flatten the lists into a set of all message numbers used in any sequence
flatten = set([y for x in patterns for y in x])
#print (flatten)

df1 = df[df['message_number'].isin(flatten)]
#print (df1)
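
With the 4-17-2019 sample frames, this keeps only the seven rows whose message_number appears in some sequence (a worked trace of mine, worth double-checking):

print (df1['message_number'].tolist())
# ['10053', '10156_ad', '20205', '10010', '10012', '10011', '87954']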

The first solution is a modified version of this answer - a groupby is added for sequences with length > 1, the function is called for each pattern, and finally everything is joined together by concat:

def rolling_window(a, window):
    # strided 2D view: each row is one window of consecutive values
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
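
For intuition, a quick example of what rolling_window returns (example values are mine):

print (rolling_window(np.array([1, 2, 3, 4]), 2))
# [[1 2]
#  [2 3]
#  [3 4]]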

used_idx = []

def agg_pattern(seq):
    if seq:
        N = len(seq)
        arr = df1['message_number'].values
        # positions where a full window of N values equals the sequence
        b = np.all(rolling_window(arr, N) == seq, axis=1)
        # starting positions of each match
        c = np.mgrid[0:len(b)][b]

        # expand each start position to cover all N rows of the match
        d = [i for x in c for i in range(x, x+N)]
        used_idx.extend(df1.index.values[d])
        m = np.in1d(np.arange(len(arr)), d)

        di = {'event_timestamp':'first','message_number':','.join, 'an_robot':','.join}

        if len(seq) == 1:
            return df1.loc[m, ['event_timestamp','message_number','an_robot']]
        else:
            # every N consecutive matched rows form one group to aggregate
            df2 = df1[m]
            return df2.groupby(np.arange(len(df2)) // N).agg(di)


out = pd.concat([agg_pattern(x) for x in patterns], ignore_index=True)

Your solution should be changed to create a helper column g, used for grouping in the last step:

used_idx = []
c = ['event_timestamp','message_number','an_robot']
def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df1.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df1.loc[m,  c].assign(g = df1.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df1.index).replace({False: np.nan})
                                               .ffill(limit=len(seq)-1)
                                               .fillna(False))
            df2 = df1.loc[m1,  c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df1.index[m]), df2.index, np.nan)
            return df2


out = (pd.concat([find_drops(x) for x in patterns])
        .assign(g = lambda x: x['g'].ffill())
        .groupby(by=['g']).agg({'event_timestamp':'first',
                                 'message_number':','.join, 
                                 'an_robot':','.join})
        .reset_index(drop=True))

print (used_idx)
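
With the sample data this should print the original row indices consumed by the three patterns (my own trace, worth verifying):

[1, 2, 3, 5, 6, 8, 9]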

Last, create a new DataFrame from the rows not used by any sequence (the False values) and join it to the output:

print (out)
       event_timestamp        message_number                   an_robot
0  2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
1  2015-04-15 12:09:51           10010,10012          robot_38,robot_65
2  2015-04-15 12:09:59           10011,87954           robot_39,robot_2

c = ['event_timestamp','message_number','an_robot']
df2 = pd.concat([out, df[~df.index.isin(used_idx)]]).sort_values('event_timestamp')
print(df2)
       event_timestamp        message_number                   an_robot
0  2015-04-15 12:09:39                 10125                    robot_7
0  2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
4  2015-04-15 12:09:48                 45689                   robot_23
1  2015-04-15 12:09:51           10010,10012          robot_38,robot_65
7  2015-04-15 12:09:58                 98765                   robot_99
2  2015-04-15 12:09:59           10011,87954           robot_39,robot_2

9 Comments

Wow, this works great indeed! I have also added what the dataframe with the rules looks like. How could I convert these rules to the rules variable that you made? Also, would it still work for rules that have common codes (rule 3 in the dataframe has 10010 in it as well)?
Oh wait, before you answer: I will upload the code that determines whether or not check is True or False; maybe we can combine the codes.
Both codes work great! How should these two solutions be altered so that the rows that are not in the patterns/sequences dataframe also end up in the final dataframe, like in my expected output? Because right now the final dataframe only contains the rows with joined message numbers and robots.
I tried using the new code, however I get the error that DataFrame object has no attribute message_number. I changed the line at the beginning of the find_drops function to df.message_number, but that gave me the error: wrong number of items passed 18, placement implies 4.
@intStdu - Edited answer.

You could classify the message numbers before you group them. It would be best to have these classification rules in a dataframe, one classification per number.

class_df = pd.DataFrame(data={'message_number': ['10010', '10012', '10011', '10053', '10156_ad'],
                              'class': ['a', 'a', 'a', 'b', 'b']})

You can then merge them:

results = pd.merge(df, class_df, on=['message_number'], how='left')

Then you can group by class and check:

results.groupby(by=['check', 'class']).agg({'event_timestamp':'first',
                                            'message_number':lambda x: ','.join(x),
                                            'an_robot':lambda x: ','.join(x)}).reset_index()
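
Since the question notes the check column can be removed after the operation, a small follow-up of my own: if the grouped result above is assigned to a variable such as out, the helper columns can be dropped afterwards:

# drop the helper columns once the rows are merged (my addition)
out = out.drop(columns=['check', 'class'])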
