
I have a dataframe with the following structure:

event_timestamp      message_number  an_robot     check
2015-04-15 12:09:39  10125            robot_7     False
2015-04-15 12:09:41  10053            robot_4     True
2015-04-15 12:09:44  10156_ad         robot_7     True
2015-04-15 12:09:47  20205            robot_108   False
2015-04-15 12:09:51  10010            robot_38    True
2015-04-15 12:09:54  10012            robot_65    True
2015-04-15 12:09:59  10011            robot_39    True
2015-04-15 12:10:01  87954            robot_2     False
......etc

The check column indicates whether or not the row should be merged, in this manner:

event timestamp: first
 message number: combine (e.g., 10053,10156)
       an_robot: combine (e.g., robot_4, robot_7)
          check: can be removed after the operation.

So far, I have succeeded in using groupby to get the correct values for the True and False groups in the check column:

df.groupby(by='check').agg({'event_timestamp':'first',
                            'message_number':lambda x: ','.join(x),
                            'an_robot':lambda x: ','.join(x)}).reset_index()

which outputs:

     check    event_timestamp        message_number         an_robot
0    False    2015-04-15 12:09:39    10125,10053,..,87954   robot_7,robot_4, ... etc
1    True     2015-04-15 12:09:51    10010,10012            robot_38,robot_65

However, the end result would ideally be the following: the 10053 and 10156_ad rows are combined, and the 10010, 10012 and 10011 rows are combined. In the full dataframe, the maximum length a sequence can be is 5. I have a separate dataframe with those rules (like the 10010,10012,10011 rule).

event_timestamp      message_number        an_robot
2015-04-15 12:09:39  10125                 robot_7
2015-04-15 12:09:41  10053,10156_ad        robot_4,robot_7
2015-04-15 12:09:47  20205                 robot_108
2015-04-15 12:09:51  10010,10012,10011     robot_38,robot_65,robot_39
2015-04-15 12:10:01  87954                 robot_2

How could I achieve this?

--EDIT--

The dataset with the separate rules looks as follows:

sequence             support
10053,10156,20205    0.94783
10010,10012          0.93322
10010,10033          0.93211
10053,10032          0.92222
etc....
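
For illustration, one way this rules frame could be built and split into the list-of-lists form that the matching code below expects (the names rules and patterns are my own, not from the question):

import pandas as pd

# illustrative reconstruction of the rules dataframe shown above
rules = pd.DataFrame({'sequence': ['10053,10156,20205', '10010,10012',
                                   '10010,10033', '10053,10032'],
                      'support': [0.94783, 0.93322, 0.93211, 0.92222]})

# each rule becomes a list of message numbers, e.g. ['10010', '10012']
patterns = rules['sequence'].str.split(',')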

The code that determines whether a row in check will be True or False:

import numpy as np
import pandas as pd

def find_drops(seq, df):
    if seq:
        # True at the row where the full sequence starts
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            # extend each match forward so every row of the sequence is flagged
            return pd.Series(m, index=df.index).replace({False: np.nan}).ffill(limit=len(seq)-1).fillna(False)
    else:
        return pd.Series(False, index=df.index)

If I then run df['check'] = find_drops(['10010', '10012', '10011'], df), I get a check column with True for those rows. It would be great if it were possible to run this for each rule in the rules dataframe and then merge the rows with the code provided.
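
A minimal sketch of that idea, assuming the rules frame above is loaded as rules: run find_drops once per rule and OR the results into a single check column. Note this only flags the rows; telling adjacent sequences apart from each other still needs a helper group column, which the answers below add.

# sketch only: rules is assumed to hold the sequence/support frame shown above
patterns = rules['sequence'].str.split(',')

# a row is True if it belongs to any rule's sequence
df['check'] = np.logical_or.reduce([find_drops(seq, df) for seq in patterns])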

--new code 4-17-2019--

df = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156_ad|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:48|45689|robot_23
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:58|98765|robot_99
2015-04-15 12:09:59|10011|robot_39
2015-04-15 12:10:01|87954|robot_2"""

df = pd.read_csv(io.StringIO(df), sep='|')

df1 = """sequence|support
10053,10156_ad,20205|0.94783
10010,10012|0.93322
10011,87954|0.92222
"""

df1 = pd.read_csv(io.StringIO(df1), sep='|')
patterns = df1['sequence'].str.split(',')

used_idx = []
c = ['event_timestamp','message_number','an_robot']
def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df.loc[m,  c].assign(g = df.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df.index).replace({False: np.nan})
                                               .ffill(limit=len(seq)-1)
                                               .fillna(False))
            df2 = df.loc[m1,  c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df.index[m]), df2.index, np.nan)
            return df2


out = (pd.concat([find_drops(x) for x in patterns])
        .assign(g = lambda x: x['g'].ffill())
        .groupby(by=['g']).agg({'event_timestamp':'first',
                                 'message_number':','.join, 
                                 'an_robot':','.join})
        .reset_index(drop=True))

c = ['event_timestamp','message_number','an_robot']
df2 = df[~df.index.isin(used_idx)]
df2 = pd.DataFrame([[df2['event_timestamp'].iat[0], 
                    ','.join(df2['message_number']),
                    ','.join(df2['an_robot'])]], columns=c)

fin = pd.concat([out, df2], ignore_index=True)
fin.event_timestamp = pd.to_datetime(fin.event_timestamp)
fin = fin.sort_values('event_timestamp')
fin

output is:

event_timestamp      message_number           an_robot
2015-04-15 12:09:39  10125,45689,98765,12345  robot_7,robot_23,robot_99
2015-04-15 12:09:41  10053,10156_ad,20205     robot_4,robot_7,robot_108
2015-04-15 12:09:51  10010,10012              robot_38,robot_65
2015-04-15 12:09:59  10011,87954              robot_39,robot_2

should be:

event_timestamp      message_number        an_robot
2015-04-15 12:09:39  10125                 robot_7
2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
2015-04-15 12:09:48  45689                 robot_23
2015-04-15 12:09:51  10010,10012           robot_38,robot_65
2015-04-15 12:09:58  98765                 robot_99
2015-04-15 12:09:59  10011,87954           robot_39,robot_2
2015-04-15 12:10:03  12345                 robot_1

2 Answers


The question is more complex than it first appeared, so the answer has been completely changed.

The first step is preprocessing - keep only the values that exist in the sequences, using Series.isin and boolean indexing:

patterns = df1['sequence'].str.split(',')
print (patterns)

# flatten the lists into a set of all message numbers used in any sequence
flatten = set([y for x in patterns for y in x])
#print (flatten)

df1 = df[df['message_number'].isin(flatten)]
#print (df1)
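
With the 4-17-2019 sample frames, this keeps only the seven rows whose message_number appears in some sequence (a worked trace of mine, worth double-checking):

print (df1['message_number'].tolist())
# ['10053', '10156_ad', '20205', '10010', '10012', '10011', '87954']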

The first solution is a modified version of this answer - a groupby is added for sequences with length > 1, the function is called for each pattern, and finally everything is joined together by concat:

def rolling_window(a, window):
    # strided 2D view: each row is one window of consecutive values
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
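
For intuition, a quick example of what rolling_window returns (example values are mine):

print (rolling_window(np.array([1, 2, 3, 4]), 2))
# [[1 2]
#  [2 3]
#  [3 4]]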

used_idx = []

def agg_pattern(seq):
    if seq:
        N = len(seq)
        arr = df1['message_number'].values
        # positions where a full window of N values equals the sequence
        b = np.all(rolling_window(arr, N) == seq, axis=1)
        # starting positions of each match
        c = np.mgrid[0:len(b)][b]

        # expand each start position to cover all N rows of the match
        d = [i for x in c for i in range(x, x+N)]
        used_idx.extend(df1.index.values[d])
        m = np.in1d(np.arange(len(arr)), d)

        di = {'event_timestamp':'first','message_number':','.join, 'an_robot':','.join}

        if len(seq) == 1:
            return df1.loc[m, ['event_timestamp','message_number','an_robot']]
        else:
            # every N consecutive matched rows form one group to aggregate
            df2 = df1[m]
            return df2.groupby(np.arange(len(df2)) // N).agg(di)


out = pd.concat([agg_pattern(x) for x in patterns], ignore_index=True)

Your solution should be changed to create a helper column g, used for grouping in the last step:

used_idx = []
c = ['event_timestamp','message_number','an_robot']
def find_drops(seq):
    if seq:
        m = np.logical_and.reduce([df1.message_number.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            df2 = df1.loc[m,  c].assign(g = df1.index[m])
            used_idx.extend(df2.index.tolist())
            return df2
        else:
            m1 = (pd.Series(m, index=df1.index).replace({False: np.nan})
                                               .ffill(limit=len(seq)-1)
                                               .fillna(False))
            df2 = df1.loc[m1,  c]
            used_idx.extend(df2.index.tolist())
            df2['g'] = np.where(df2.index.isin(df1.index[m]), df2.index, np.nan)
            return df2


out = (pd.concat([find_drops(x) for x in patterns])
        .assign(g = lambda x: x['g'].ffill())
        .groupby(by=['g']).agg({'event_timestamp':'first',
                                 'message_number':','.join, 
                                 'an_robot':','.join})
        .reset_index(drop=True))

print (used_idx)
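
With the sample data this should print the original row indices consumed by the three patterns (my own trace, worth verifying):

[1, 2, 3, 5, 6, 8, 9]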

Last, create a new DataFrame from the rows not used by any sequence (the False values) and join it to the output:

print (out)
       event_timestamp        message_number                   an_robot
0  2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
1  2015-04-15 12:09:51           10010,10012          robot_38,robot_65
2  2015-04-15 12:09:59           10011,87954           robot_39,robot_2

c = ['event_timestamp','message_number','an_robot']
df2 = pd.concat([out, df[~df.index.isin(used_idx)]]).sort_values('event_timestamp')
print(df2)
       event_timestamp        message_number                   an_robot
0  2015-04-15 12:09:39                 10125                    robot_7
0  2015-04-15 12:09:41  10053,10156_ad,20205  robot_4,robot_7,robot_108
4  2015-04-15 12:09:48                 45689                   robot_23
1  2015-04-15 12:09:51           10010,10012          robot_38,robot_65
7  2015-04-15 12:09:58                 98765                   robot_99
2  2015-04-15 12:09:59           10011,87954           robot_39,robot_2

9 Comments

Wow, this works great indeed! I have also added what the dataframe with the rules looks like. How could I convert these rules to the rules variable that you made? Also, would it still work for rules that have common codes (rule 3 in the dataframe has 10010 in it as well)?
Oh wait, before you answer: I will upload the code that determines whether or not check is True or False; maybe we can combine the codes.
Both codes work great! How should these two solutions be altered so that the rows that are not in the patterns/sequences dataframe also end up in the final dataframe, like in my expected output? Because right now the final dataframe only contains the rows with joined message numbers and robots.
I tried using the new code, however I get the error that DataFrame object has no attribute message_number. I changed the line at the beginning of the find_drops function to df.message_number, but that gave me the error: wrong number of items passed 18, placement implies 4.
@intStdu - Edited answer.

You could classify the message numbers before you group them. It would be best to have these classification rules in a dataframe, one classification per number.

class_df = pd.DataFrame(data={'message_number': ['10010', '10012', '10011', '10053', '10156_ad'],
                              'class': ['a', 'a', 'a', 'b', 'b']})

You can then merge them:

results = pd.merge(df, class_df, on=['message_number'], how='left')

Then you can group by class and check:

results.groupby(by=['check', 'class']).agg({'event_timestamp':'first',
                                            'message_number':lambda x: ','.join(x),
                                            'an_robot':lambda x: ','.join(x)}).reset_index()
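
Since the question notes the check column can be removed after the operation, a small follow-up of my own: if the grouped result above is assigned to a variable such as out, the helper columns can be dropped afterwards:

# drop the helper columns once the rows are merged (my addition)
out = out.drop(columns=['check', 'class'])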
