I'd like to delete rows from my dataframe when the next one meets certain conditions. Let's say that my dataset is:

import pandas as pd

raw_data = {'SessionID': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2', 'S2', 'S3', 'S3', 'S3', 'S3', 'S3', 'S3'], 
    'Event Action': ['Action', 'Action', 'Filter', 'Action', 'Action', 'Action', 'Filter', 'Filter', 'Action', 'Filter','Action', 'Filter', 'Filter', 'Action'], 
    'Timestamp': ['T1.1', 'T1.2', 'T1.3', 'T1.1', 'T1.2', 'T1.3', 'T1.3', 'T1.4', 'T1.4', 'T1.5', 'T1.7', 'T1.7', 'T1.8', 'T1.9']}

df = pd.DataFrame(raw_data, columns = ['SessionID', 'Event Action', 'Timestamp'])

df

 SessionID  Event Action    Timestamp
0   S1         Action          T1.1
1   S1         Action          T1.2
2   S1         Filter          T1.3
3   S2         Action          T1.1
4   S2         Action          T1.2
5   S2         Action          T1.3
6   S2         Filter          T1.3
7   S2         Filter          T1.4
8   S3         Action          T1.4
9   S3         Filter          T1.5
10  S3         Action          T1.7
11  S3         Filter          T1.7
12  S3         Filter          T1.8
13  S3         Action          T1.9

For any given row, with row1 being the row immediately after it, I want to delete row when:

if df[row:'SessionID'] == df[row1:'SessionID'] 
and df[row:'Event Action'] == 'Action' 
and df[row1:'Event Action'] == 'Filter' 
and df[row:'Timestamp'] == df[row1:'Timestamp']

For instance, in the dataset above, the rows that should be removed are 5 and 10. I'm not very experienced with functions in Python, but I've tried:

def cleanfilter(row):
    row1 = row + 1
    if df[row:'SessionID'] == df[row1:'SessionID'] and df[row:'Event Action'] == 'Search Action'and df[row1:'Event Action'] == 'Search Filter' and df[row:'Timestamp'] == df[row1:'Timestamp']:
        df.drop(df.index[row])

df.apply(cleanfilter,axis=1)

But I keep receiving: "TypeError: ('must be str, not int', 'occurred at index 0')". I don't know what else to search for. Any advice would be much appreciated! Thanks in advance.

  • What do you mean by row1:'SessionID'? I don't see a row 1. Commented Jul 12, 2018 at 14:59
  • My apologies for the poor wording. row1 is the row immediately after row: for any given row, row1 is the one that follows it. Commented Jul 12, 2018 at 15:02

1 Answer

You can build a boolean mask for each condition, combine the masks, and index your df with the negated result, since the goal is to drop the rows that meet all of the conditions.

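# shift(-1) moves each column up by one, so every row is compared with the next row
m1 = (df['SessionID'] == df['SessionID'].shift(-1))   # same session as the next row
m2 = (df['Event Action']=='Action')                    # this row is an Action
m3 = (df['Event Action'].shift(-1)=='Filter')          # the next row is a Filter
m4 = (df['Timestamp']==df['Timestamp'].shift(-1))      # same timestamp as the next row
df[~(m1 & m2 & m3 & m4)]                               # keep rows that do NOT meet all four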

Output:

         SessionID Event Action Timestamp
0         S1       Action      T1.1
1         S1       Action      T1.2
2         S1       Filter      T1.3
3         S2       Action      T1.1
4         S2       Action      T1.2
6         S2       Filter      T1.3
7         S2       Filter      T1.4
8         S3       Action      T1.4
9         S3       Filter      T1.5
11        S3       Filter      T1.7
12        S3       Filter      T1.8
13        S3       Action      T1.9
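
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same shift-based approach, using the data from the question. The names `nxt`, `drop_mask`, `cleaned`, and `dropped_labels` are illustrative and not part of the original answer. It shifts the whole frame once instead of each column separately, which should be equivalent for this data.

```python
import pandas as pd

raw_data = {
    'SessionID': ['S1', 'S1', 'S1', 'S2', 'S2', 'S2', 'S2', 'S2',
                  'S3', 'S3', 'S3', 'S3', 'S3', 'S3'],
    'Event Action': ['Action', 'Action', 'Filter', 'Action', 'Action', 'Action',
                     'Filter', 'Filter', 'Action', 'Filter', 'Action', 'Filter',
                     'Filter', 'Action'],
    'Timestamp': ['T1.1', 'T1.2', 'T1.3', 'T1.1', 'T1.2', 'T1.3', 'T1.3', 'T1.4',
                  'T1.4', 'T1.5', 'T1.7', 'T1.7', 'T1.8', 'T1.9'],
}
df = pd.DataFrame(raw_data)

# nxt (illustrative name) holds, at each position, the values of the following row.
# The last row has no successor, so it is filled with missing values there.
nxt = df.shift(-1)

# True where the current row is an 'Action' immediately followed by a 'Filter'
# in the same session with the same timestamp.
drop_mask = (
    (df['SessionID'] == nxt['SessionID'])
    & (df['Event Action'] == 'Action')
    & (nxt['Event Action'] == 'Filter')
    & (df['Timestamp'] == nxt['Timestamp'])
)

cleaned = df[~drop_mask]                      # rows that do not match all conditions
dropped_labels = df.index[drop_mask].tolist() # index labels of the removed rows

print(dropped_labels)
print(cleaned)
```

Comparisons against the missing values in the last shifted row evaluate to False, so the final row is never dropped by this mask.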

2 Comments

That works perfectly and is far more elegant than what I was trying to do. I've now read the pandas .shift documentation, cool. However, I don't fully understand the last line: df[~(m1 & m2 & m3 & m4)]. Does it mean: print the dataframe excluding the rows that meet those conditions?
@E.Faslo, it returns a copy of the df containing only the rows that do not satisfy (m1 & m2 & m3 & m4), i.e. all the conditions at once. Rather than deleting the rows that satisfy the conditions, we select the rows that do not satisfy them. I hope that clears up your doubts.
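
To illustrate the point about `~` on a smaller case, here is a short sketch with a toy frame. The names `toy`, `mask`, `kept`, and `renumbered` are made up for the example. It shows that `~` flips the boolean mask element by element and that indexing with it leaves the original frame untouched, so the result has to be assigned to a name if you want to keep it.

```python
import pandas as pd

# Hypothetical toy data, only to show how ~ behaves on a boolean mask.
toy = pd.DataFrame({'value': [10, 20, 30, 40]})

mask = toy['value'] == 20      # True for the row we want to remove
inverted = ~mask               # element-wise NOT: True for the rows we keep

kept = toy[inverted]           # new frame; toy itself is not modified
renumbered = kept.reset_index(drop=True)  # optional: rebuild a 0..n-1 index

print(inverted.tolist())
print(kept)
print(len(toy))
```

The original index labels are preserved in `kept`, which is why the answer's output skips 5 and 10; `reset_index(drop=True)` is one way to renumber the rows if gaps are not wanted.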
