Delete rows based on a threshold

Question

I have a multi-index df:

This df outlines somebody's path through a website: sid is the session, vid is the visitor id, pid is the web page, and ts is the time at which they landed on the site.

           pid    ts
sid vid 
 1   A    page1    t1
     A    page2    t2
     A    page3    t3
     A    page4    t4
     A    page5    t5
 2   B    page1    t4
 3   C    page1    t5
     C    page2    t6

Some users have ridiculously long pid paths (1000+), which I imagine could be an error. When I transpose/pivot this data, it takes ages because a few of the paths are so long.

So I want to impose a threshold: any session sid with more than some number of rows (let's say 3, for example) should be deleted entirely.

If I impose a threshold of, let's say, 3 rows, then the df would look like this:

           pid    ts
sid vid 
 2   B    page1    t4
 3   C    page1    t5
     C    page2    t6

Any idea on how to do this?

LeoArtaza · Accepted Answer · 2021-10-29 23:01:55Z


Sure, just use groupby+filter. In this case, it seems "sid" is the level 0 of a MultiIndex, so we can do:

df.groupby(level=0).filter(lambda x: len(x) <= 3)

filter keeps only the groups for which the lambda expression is true, which in this case means that the length of the group (its number of rows) is less than or equal to 3.
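To see this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the filter. The ts strings are just the placeholders from the question, and the variable names `threshold` and `kept` are my own, not part of the original post.

```python
import pandas as pd

# Rebuild the example frame from the question (ts values are placeholder strings).
index = pd.MultiIndex.from_tuples(
    [(1, "A")] * 5 + [(2, "B")] + [(3, "C")] * 2,
    names=["sid", "vid"],
)
df = pd.DataFrame(
    {
        "pid": ["page1", "page2", "page3", "page4", "page5",
                "page1", "page1", "page2"],
        "ts": ["t1", "t2", "t3", "t4", "t5", "t4", "t5", "t6"],
    },
    index=index,
)

threshold = 3  # hypothetical name for the maximum number of rows per session

# Keep only the sessions whose row count does not exceed the threshold.
kept = df.groupby(level="sid").filter(lambda g: len(g) <= threshold)
print(kept)
```

With this data, session 1 has five rows and is dropped, while sessions 2 and 3 are kept, which matches the desired output in the question. Grouping by `level="sid"` is equivalent to `level=0` here, assuming the index levels are named as shown.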

Alternatively, you could keep the first, say, 3 rows of each group instead of eliminating the group completely by doing:

df.groupby(level=0).head(3)
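The sketch below contrasts truncating with deleting on the question's example data. It also shows a mask built with transform("size"), which should give the same result as the filter call without running a Python lambda per group; that variant is my own addition, not something the question asked for, and whether it is actually faster on your data is an assumption worth measuring. The names `truncated`, `sizes` and `deleted` are illustrative.

```python
import pandas as pd

# Same example frame as in the question (ts values are placeholder strings).
index = pd.MultiIndex.from_tuples(
    [(1, "A")] * 5 + [(2, "B")] + [(3, "C")] * 2,
    names=["sid", "vid"],
)
df = pd.DataFrame(
    {
        "pid": ["page1", "page2", "page3", "page4", "page5",
                "page1", "page1", "page2"],
        "ts": ["t1", "t2", "t3", "t4", "t5", "t4", "t5", "t6"],
    },
    index=index,
)

# Truncate: keep at most the first 3 rows of every session.
truncated = df.groupby(level="sid").head(3)

# Delete: broadcast each session's row count to its rows, then mask.
sizes = df.groupby(level="sid")["pid"].transform("size")
# to_numpy() makes the mask positional, so the duplicated index labels
# never need to be aligned.
deleted = df[(sizes <= 3).to_numpy()]

print(truncated)
print(deleted)
```

Here `truncated` still contains session 1, cut down to page1 through page3, while `deleted` drops session 1 altogether.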
