I have a multi-index df that outlines somebody's path through a website: sid is the session, vid is the visitor id, pid is the web page, and ts is the time at which they landed on that page.

           pid    ts
sid vid 
 1   A    page1    t1
     A    page2    t2
     A    page3    t3
     A    page4    t4
     A    page5    t5
 2   B    page1    t4
 3   C    page1    t5
     C    page2    t6

Some users have ridiculously long pid paths (1000+ pages), which I imagine could be an error. When I transpose/pivot this data, it takes ages because a few of the paths are so long.

So I want to impose a threshold: any session with more than some number of rows (let's say 3, for example) should be deleted entirely, identified by its sid.

With a threshold of 3 rows, the df would look like this:

           pid    ts
sid vid 
 2   B    page1    t4
 3   C    page1    t5
     C    page2    t6

Any idea on how to do this?

1 Answer

Sure, just use groupby + filter. In this case, it seems "sid" is level 0 of the MultiIndex, so we can do:

df.groupby(level=0).filter(lambda x: len(x) <= 3)

filter keeps only the groups for which the lambda expression is true, which in this case means groups whose length (the number of rows in the group's data frame) is less than or equal to 3.
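As a runnable sketch of the filter approach, the snippet below rebuilds the frame from the question (the ts values are placeholder strings, and `max_rows` is a name chosen here for illustration) and drops every session longer than the threshold. The boolean-mask variant at the end is an alternative idiom, not part of the answer above; it may be worth trying if filter with a Python lambda turns out to be slow on many groups.

```python
import pandas as pd

# Rebuild the example frame from the question; timestamps are placeholder strings.
index = pd.MultiIndex.from_tuples(
    [(1, "A")] * 5 + [(2, "B")] + [(3, "C")] * 2,
    names=["sid", "vid"],
)
df = pd.DataFrame(
    {
        "pid": ["page1", "page2", "page3", "page4", "page5", "page1", "page1", "page2"],
        "ts": ["t1", "t2", "t3", "t4", "t5", "t4", "t5", "t6"],
    },
    index=index,
)

max_rows = 3  # illustrative threshold name, not from the original post

# Keep only the sessions that have at most max_rows rows.
short_sessions = df.groupby(level="sid").filter(lambda g: len(g) <= max_rows)

# Alternative (an assumption of this sketch): compute each row's session size
# and select rows with a boolean mask instead of calling a lambda per group.
sizes = df.groupby(level="sid")["pid"].transform("size")
short_sessions_mask = df[(sizes <= max_rows).to_numpy()]

print(short_sessions)
```

Grouping by `level="sid"` is equivalent to `level=0` here, since sid is the first index level.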

Alternatively, you could keep the first, say, 3 rows of each group instead of eliminating the group completely:

df.groupby(level=0).head(3)
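To illustrate how head differs from filter, here is a small sketch on the question's example data (again with placeholder ts strings): the long session is truncated to its first 3 rows rather than removed.

```python
import pandas as pd

# Same example frame as in the question; timestamps are placeholder strings.
index = pd.MultiIndex.from_tuples(
    [(1, "A")] * 5 + [(2, "B")] + [(3, "C")] * 2,
    names=["sid", "vid"],
)
df = pd.DataFrame(
    {
        "pid": ["page1", "page2", "page3", "page4", "page5", "page1", "page1", "page2"],
        "ts": ["t1", "t2", "t3", "t4", "t5", "t4", "t5", "t6"],
    },
    index=index,
)

# Keep at most the first 3 rows of every session; no session is dropped.
truncated = df.groupby(level="sid").head(3)

print(truncated)
```

Session 1 keeps page1 through page3, while sessions 2 and 3 are unchanged because they already have 3 rows or fewer.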