I am working with web logs and have data containing account_id and session_id. Multiple sessions can be associated with one account, and each session can appear in several rows. I want to create a new DataFrame with one row per account_id and the number of unique sessions associated with that account. My df looks like this:

account_id session_id
 1111          de322
 1111          de322
 1111          de322
 1111          de323
 1111          de323
 0210          ge012
 0210          ge013
 0211          ge330
 0213          ge333

I'm using this code:

new_df = df.groupby(['account_id','session_id']).sum()

The output I am getting is below:

 account_id     sessions
 1111           de322
                de323
 0210           ge012 
                ge013 
 0211           ge330
 0213           ge333

The output I'm expecting is:

account_id   sessions
 1111           2
 0210           2  
 0211           1
 0213           1

How should I fix it?

1 Answer

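import pandas as pd

# Sample data: one row per logged event (session, user_id)
df = pd.DataFrame({'session': ['de322', 'de322', 'de322', 'de323', 'de323', 'ge012', 'ge012', 'ge013', 'ge333'],
                   'user_id': [1111, 1111, 1111, 1111, 1111, 210, 210, 210, 211],
                   })
print(df)

# Drop repeated (session, user_id) rows, then count the remaining sessions per user
counts = df.drop_duplicates().groupby('user_id').count()
print(counts)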

Output of the second print:

         session
user_id         
210            2
211            1
1111           2
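As an alternative, the same count can be obtained without dropping duplicates first, by asking pandas for the number of distinct values per group. The sketch below is not part of the original answer. It assumes the question's column names (account_id, session_id), rebuilds the question's sample data by hand, and stores account_id as strings so that leading zeros such as 0210 survive. The output column name sessions is taken from the expected output in the question.

```python
import pandas as pd

# Data copied from the question; account_id is kept as a string (an assumption)
# so that leading zeros such as "0210" are preserved.
df = pd.DataFrame({
    "account_id": ["1111"] * 5 + ["0210", "0210", "0211", "0213"],
    "session_id": ["de322", "de322", "de322", "de323", "de323",
                   "ge012", "ge013", "ge330", "ge333"],
})

# nunique() counts the distinct session_id values within each account.
# sort=False keeps the accounts in order of first appearance.
new_df = (
    df.groupby("account_id", sort=False)["session_id"]
      .nunique()
      .reset_index(name="sessions")
)
print(new_df)
```

With this approach the original df is left unchanged, and the result is a regular two-column DataFrame rather than one indexed by account.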

5 Comments

In your script, you mixed up account_id and session_id, and the numbers still don't match what I expect. Account 1111 has 2 UNIQUE sessions, although 5 events. I am trying to count unique sessions per account, not the total number of sessions.
OK, let me rewrite the code.
See, I have updated it.
Thank you very much for your help, it does work!
Can you accept the answer?
