
I have a data set (below) where I want to group the data by user_id and get the count of each cluster_label for each user_id. The purpose is to find out how many times each user went to each cluster they visited.

Essentially, I am looking for a result that returns this information (it can be a list, a dict, or comma-separated values):

user_id,          cluster 54, cluster 109, cluster 191, cluster 204, cluster 260, cluster 263, cluster 264, cluster 278, cluster 290
819000000000000000, 1        1             2             1           3             1           1           1              1           

I've tried the following code:

data['user_id'] = data.index
result = data.groupby(['user_id','cluster_label']).count() 

and

groupby = data.groupby('user_id').filter(lambda x: len(x['user_id'])>=2)

#sort user locations by time
groupsort = groupby.sort_values(by='timestamp')
f = lambda x: [list(x)]
trajs = groupsort.groupby('user_id')['cluster_label'].apply(f).reset_index()

The second code block gets me closer to what I'm looking for, but I have not been able to figure out the counting portion:

790068    [[485, 256, 304, 311, 311, 311, 311, 417, 417]]
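
For reference, a rough sketch of how the counting could be finished from that point, by counting within each user's list with collections.Counter (this assumes the trajs frame built above and is just one option, not necessarily the best one):

from collections import Counter

# each row of trajs['cluster_label'] is a one-element list wrapping the
# user's ordered cluster visits, so count the inner list
trajs['cluster_counts'] = trajs['cluster_label'].apply(lambda visits: dict(Counter(visits[0])))
print(trajs[['user_id', 'cluster_counts']])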

Data:

user_id,timestamp,latitude,longitude,cluster_label
822000000000000000,3/28/2017 22:31,38.7842,-77.164,634
822000000000000000,3/28/2017 22:44,38.7842,-77.164,634
822000000000000000,3/29/2017 8:02,38.8976805,-77.387238,413
822000000000000000,3/29/2017 8:21,38.8976805,-77.387238,413
822000000000000000,3/29/2017 19:58,38.8976805,-77.387238,413
822000000000000000,3/29/2017 22:12,38.8976805,-77.387238,413
822000000000000000,3/30/2017 9:07,38.8976805,-77.387238,413
822000000000000000,3/30/2017 10:27,38.8976805,-77.387238,413
822000000000000000,3/30/2017 17:17,38.8976805,-77.387238,413
822000000000000000,3/30/2017 17:19,38.8976805,-77.387238,413
822000000000000000,3/30/2017 17:19,38.8976805,-77.387238,413
822000000000000000,3/30/2017 17:20,38.8976805,-77.387238,413
822000000000000000,3/30/2017 17:22,38.8976805,-77.387238,413
822000000000000000,3/30/2017 18:16,38.8976805,-77.387238,413
822000000000000000,3/30/2017 18:17,38.8976805,-77.387238,413
822000000000000000,3/30/2017 21:43,38.8976805,-77.387238,413
822000000000000000,3/31/2017 7:04,38.8976805,-77.387238,413
821000000000000000,3/9/2017 19:06,39.1328,-76.694,35
821000000000000000,3/9/2017 19:07,39.3426644,-76.6874899,90
821000000000000000,3/9/2017 19:07,38.93730032,-77.8885944,207
821000000000000000,3/9/2017 19:07,38.9071923,-77.368707,327
821000000000000000,3/9/2017 19:06,38.8940974,-77.276216,438
821000000000000000,3/9/2017 19:07,38.882584,-77.1124701,521
821000000000000000,3/9/2017 19:08,38.8577901,-76.8538565,565
821000000000000000,3/27/2017 21:12,38.888108,-77.1978416,485
820000000000000000,3/9/2017 19:09,39.535541,-77.1347642,77
820000000000000000,3/9/2017 19:08,38.9847,-77.1131,143
820000000000000000,3/22/2017 14:26,38.8951,-77.367,432
820000000000000000,3/24/2017 19:13,39.227,-77.1864,98
820000000000000000,3/30/2017 7:39,39.227,-77.1864,98
819000000000000000,3/9/2017 19:09,39.942239,-76.85709,54
819000000000000000,3/9/2017 19:11,39.042,-77.19,109
819000000000000000,3/9/2017 19:16,38.95315,-77.447735,191
819000000000000000,3/9/2017 19:10,38.95278983,-77.44791904,191
819000000000000000,3/9/2017 19:12,38.94033497,-77.17591993,204
819000000000000000,3/9/2017 19:09,38.917866,-77.23722,260
819000000000000000,3/9/2017 19:09,38.917866,-77.23722,260
819000000000000000,3/9/2017 19:09,38.917866,-77.23722,260
819000000000000000,3/9/2017 19:15,38.91778,-76.9769,263
819000000000000000,3/9/2017 19:12,38.916489,-77.318051,264
819000000000000000,3/9/2017 19:12,38.915147,-77.217751,278
819000000000000000,3/9/2017 19:15,38.912068,-77.190228,290

1 Answer

I think you can use GroupBy.size as an alternative to count and reshape with Series.unstack, either filling the missing values or leaving them as NaN:

result = data.groupby(['user_id','cluster_label']).size().unstack(fill_value=0)
print (result)
cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...   
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...   
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...   
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...   

cluster_label       278  290  327  413  432  438  485  521  565  634  
user_id                                                               
819000000000000000    1    1    0    0    0    0    0    0    0    0  
820000000000000000    0    0    0    0    1    0    0    0    0    0  
821000000000000000    0    0    1    0    0    1    1    1    1    0  
822000000000000000    0    0    0   15    0    0    0    0    0    2  

[4 rows x 23 columns]

result = data.groupby(['user_id','cluster_label']).size().unstack()
print (result)

cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000  NaN  1.0  NaN  NaN  NaN  1.0  NaN  2.0  1.0  NaN  ...   
820000000000000000  NaN  NaN  1.0  NaN  2.0  NaN  1.0  NaN  NaN  NaN  ...   
821000000000000000  1.0  NaN  NaN  1.0  NaN  NaN  NaN  NaN  NaN  1.0  ...   
822000000000000000  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   

cluster_label       278  290  327   413  432  438  485  521  565  634  
user_id                                                                
819000000000000000  1.0  1.0  NaN   NaN  NaN  NaN  NaN  NaN  NaN  NaN  
820000000000000000  NaN  NaN  NaN   NaN  1.0  NaN  NaN  NaN  NaN  NaN  
821000000000000000  NaN  NaN  1.0   NaN  NaN  1.0  1.0  1.0  1.0  NaN  
822000000000000000  NaN  NaN  NaN  15.0  NaN  NaN  NaN  NaN  NaN  2.0  

[4 rows x 23 columns]
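
If the NaN version is what you end up with but integer counts are needed afterwards, one option (an extra step, not part of the answer code above) is to fill and cast:

result = data.groupby(['user_id','cluster_label']).size().unstack().fillna(0).astype(int)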

Or use crosstab:

result = pd.crosstab(data['user_id'],data['cluster_label'])
print (result)
cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...   
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...   
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...   
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...   

cluster_label       278  290  327  413  432  438  485  521  565  634  
user_id                                                               
819000000000000000    1    1    0    0    0    0    0    0    0    0  
820000000000000000    0    0    0    0    1    0    0    0    0    0  
821000000000000000    0    0    1    0    0    1    1    1    1    0  
822000000000000000    0    0    0   15    0    0    0    0    0    2  

[4 rows x 23 columns]
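
As a side note, crosstab can also return per-user proportions instead of raw counts through its normalize parameter, should that ever be useful:

result = pd.crosstab(data['user_id'], data['cluster_label'], normalize='index')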

1 Comment

Thank you, this is great! I was unaware of Series.unstack, but I see it performs a pivot and is extremely useful. However, I ran this on my larger dataset, and I've come to realize that this may not be the best solution for what I'm ultimately trying to accomplish, because I have over 700 clusters and the result is a very large, sparse matrix. I will accept this and keep exploring other, hopefully more efficient, solutions.
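
For anyone with the same sparsity concern, one way to keep the counts in long form (a sketch, not part of the accepted answer) is to skip the reshape entirely and keep one row per (user_id, cluster_label) pair:

counts = data.groupby(['user_id', 'cluster_label']).size().reset_index(name='visits')
print(counts.head())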
