
I have the following data:

from datetime import date, timedelta
import pandas as pd
import numpy as np

sdate = date(2019, 1, 1)   # start date
edate = date(2019, 1, 7)   # end date (exclusive); the range below covers 6 days

required_dates = pd.date_range(sdate, edate - timedelta(days=1), freq='d')

# initialize list of lists: [OnlyDate, TBID, UserID]
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301],
        ['2019-01-05', 1400, 301]]

# create the pandas DataFrame, reorder the columns, and sort by user and TBID
df1 = pd.DataFrame(data, columns=['OnlyDate', 'TBID', 'UserID'])
df1 = df1[['OnlyDate', 'UserID', 'TBID']]
df1.sort_values(by=['UserID', 'TBID'], inplace=True)
df1.reset_index(inplace=True, drop=True)
df1


    OnlyDate    UserID  TBID
0   2019-01-01  101 1000
1   2019-01-02  101 1400
2   2019-01-04  101 2000
3   2019-01-03  201 1000
4   2019-01-04  201 1500
5   2019-01-01  201 2000
6   2019-01-04  301 1400
7   2019-01-05  301 1400
8   2019-01-05  301 1400
9   2019-01-02  301 1500 

What I want to get is an output DataFrame for each UserID, like below:

Desired output for UserID = 101

    OnlyDate    ActualValues  TBID  UserID
    2019-01-01  1   1000   101
    2019-01-02  0   1000   101
    2019-01-03  0   1000   101
    2019-01-04  0   1000   101
    2019-01-05  0   1000   101

    2019-01-01  0   1400   101
    2019-01-02  1   1400   101
    2019-01-03  0   1400   101
    2019-01-04  0   1400   101
    2019-01-05  0   1400   101

    2019-01-01  0   1500   101
    2019-01-02  0   1500   101
    2019-01-03  0   1500   101
    2019-01-04  0   1500   101
    2019-01-05  0   1500   101

    2019-01-01  0   2000   101
    2019-01-02  0   2000   101
    2019-01-03  0   2000   101
    2019-01-04  1   2000   101
    2019-01-05  0   2000   101

Desired output for UserID = 301

    2019-01-01  0   1000   301
    2019-01-02  0   1000   301
    2019-01-03  0   1000   301
    2019-01-04  0   1000   301
    2019-01-05  0   1000   301

    2019-01-01  0   1400   301
    2019-01-02  0   1400   301
    2019-01-03  0   1400   301
    2019-01-04  1   1400   301
    2019-01-05  2   1400   301

    2019-01-01  0   1500   301
    2019-01-02  1   1500   301
    2019-01-03  0   1500   301
    2019-01-04  0   1500   301
    2019-01-05  0   1500   301

    2019-01-01  0   2000   301
    2019-01-02  0   2000   301
    2019-01-03  0   2000   301
    2019-01-04  0   2000   301
    2019-01-05  0   2000   301

I tried this, but it does not give the desired output:

x= pd.get_dummies(data=df1, columns=['TBID']).groupby(['OnlyDate','UserID']).sum()
x


   
            TBID_1000   TBID_1400   TBID_1500   TBID_2000
OnlyDate    UserID              
2019-01-01  101 1   0   0   0
            201 0   0   0   1
2019-01-02  101 0   1   0   0
            301 0   0   1   0
2019-01-03  201 1   0   0   0
2019-01-04  101 0   0   0   1
            201 0   0   1   0
            301 0   1   0   0
2019-01-05  301 0   2   0   0

How can I get such output?

2 Answers


Use GroupBy.size with Series.reindex:

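# Count rows per (OnlyDate, UserID, TBID); only observed combinations appear
df = df1.groupby(['OnlyDate','UserID','TBID']).size()
# Build every combination of the three index levels, keeping the level names
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
# Fill unobserved combinations with 0, then order by UserID, TBID, OnlyDate
df = df.reindex(mux, fill_value=0).sort_index(level=[1,2,0]).reset_index(name='count')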

print (df.head(10))
     OnlyDate  UserID  TBID  count
0  2019-01-01     101  1000      1
1  2019-01-02     101  1000      0
2  2019-01-03     101  1000      0
3  2019-01-04     101  1000      0
4  2019-01-05     101  1000      0
5  2019-01-01     101  1400      0
6  2019-01-02     101  1400      1
7  2019-01-03     101  1400      0
8  2019-01-04     101  1400      0
9  2019-01-05     101  1400      0
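
To look at a single user, the full result can be filtered afterwards. The following is a minimal self-contained sketch of the same groupby/reindex idea on the question's sample data; the names counts, full_index, result and user_301 are only illustrative. It assumes the grid should contain every date, UserID and TBID that occurs anywhere in the data, so a user also gets rows for TBIDs they never used.

```python
import pandas as pd

# Sample data from the question: [OnlyDate, TBID, UserID]
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301],
        ['2019-01-05', 1400, 301]]
df1 = pd.DataFrame(data, columns=['OnlyDate', 'TBID', 'UserID'])

# Number of rows for each observed (OnlyDate, UserID, TBID) combination
counts = df1.groupby(['OnlyDate', 'UserID', 'TBID']).size()

# Full grid: every date x every user x every TBID seen in the data
full_index = pd.MultiIndex.from_product(counts.index.levels,
                                        names=counts.index.names)

# Unobserved combinations get a count of 0
result = (counts.reindex(full_index, fill_value=0)
                .sort_index(level=['UserID', 'TBID', 'OnlyDate'])
                .reset_index(name='count'))

# Rows for one user only (illustrative choice of UserID)
user_301 = result[result['UserID'] == 301].reset_index(drop=True)
print(user_301)
```

Because the grid is the full product of the three levels, its size grows multiplicatively with the number of distinct dates, users and TBIDs, which is worth keeping in mind for larger inputs.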

2 Comments

I don't think the UserID 301 part is right... (TBID 2000 is missing)
Hi @jezrael, for me it is only sort_index(level=[1,0]), as level 2 is not there. Overall it doesn't give the exact solution, but it gave me a hint. I am also getting a memory error, as my DataFrame has shape [250K, 4].

The basic idea is to count rows with groupby and size. The tricky part is filling the missing index combinations with 0, which can be done either with reindex or by reshaping the data. Below is the reshaping approach:

(df1.groupby(['OnlyDate','UserID','TBID'])
    .size()
    .unstack('OnlyDate', fill_value=0) 
    .unstack('UserID', fill_value=0)
    .unstack()
    .reset_index(name='count'))
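
To see what each unstack call does, the chain can be split into named steps. This is only a sketch on the question's sample data; the names counts, step1, step2, step3 and result are illustrative. Each unstack moves one index level into the columns and fills the newly created gaps with 0; the final unstack() on a frame with a single-level row index turns everything back into a Series indexed by (OnlyDate, UserID, TBID).

```python
import pandas as pd

# Sample data from the question: [OnlyDate, TBID, UserID]
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301],
        ['2019-01-05', 1400, 301]]
df1 = pd.DataFrame(data, columns=['OnlyDate', 'TBID', 'UserID'])

# Series with a three-level index; only observed combinations are present
counts = df1.groupby(['OnlyDate', 'UserID', 'TBID']).size()

# Rows: (UserID, TBID) pairs, columns: one per OnlyDate
step1 = counts.unstack('OnlyDate', fill_value=0)

# Rows: TBID, columns: (OnlyDate, UserID) pairs
step2 = step1.unstack('UserID', fill_value=0)

# Back to a Series, now covering every (OnlyDate, UserID, TBID) combination
step3 = step2.unstack()

result = step3.reset_index(name='count')
print(step1.shape, step2.shape, len(step3))
print(result.head())
```

The rows come out ordered by OnlyDate, then UserID, then TBID; sorting with result.sort_values(['UserID', 'TBID', 'OnlyDate']) gives the per-user layout from the question.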

1 Comment

Exactly, I need to understand the unstack mechanism. Thanks, mark
