Map Data in pandas

Question

I have the data below:

from datetime import date, timedelta
import pandas as pd
import numpy as np
sdate = date(2019,1,1)   # start date
edate = date(2019,1,7)   # end date -6days

required_dates = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
# initialize list of lists 
data = [['2019-01-01', 1000,101], ['2019-01-03', 1000,201] ,['2019-01-02', 1500,301], 
        ['2019-01-02', 1400,101],['2019-01-04', 1500,201],['2019-01-01', 2000,201],
        ['2019-01-04', 2000,101],['2019-01-04', 1400,301],['2019-01-05', 1400,301],['2019-01-05', 1400,301]]
# Create the pandas DataFrame 
df1 = pd.DataFrame(data, columns = ['OnlyDate', 'TBID','UserID'])
df1=df1[['OnlyDate','UserID','TBID']]
df1.sort_values(by=['UserID','TBID'],inplace=True)
df1.reset_index(inplace=True,drop=True)
df1


     OnlyDate  UserID  TBID
0  2019-01-01     101  1000
1  2019-01-02     101  1400
2  2019-01-04     101  2000
3  2019-01-03     201  1000
4  2019-01-04     201  1500
5  2019-01-01     201  2000
6  2019-01-04     301  1400
7  2019-01-05     301  1400
8  2019-01-05     301  1400
9  2019-01-02     301  1500

What I want is an output DataFrame for each UserID, like the ones below:

Desired output for UserID = 101

    OnlyDate    ActualValues  TBID   UserID
    2019-01-01  1   1000   101
    2019-01-02  0   1000   101
    2019-01-03  0   1000   101
    2019-01-04  0   1000   101
    2019-01-05  0   1000   101

    2019-01-01  0   1400   101
    2019-01-02  1   1400   101
    2019-01-03  0   1400   101
    2019-01-04  0   1400   101
    2019-01-05  0   1400   101

    2019-01-01  0   1500   101
    2019-01-02  0   1500   101
    2019-01-03  0   1500   101
    2019-01-04  0   1500   101
    2019-01-05  0   1500   101

    2019-01-01  0   2000   101
    2019-01-02  0   2000   101
    2019-01-03  0   2000   101
    2019-01-04  1   2000   101
    2019-01-05  0   2000   101

Desired output for UserID = 301

    2019-01-01  0   1000   301
    2019-01-02  0   1000   301
    2019-01-03  0   1000   301
    2019-01-04  0   1000   301
    2019-01-05  0   1000   301

    2019-01-01  0   1400   301
    2019-01-02  0   1400   301
    2019-01-03  0   1400   301
    2019-01-04  1   1400   301
    2019-01-05  2   1400   301

    2019-01-01  0   1500   301
    2019-01-02  1   1500   301
    2019-01-03  0   1500   301
    2019-01-04  0   1500   301
    2019-01-05  0   1500   301

    2019-01-01  0   2000   301
    2019-01-02  0   2000   301
    2019-01-03  0   2000   301
    2019-01-04  0   2000   301
    2019-01-05  0   2000   301

I tried the following, but it does not give the desired result:

x= pd.get_dummies(data=df1, columns=['TBID']).groupby(['OnlyDate','UserID']).sum()
x


   
                   TBID_1000  TBID_1400  TBID_1500  TBID_2000
OnlyDate   UserID
2019-01-01 101             1          0          0          0
           201             0          0          0          1
2019-01-02 101             0          1          0          0
           301             0          0          1          0
2019-01-03 201             1          0          0          0
2019-01-04 101             0          0          0          1
           201             0          0          1          0
           301             0          1          0          0
2019-01-05 301             0          2          0          0

How can I get such output?

jezrael · Accepted Answer · 2020-05-12 10:26:18Z

2

Use GroupBy.size with Series.reindex:

df = df1.groupby(['OnlyDate','UserID','TBID']).size()
mux = pd.MultiIndex.from_product(df.index.levels)
df = df.reindex(mux, fill_value=0).sort_index(level=[1,2,0]).reset_index(name='count')

print (df.head(10))
     OnlyDate  UserID  TBID  count
0  2019-01-01     101  1000      1
1  2019-01-02     101  1000      0
2  2019-01-03     101  1000      0
3  2019-01-04     101  1000      0
4  2019-01-05     101  1000      0
5  2019-01-01     101  1400      0
6  2019-01-02     101  1400      1
7  2019-01-03     101  1400      0
8  2019-01-04     101  1400      0
9  2019-01-05     101  1400      0
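
To make the reindex idea concrete, here is a minimal self-contained sketch of the same approach that also pulls out the block for a single user. The names counts, full_index, full and user_301 are only illustrative and are not part of the answer above, and the grid is built from the values that actually occur in the data, so a date, UserID or TBID that never appears in df1 will not show up in it.

```python
import pandas as pd

# Sample data from the question.
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301],
        ['2019-01-05', 1400, 301]]
df1 = pd.DataFrame(data, columns=['OnlyDate', 'TBID', 'UserID'])

# Count rows for each (OnlyDate, UserID, TBID) combination present in the data.
counts = df1.groupby(['OnlyDate', 'UserID', 'TBID']).size()

# Build every possible combination of the observed values of each level.
full_index = pd.MultiIndex.from_product(counts.index.levels,
                                        names=counts.index.names)

# Reindex against the full grid; combinations with no rows get a count of 0.
full = (counts.reindex(full_index, fill_value=0)
              .sort_index(level=['UserID', 'TBID', 'OnlyDate'])
              .reset_index(name='count'))

# Select the block for one user (illustrative: UserID 301).
user_301 = full[full['UserID'] == 301].reset_index(drop=True)
print(user_301)
```

Referring to the levels by name in sort_index instead of by position is a design choice here: it keeps the call readable and avoids depending on the order of the grouping keys.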

edited May 12, 2020 at 10:26

answered May 12, 2020 at 7:56


2 Comments

Mark Wang Over a year ago

I don't think the UserID 301 part is right...(missing 2000)

Shivkumar kondi Over a year ago

Hi @jezrael, it's only sort_index(level=[1,0]), as level 2 is not there. Overall it doesn't provide the exact solution, but it gave me a hint. I am also facing a MemoryError, as my DataFrame has shape [250K, 4].

Mark Wang · Accepted Answer · 2020-05-12 10:26:15Z

1

The basic idea is to count rows with groupby(...).size(). The nuisance is filling the missing index combinations with 0, which can be done either with reindex or by reshaping the data. Below is the reshaping approach:

(df1.groupby(['OnlyDate','UserID','TBID'])
    .size()
    .unstack('OnlyDate', fill_value=0) 
    .unstack('UserID', fill_value=0)
    .unstack()
    .reset_index(name='count'))
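
To see what each unstack call contributes, the chain can be split into steps, as in the rough sketch below. The intermediate names sizes, step1, step2, step3 and result are only illustrative, and the final sort is an addition to match the ordering in the question.

```python
import pandas as pd

# Sample data from the question.
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301],
        ['2019-01-05', 1400, 301]]
df1 = pd.DataFrame(data, columns=['OnlyDate', 'TBID', 'UserID'])

# Series indexed by (OnlyDate, UserID, TBID): only combinations that occur.
sizes = df1.groupby(['OnlyDate', 'UserID', 'TBID']).size()

# Move OnlyDate into the columns: rows are (UserID, TBID) pairs, and every
# pair now has a value for every date (0 where there were no rows).
step1 = sizes.unstack('OnlyDate', fill_value=0)

# Move UserID into the columns as well: rows are TBID values, columns are
# (OnlyDate, UserID) pairs, again filled with 0 where nothing was observed.
step2 = step1.unstack('UserID', fill_value=0)

# Unstack the one remaining row level: the result is a Series indexed by
# (OnlyDate, UserID, TBID) that covers the full grid of combinations.
step3 = step2.unstack()

# Back to a flat DataFrame, ordered per user and per TBID as in the question.
result = (step3.reset_index(name='count')
               .sort_values(['UserID', 'TBID', 'OnlyDate'])
               .reset_index(drop=True))

for name, obj in [('sizes', sizes), ('step1', step1), ('step2', step2), ('step3', step3)]:
    print(name, obj.shape)
print(result.head(10))
```

Each unstack widens the table to cover all values of the level being moved, which is why the final Series contains zero-count combinations that the original groupby result did not have.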

answered May 12, 2020 at 10:26


1 Comment

Shivkumar kondi Over a year ago

Exactly, I need to understand the unstack mechanism. Thanks, Mark.
