Pandas groupby(), then add missing rows

Question

Suppose we have some measurement for each city each year:

   City   Year    Income
1  NYC    2000    1.1
2  NYC    2001    2.4
3  NYC    2002    3.4
...
12 London 2000    1.2
13 London 2001    1.5
...

After df.groupby('City'):

   City   Year    Income
   NYC    2000    1.1
          2001    2.4
          2002    3.4
   London 2000    1.2
          2001    1.5
   Paris  2000    3.2
          2001    1.31
          2002    2.2

I know that the Year values for each city should be [2000, 2001, 2002]. How can I add the missing rows? Here, London has no row for 2002, so I want to get:

   City   Year    Income
   NYC    2000    1.1
          2001    2.4
          2002    3.4
   London 2000    1.2
          2001    1.5
          2002    NA
   Paris  2000    3.2
          2001    1.31
          2002    2.2

This may answer your question: stackoverflow.com/a/54033038/9357244
– chris, Commented Mar 13, 2020 at 20:43
Hi! The solution in the link works: df.Year = pd.Categorical(df.Year); df.groupby(['City','Year']).sum(). This adds rows for the missing years. Thanks a lot!
– Zachary HUANG, Commented Mar 13, 2020 at 20:49
df.groupby(['City','Year']).sum().unstack('Year').stack('Year', dropna=False)
– Quang Hoang, Commented Mar 14, 2020 at 0:31
How do we specify the range with this unstack and stack approach? For example, what if I want records from 1998 to 2010 for the question above? How do we specify the size or range of the year records?
– Akhilesh Pothuri, Commented Aug 3, 2020 at 4:33
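
The Categorical approach from the comments can also cover a chosen range of years, as asked in the last comment, by listing the categories explicitly. The following is a minimal sketch on toy data, not a definitive implementation: the sample frame and the wanted_years name are assumptions for illustration, and observed=False is passed explicitly because recent pandas releases warn against relying on the default for that argument.

```python
import pandas as pd

# Toy data mirroring the question (assumption); London has no row for 2002.
df = pd.DataFrame({
    "City": ["NYC", "NYC", "NYC", "London", "London"],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "Income": [1.1, 2.4, 3.4, 1.2, 1.5],
})

# Declare every year that should appear, including years absent from the data.
# Hypothetical name; widen as needed, e.g. list(range(1998, 2011)).
wanted_years = list(range(2000, 2003))
df["Year"] = pd.Categorical(df["Year"], categories=wanted_years)

# observed=False keeps categories that never occur for a given city.
# mean() leaves those empty groups as NaN.
result = df.groupby(["City", "Year"], observed=False)["Income"].mean()
print(result)
```

Note that sum() would report 0 rather than NaN for the missing combinations, because the sum of an empty group is 0.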

Bera · Accepted Answer · 2023-08-02 06:57:34Z

As suggested in the comments, use unstack followed by stack:

import pandas as pd

d = {"city":["A", "A", "A", "A", "B", "B", "B", "B"],
     "income":[1,2,2,3, 4,3,2,1],
     "year": [2000,2000,2001,2001, 2000,2001,2002,2000]}

df = pd.DataFrame(d)

# Calculate the mean income per city and year
df.groupby(["city", "year"])["income"].mean()
# city  year
# A     2000    1.5
#       2001    2.5
# B     2000    2.5
#       2001    3.0
#       2002    2.0

# City A has no income rows for 2002 in the original data,
# so that year is missing from the groupby result.

# Unstack to pivot the years into columns, filling the missing cells with 0
df.groupby(["city", "year"])["income"].mean().unstack(fill_value=0)
# year  2000  2001  2002
# city                  
# A      1.5   2.5   0.0
# B      2.5   3.0   2.0

# Then stack to return to one row per (city, year) pair
df.groupby(["city", "year"])["income"].mean().unstack(fill_value=0).stack()
# city  year
# A     2000    1.5
#       2001    2.5
#       2002    0.0
# B     2000    2.5
#       2001    3.0
#       2002    2.0
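
If the missing rows should hold NaN rather than 0, or the years should span a fixed range regardless of what appears in the data, one alternative is to reindex the groupby result against the full set of (city, year) pairs. This is a sketch under stated assumptions rather than part of the approach above: it swaps unstack/stack for MultiIndex.from_product plus reindex, and the years list is an illustrative choice.

```python
import pandas as pd

d = {"city": ["A", "A", "A", "A", "B", "B", "B", "B"],
     "income": [1, 2, 2, 3, 4, 3, 2, 1],
     "year": [2000, 2000, 2001, 2001, 2000, 2001, 2002, 2000]}
df = pd.DataFrame(d)

means = df.groupby(["city", "year"])["income"].mean()

# Every (city, year) pair that should exist. The year range is an assumption
# for illustration; use list(range(1998, 2011)) to cover 1998 to 2010.
years = list(range(2000, 2003))
full_index = pd.MultiIndex.from_product(
    [df["city"].unique(), years], names=["city", "year"]
)

# reindex adds the missing pairs and fills them with NaN by default
complete = means.reindex(full_index)
print(complete)
# city  year
# A     2000    1.5
#       2001    2.5
#       2002    NaN
# B     2000    2.5
#       2001    3.0
#       2002    2.0
```

Passing fill_value=0 to reindex would reproduce the zero-filled result shown above.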
