Suppose we have some measurement for each city each year:

   City   Year    Income
1  NYC    2000    1.1
2  NYC    2001    2.4
3  NYC    2002    3.4
...
12 London 2000    1.2
13 London 2001    1.5
...

df.groupby('City')

   City   Year    Income
   NYC    2000    1.1
          2001    2.4
          2002    3.4
   London 2000    1.2
          2001    1.5
   Paris  2000    3.2
          2001    1.31
          2002    2.2

I know the years for each city should be [2000, 2001, 2002]. How can I add the missing rows? Here, London has no row for 2002, so I want to get:

   City   Year    Income
   NYC    2000    1.1
          2001    2.4
          2002    3.4
   London 2000    1.2
          2001    1.5
          2002    NA
   Paris  2000    3.2
          2001    1.31
          2002    2.2
Comments

  • Hi Zachary, can you post your current code? Commented Mar 13, 2020 at 20:42
  • This may answer your question: stackoverflow.com/a/54033038/9357244 Commented Mar 13, 2020 at 20:43
  • Hi! The solution in the link works: df.Year = pd.Categorical(df.Year); df.groupby(['City','Year']).sum(). This adds rows for the missing years. Thanks a lot! Commented Mar 13, 2020 at 20:49
  • df.groupby(['City','Year']).sum().unstack('Year').stack('Year', dropna=False). Commented Mar 14, 2020 at 0:31
  • How do we specify the range with this unstack and stack approach? For example, if I want records from 1998 to 2010 for the question above, how do we specify the size or range of the year records? Commented Aug 3, 2020 at 4:33
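The Categorical approach mentioned in the comments can also be given an explicit list of years, which is one way to request a fixed range. The following is only a sketch under assumptions: the sample rows are copied from the question, the 2000 to 2002 range is illustrative (a range such as 1998 to 2010 would be passed the same way), and mean() is used instead of sum() so that missing combinations show up as NaN.

```python
import pandas as pd

# Sample rows taken from the question; London has no 2002 entry
df = pd.DataFrame({
    "City": ["NYC", "NYC", "NYC", "London", "London"],
    "Year": [2000, 2001, 2002, 2000, 2001],
    "Income": [1.1, 2.4, 3.4, 1.2, 1.5],
})

# Assumed range of years that every city should have
years = list(range(2000, 2003))

# Declaring the categories explicitly makes groupby aware of years
# that never occur in the data
df["Year"] = pd.Categorical(df["Year"], categories=years)

# observed=False keeps unobserved categories in the result;
# mean() leaves them as NaN
result = df.groupby(["City", "Year"], observed=False)["Income"].mean()
out = result.reset_index()
print(out)
```

Because each (City, Year) pair has at most one row here, the mean is just the original value.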

1 Answer

As suggested in the comments, you can use unstack followed by stack:

import pandas as pd

d = {"city":["A", "A", "A", "A", "B", "B", "B", "B"],
     "income":[1,2,2,3, 4,3,2,1],
     "year": [2000,2000,2001,2001, 2000,2001,2002,2000]}

df = pd.DataFrame(d)

# Calculate the mean income per city and year
df.groupby(["city", "year"])["income"].mean()
# city  year
# A     2000    1.5
#       2001    2.5
# B     2000    2.5
#       2001    3.0
#       2002    2.0

# City A has no income rows for 2002 in the original data,
# so that (city, year) pair is missing from the groupby result.

# Unstack to pivot the years into columns, filling missing values with 0
df.groupby(["city", "year"])["income"].mean().unstack(fill_value=0)
# year  2000  2001  2002
# city                  
# A      1.5   2.5   0.0
# B      2.5   3.0   2.0

# Then stack to return to the long (city, year) format
df.groupby(["city", "year"])["income"].mean().unstack(fill_value=0).stack()
# city  year
# A     2000    1.5
#       2001    2.5
#       2002    0.0
# B     2000    2.5
#       2001    3.0
#       2002    2.0
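If the missing rows should hold NaN rather than 0 (as in the question), or if the set of years needs to be stated explicitly, reindexing against a full (city, year) index is another option. This is a minimal sketch, not part of the answer above: it reuses the same example data, and the year range 2000 to 2002 is an assumption that can be swapped for any other range.

```python
import pandas as pd

d = {"city": ["A", "A", "A", "A", "B", "B", "B", "B"],
     "income": [1, 2, 2, 3, 4, 3, 2, 1],
     "year": [2000, 2000, 2001, 2001, 2000, 2001, 2002, 2000]}
df = pd.DataFrame(d)

means = df.groupby(["city", "year"])["income"].mean()

# Build every (city, year) pair we expect to see.
# The year range is an assumption for illustration.
full_index = pd.MultiIndex.from_product(
    [means.index.levels[0], range(2000, 2003)],
    names=["city", "year"],
)

# Pairs absent from the groupby result are added with NaN
filled = means.reindex(full_index)
print(filled)
```

Passing fill_value=0 to reindex would reproduce the zero-filled result instead of NaN.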