Python Pandas Groupby Resetting Values Based on Index

Question

So I have a dataframe that contains some wrong information that I want to fix:

import pandas as pd
tuples_index = [(1,1990), (2,1999), (2,2002), (3,1992), (3,1994), (3,1996)]
index = pd.MultiIndex.from_tuples(tuples_index, names=['id', 'FirstYear'])
df = pd.DataFrame([2007, 2006, 2006, 2000, 2000, 2000], index=index, columns=['LastYear'] )


df
Out[4]: 
              LastYear
id FirstYear          
1  1990           2007
2  1999           2006
   2002           2006
3  1992           2000
   1994           2000
   1996           2000

id refers to a business, and this DataFrame is a small example slice of a much larger one that shows how a business moves. Each record is a unique location, and I want to capture the first and last year the business was there. The current 'LastYear' is accurate for businesses with only one record, and for the latest record of businesses with more than one record. For every earlier record, 'LastYear' should instead be the 'FirstYear' of the business's next record. What the df should look like at the end is this:

              LastYear
id FirstYear          
1  1990           2007
2  1999           2002
   2002           2006
3  1992           1994
   1994           1996
   1996           2000

And what I did to get it there was super clunky:

multirecord = df.groupby(level=0).filter(lambda x: len(x) > 1)
multirecord_grouped = multirecord.groupby(level=0)

ls = []
for _, group in multirecord_grouped:
    levels = group.index.get_level_values(level=1).tolist() + [group['LastYear'].iloc[-1]]
    ls += levels[1:]

multirecord['LastYear'] = pd.Series(ls, index=multirecord.index.copy())
final_joined = pd.concat([df.groupby(level=0).filter(lambda x: len(x) == 1),multirecord]).sort_index()

Is there a better way?

piRSquared · Accepted Answer · 2016-08-23 23:34:27Z


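# For each business, take the FirstYear values and shift them up one row,
# so every record gets the next record's FirstYear (NaN for the last record).
shift_year = lambda df: df.index.get_level_values('FirstYear').to_series().shift(-1)
# Fill those NaNs from the original LastYear, then restore the int dtype and column name.
df.groupby(level=0).apply(shift_year) \
    .combine_first(df.LastYear).astype(int) \
    .rename('LastYear').to_frame()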


2 Comments

Kartik Over a year ago

Who else, but you, could get all that done with just one line?

jesseWUT Over a year ago

Sorry for not mentioning this from the get-go, but the DataFrame this is being run on has ~54 million rows. This code is very elegant, but it will take hours to run. Can you think of anything that could speed it up?
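
On the speed question in the comment above, one option that may be worth trying is to skip `groupby(...).apply` with a Python lambda and use the built-in grouped `shift` on a plain column instead. The following is only a minimal sketch, not part of the accepted answer: it assumes the rows can be sorted by id and FirstYear and that temporarily flattening the index into columns is acceptable. The names `flat`, `next_first` and `result` are introduced here for illustration, and no timing on the 54-million-row frame is claimed.

```python
import pandas as pd

# Sample frame from the question.
tuples_index = [(1, 1990), (2, 1999), (2, 2002), (3, 1992), (3, 1994), (3, 1996)]
index = pd.MultiIndex.from_tuples(tuples_index, names=['id', 'FirstYear'])
df = pd.DataFrame([2007, 2006, 2006, 2000, 2000, 2000],
                  index=index, columns=['LastYear'])

# Assumption: sorting by (id, FirstYear) puts each business's records in order.
flat = df.sort_index().reset_index()

# Next record's FirstYear within each business; NaN on each business's last record.
next_first = flat.groupby('id')['FirstYear'].shift(-1)

# Keep the original LastYear where there is no next record.
flat['LastYear'] = next_first.fillna(flat['LastYear']).astype(int)

result = flat.set_index(['id', 'FirstYear'])
print(result)
```

On the sample data this gives the same LastYear column as the desired output in the question. Whether it is fast enough at full size would need to be measured on the real data.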
