2

what I have so far is a normal transactional dataframe with the following columns:

store | item | year | month | day | sales

'year' can be 2015, 2016, 2017.

With that I created a summary dataframe:

store_item_years = df.groupby(
['store','item','year'])['sales'].agg(
[np.sum, np.mean, np.std, np.median, np.min, np.max]).unstack(
fill_value=0)

The last one results in a Multi-Index with 2 levels, like this:

               sum                  mean
        year | 2015 | 2016 | 2017 | 2015 | 2016 | 2017 | ... 
store | item   sum1    ...   ...    mean1  mean2  ...  | ...    

Now I'd like to merge the summary table back onto the transactional one:

store | item | year | month | day | sales | + | sum+'by'+year | mean+'by'+year
               2015                              sum1              mean1
               2016                              sum2              mean2
               2017                              ...                ...

I am trying to merge with the following:

df = pd.merge(df, store_item_years, 
             left_on=['store', 'item', 'year'], 
             right_on=['store', 'item', 'year'],
             how='left')

which results in the following error:

KeyError: 'year'

Any ideas? I am just getting my head around groupby. I haven't looked into PivotTable yet.

Please keep in mind that the problem is simplified. The number of store_item combinations is 200+K and other groupbys having 300+ columns. But always the same principle.

Thanks a lot in advance.

2 Answers 2

2

I think you need first remove unstack and then use join for left join:

store_item_years = df.groupby(
['store','item','year'])['sales'].agg(
[np.sum, np.mean, np.std, np.median, np.min, np.max])

df = df.join(store_item_years, on=['store','item','year'])
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks! Yes, I just noticed it as well. I will accept your answer since you beat me to mine :)
0

found the culprit. Removed .unstack().

store_item_years = df.groupby(
   ['store','item','year'])['sales'].agg(
   [np.sum, np.mean, np.std, np.median, np.min, np.max])

The following to keep the context:

store_item_years.columns = store_item_years.columns+'_by_year'

And merge like this:

pd.merge(df, store_item_years.reset_index(), 
         left_on=['store', 'item', 'year'], 
         right_on=['store', 'item', 'year'],
         how='left')

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.