Python groupby apply returning odd dataframe

Question

Here is my function:

def calculate_employment_two_digit_industry(df):
    df['intersection'] = df['racEmpProb'] * df['wacEmpProb']
    df['empProb'] = df['intersection'] / df['intersection'].sum()
    df['newEmp'] = df['empProb'] * df['Emp']

    # Rename on a new frame rather than in place, which avoids modifying a slice
    df = df[['h_zcta', 'w_zcta', 'indID', 'newEmp', 'empProb']]
    df = df.rename(columns = {'newEmp' : 'Emp'})

    return df

Here is my test:

import pandas

def test_calculate_employment_two_digit_industry():
    testDf = pandas.DataFrame({'h_zcta'     : [99163, 99163, 99163, 99163],
                           'w_zcta'     : [83843, 83843, 83843, 83843],
                           'indID'      : [11, 21, 22, 42],
                           'Emp'        : [20, 20, 40, 40],
                           'racEmpProb' : [0.5, 0.5, 0.6, 0.4],
                           'wacEmpProb' : [0.7, 0.3, 0.625, 0.375],
                           '1_digit'    : [1, 1, 2, 2]})

    expectedDf = pandas.DataFrame({'h_zcta'   : [99163, 99163, 99163, 99163],
                             'w_zcta'   : [83843, 83843, 83843, 83843],
                             'indID'    : [11, 21, 22, 42],
                             'Emp'      : [14, 6, 28.5716, 11.4284],
                             'empProb'  : [0.7, 0.3, 0.71429, 0.28571]})

    expectedDf = expectedDf[['h_zcta', 'w_zcta', 'indID', 'Emp', 'empProb']]

    final = testDf.groupby(['h_zcta', 'w_zcta', '1_digit'])\
               .apply(calculate_employment_two_digit_industry).reset_index()

    assert expectedDf.equals(final)

As you can see in the test, I have specified what I expect the function to return. Setting aside potential mathematical errors in the code, which I can fix, here is the dataframe that is actually returned. How do I get it to return a normal dataframe (if normal is the correct term), i.e., one with just columns and rows and without the extra index layers?

                      h_zcta  w_zcta  indID   Emp  empProb
h_zcta w_zcta 1_digit                                        
99163  83843  1       0   99163   83843     11  14.0      0.7
                      1   99163   83843     21   6.0      0.3
              2       0   99163   83843     22  28.0      0.7
                      1   99163   83843     42  12.0      0.3

Thank you in advance.

zemekeneng · Accepted Answer · 2016-08-09 20:30:04Z

You need .reset_index(drop=True)

That is:

final = testDf.groupby(['h_zcta', 'w_zcta', '1_digit']).apply(
    calculate_employment_two_digit_industry).reset_index(drop=True)

>>> final.index
RangeIndex(start=0, stop=4, step=1)
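A minimal, self-contained sketch of the same fix on toy data. The column names `key` and `value` and the helper `share_of_group` are hypothetical stand-ins, not the question's data. It assumes the applied function returns a DataFrame whose index differs from the group's original index, which matches the 0, 1, 0, 1 inner level shown in the question's output.

```python
import pandas as pd

# Toy data (hypothetical): two groups of two rows each.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

def share_of_group(group):
    # Work on a re-indexed copy so the original group is not mutated.
    out = group.reset_index(drop=True)
    out["share"] = out["value"] / out["value"].sum()
    return out

# The group key is prepended to each result's own index,
# so the combined result has a two-level index.
nested = df.groupby("key")[["value"]].apply(share_of_group)

# drop=True discards every index level instead of turning them into columns.
flat = nested.reset_index(drop=True)

print(nested.index.nlevels)
print(flat.index)
```

Note that `reset_index(drop=True)` throws away the group keys held in the index. That is harmless in the question, because the function already returns `h_zcta` and `w_zcta` as ordinary columns.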

edited Aug 9, 2016 at 20:30

answered Aug 9, 2016 at 20:12

5 Comments

Paul H Over a year ago

Alternatively, I believe testDf.groupby([...], as_index=False) might also work.

zemekeneng Over a year ago

that still gives MultiIndex(...) for final.index

j riot Over a year ago

Thank you, that worked perfectly. If you have time, would you mind explaining why this occurs (what is going on during the groupby and apply to create a MultiIndex)?

zemekeneng Over a year ago

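As I understand it, groupby makes a DataFrameGroupBy object, then .apply() iterates over the groups and calls your function once per group, passing that group's rows as a DataFrame (not row by row). The per-group results are then concatenated, and because each result is a DataFrame with its own index, the group keys are prepended as extra index levels. I think that with the default settings you typically end up with a MultiIndex whenever the function passed to .groupby().apply() returns a DataFrame.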

j riot Over a year ago

Thank you for the explanation.
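The mechanism described in the comments above can be sketched roughly as follows. This is an approximation for illustration, not pandas' actual implementation: it compares `.groupby().apply()` with manually concatenating the per-group results under their group keys. The names `key`, `value`, and `reindexed` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): two groups of two rows each.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})

def reindexed(group):
    # Called once per group; `group` is a DataFrame holding that group's rows.
    return group.reset_index(drop=True)

applied = df.groupby("key")[["value"]].apply(reindexed)

# Rough manual equivalent: run the function on each group, then concatenate
# the results with the group key as an outer index level.
manual = pd.concat(
    {name: reindexed(grp[["value"]]) for name, grp in df.groupby("key")},
    names=["key"],
)

print(applied.index.tolist())
print(manual.index.tolist())
```

Both results carry the group key as the outer index level and each group's own 0, 1 index as the inner level, which is where the MultiIndex in the question comes from.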
