Here is my function:

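def calculate_employment_two_digit_industry(df):
    # Work on a copy so the group handed over by groupby is not mutated.
    df = df.copy()
    df['intersection'] = df['racEmpProb'] * df['wacEmpProb']
    df['empProb'] = df['intersection'] / df['intersection'].sum()
    df['newEmp'] = df['empProb'] * df['Emp']

    # Keep only the output columns; rename without inplace to avoid
    # a SettingWithCopyWarning on the column slice.
    df = df[['h_zcta', 'w_zcta', 'indID', 'newEmp', 'empProb']]
    return df.rename(columns={'newEmp': 'Emp'})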

Here is my test:

import pandas

def test_calculate_employment_two_digit_industry():
    testDf = pandas.DataFrame({'h_zcta'     : [99163, 99163, 99163, 99163],
                           'w_zcta'     : [83843, 83843, 83843, 83843],
                           'indID'      : [11, 21, 22, 42],
                           'Emp'        : [20, 20, 40, 40],
                           'racEmpProb' : [0.5, 0.5, 0.6, 0.4],
                           'wacEmpProb' : [0.7, 0.3, 0.625, 0.375],
                           '1_digit'    : [1, 1, 2, 2]})

    expectedDf = pandas.DataFrame({'h_zcta'   : [99163, 99163, 99163, 99163],
                             'w_zcta'   : [83843, 83843, 83843, 83843],
                             'indID'    : [11, 21, 22, 42],
                             'Emp'      : [14, 6, 28.5716, 11.4284],
                             'empProb'  : [0.7, 0.3, 0.71429, 0.28571]})

    expectedDf = expectedDf[['h_zcta', 'w_zcta', 'indID', 'Emp', 'empProb']]

    final = testDf.groupby(['h_zcta', 'w_zcta', '1_digit'])\
               .apply(calculate_employment_two_digit_industry).reset_index()

    assert expectedDf.equals(final)

As you can see in the test, I have written out what I expect the function to return. Aside from potential mathematical errors in the code, which I can fix, the dataframe below is what is actually returned. How do I get it to return a normal dataframe (if "normal" is the correct term), i.e., one without the nested index levels, just columns and rows?

                      h_zcta  w_zcta  indID   Emp  empProb
h_zcta w_zcta 1_digit                                        
99163  83843  1       0   99163   83843     11  14.0      0.7
                      1   99163   83843     21   6.0      0.3
              2       0   99163   83843     22  28.0      0.7
                      1   99163   83843     42  12.0      0.3

Thank you in advance.

1 Answer

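You need .reset_index(drop=True). With drop=True the index levels added by groupby().apply() are discarded instead of being inserted as columns, which would clash with the h_zcta and w_zcta columns you already have.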

That is:

final = testDf.groupby(['h_zcta', 'w_zcta', '1_digit']).apply(
    calculate_employment_two_digit_industry).reset_index(drop=True)

>>> final.index
RangeIndex(start=0, stop=4, step=1)
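
If you want something you can run without the question's data, here is a minimal sketch. The column names group, weight, and share and the helper normalise are made up for illustration; the weights are the racEmpProb * wacEmpProb products from the test in the question. I select the non-grouping columns before apply so the helper never touches the grouping column, and I pass group_keys explicitly because the default handling has differed between pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the question's data.
df = pd.DataFrame({
    "group":  [1, 1, 2, 2],
    "indID":  [11, 21, 22, 42],
    "weight": [0.35, 0.15, 0.375, 0.15],
})

def normalise(g):
    """Return each row's share of its group's total weight."""
    out = g[["indID"]].copy()
    out["share"] = g["weight"] / g["weight"].sum()
    # Give every group its own 0..n-1 index, as in the question's output.
    return out.reset_index(drop=True)

cols = ["indID", "weight"]

# The group key is stacked on top of each result's index: two levels.
nested = df.groupby("group", group_keys=True)[cols].apply(normalise)
print(nested.index.nlevels)

# Discard all index levels and start again from a plain range.
flat = nested.reset_index(drop=True)
print(flat.index)  # RangeIndex(start=0, stop=4, step=1)

# Alternative: ask groupby not to add the keys in the first place.
# The index is then whatever each call returned, so it can repeat.
no_keys = df.groupby("group", group_keys=False)[cols].apply(normalise)
print(no_keys.index.nlevels)
```

The group_keys=False variant leaves a single-level index, but it is not renumbered, so reset_index(drop=True) is still the simplest way to get a clean 0..n-1 index.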

5 Comments

Alternatively, I believe testDf.groupby([...], as_index=False) might also work.
That still gives MultiIndex(...) for final.index.
Thank you, that worked perfectly. If you have time, would you mind explaining why this occurs (what is going on during the groupby and apply to create a MultiIndex)?
As I understand it, groupby makes a DataFrameGroupBy object, then .apply() calls your function once per group, passing each group in as a DataFrame, and concatenates the results. When the function returns a DataFrame, the group keys are added as outer index levels on top of each result's own index, and that is the MultiIndex. I think you usually end up with a MultiIndex if you .groupby().apply() a function that returns a DataFrame, unless you pass group_keys=False.
Thank you for the explanation.
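
To make the explanation in the comments concrete, here is a rough sketch of what I understand .apply() to be doing, written out by hand with a loop and pandas.concat. It is not pandas' actual implementation, and the data and the helper share_of_group are invented for the example.

```python
import pandas as pd

# Hypothetical data: two groups of two rows each.
df = pd.DataFrame({"group": [1, 1, 2, 2], "value": [10, 30, 5, 15]})

def share_of_group(g):
    """Return each row's share of the group total as a one-column frame."""
    out = pd.DataFrame({"share": g["value"] / g["value"].sum()})
    return out.reset_index(drop=True)

# By hand: call the function once per group, then concatenate the
# pieces. The dict keys become the outer level of the resulting index.
pieces = {key: share_of_group(g) for key, g in df.groupby("group")}
manual = pd.concat(pieces)

# The same thing via groupby().apply().
applied = df.groupby("group", group_keys=True)[["value"]].apply(share_of_group)

# Both indexes pair the group key with the per-group row number.
print(manual.index.tolist())
print(applied.index.tolist())
```

Each index entry is a (group key, row number) pair, which is why the result looks layered until you call reset_index(drop=True).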
