1

So have I some code that creates a new dataframe that counts the number of unique states with at least 1 non null value and the total number of non null values grouped by year. Code works fine, but I want to modify it to include a new rows that lists the unique states.

This is my data:

    year    state   var1    var2    
0   2018    CA       NaN     2    
1   2018    TX       1       NaN    
2   2018    FL       NaN     NaN  
3   2018    AL       1       2    
4   2018    AL       NaN     1   
6   2019    CA       NaN     NaN  
7   2019    TX       1       1    
8   2019    FL       NaN     NaN  
9   2019    AL       2       1    
10  2019    AK       2       NaN 

This my current output:

                                                          2018     2019
var1
      Number of unique states with at least 1 non-null:   2        3
      Number of respondents with non-null var:            2        3
      Average:                                            1        1
var2
      Number of unique states with at least 1 non-null:   2        2   
      Number of respondents with non-null var:            3        2
      Average:                                            1.5      1

This is the code that I'm working with:

c = df.groupby(['year', 'state']).count()
res = c.groupby('YEAR').agg([np.count_nonzero, sum]).T
res.index = res.index.set_levels(['Number of unique states with at least 1 non-null:', 
                                  'Number of respondents with non-null var:'], level=1)

z = res.swaplevel().T
res4 = pd.concat([z, pd.concat([z['Number of respondents with non-null var:'] / 
                                z['Number of unique states with at least 1 non-null:']], 
                              keys=['Average:'], axis=1),], 
                axis=1).T.swaplevel().sort_index()

And this is what I want the new output to look like:

                                                          2018         2019
var1
      Number of unique states with at least 1 non-null:   2            3
      Unique states with at least 1 non-null:             [TX, AL]     [TX, AL, AK]
      Number of respondents with non-null var:            2            3
      Average:                                            1            1
var2
      Number of unique states with at least 1 non-null:   2            2   
      Unique states with at least 1 non-null:             [AL, CA]     [TX, AL]
      Number of respondents with non-null var:            3            2
      Average:                                            1.5          1

Basically I want this row included "Unique states with at least 1 non-null:", listing the names of the states. How can I do this?

1 Answer 1

2

I create new function f for states and also aggregate mean for label in MultiIndex, values are set by division selected rows by DataFrame.xs, last rename for new second level of MultiIndex:

c = df.groupby(['year', 'state']).count()

def f(x):
    return x.index[x.ne(0)].droplevel(0).tolist()

df = c.groupby(['year']).agg([np.count_nonzero,f,'sum', 'mean']).T
df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0)
df.loc[pd.IndexSlice[:,'mean'],:] =  df11.rename({'sum':'mean'}).astype(np.float64).round(1)

d = {'count_nonzero':'Number of unique states with at least 1 non-null:', 
     'sum':'Number of respondents with non-null var:',
     'f':'Unique states with at least 1 non-null',
     'mean':'Average:'}
df = df.rename(d)
print (df)
year                                                        2018          2019
var1 Number of unique states with at least 1 non-null:         2             3
     Unique states with at least 1 non-null             [AL, TX]  [AK, AL, TX]
     Number of respondents with non-null var:                  2             3
     Average:                                                1.0           1.0
var2 Number of unique states with at least 1 non-null:         2             2
     Unique states with at least 1 non-null             [AL, CA]      [AL, TX]
     Number of respondents with non-null var:                  3             2
     Average:                                                1.5           1.0
Sign up to request clarification or add additional context in comments.

10 Comments

I attempted to use your function on my full dataset and I'm getting the following error: ZeroDivisionError: division by zero
@miqcrom26 - What code raisse error? df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0) ? What is your pandas version?
Yes, I believe that's the part that's giving me problems. I currently using pandas version 1.3.5
@miqcrom26 - Can you try df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1).replace({ 0 : np.nan }), level=0) instaed df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0) ?
Perfect! One last thing, I'm trying to round off the 'Average' value to 1 decimal place using df = df.round(1) on the entire dataframe, but it's not working. Do you know what I'm doing wrong?
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.