How to add a list of unique values from a column to a dataframe?

Question

So have I some code that creates a new dataframe that counts the number of unique states with at least 1 non null value and the total number of non null values grouped by year. Code works fine, but I want to modify it to include a new rows that lists the unique states.

This is my data:

    year    state   var1    var2    
0   2018    CA       NaN     2    
1   2018    TX       1       NaN    
2   2018    FL       NaN     NaN  
3   2018    AL       1       2    
4   2018    AL       NaN     1   
6   2019    CA       NaN     NaN  
7   2019    TX       1       1    
8   2019    FL       NaN     NaN  
9   2019    AL       2       1    
10  2019    AK       2       NaN

This my current output:

                                                          2018     2019
var1
      Number of unique states with at least 1 non-null:   2        3
      Number of respondents with non-null var:            2        3
      Average:                                            1        1
var2
      Number of unique states with at least 1 non-null:   2        2   
      Number of respondents with non-null var:            3        2
      Average:                                            1.5      1

This is the code that I'm working with:

c = df.groupby(['year', 'state']).count()
res = c.groupby('YEAR').agg([np.count_nonzero, sum]).T
res.index = res.index.set_levels(['Number of unique states with at least 1 non-null:', 
                                  'Number of respondents with non-null var:'], level=1)

z = res.swaplevel().T
res4 = pd.concat([z, pd.concat([z['Number of respondents with non-null var:'] / 
                                z['Number of unique states with at least 1 non-null:']], 
                              keys=['Average:'], axis=1),], 
                axis=1).T.swaplevel().sort_index()

And this is what I want the new output to look like:

                                                          2018         2019
var1
      Number of unique states with at least 1 non-null:   2            3
      Unique states with at least 1 non-null:             [TX, AL]     [TX, AL, AK]
      Number of respondents with non-null var:            2            3
      Average:                                            1            1
var2
      Number of unique states with at least 1 non-null:   2            2   
      Unique states with at least 1 non-null:             [AL, CA]     [TX, AL]
      Number of respondents with non-null var:            3            2
      Average:                                            1.5          1

Basically I want this row included "Unique states with at least 1 non-null:", listing the names of the states. How can I do this?

jezrael · Accepted Answer · 2022-10-03 11:33:10Z

2

I create new function f for states and also aggregate mean for label in MultiIndex, values are set by division selected rows by DataFrame.xs, last rename for new second level of MultiIndex:

c = df.groupby(['year', 'state']).count()

def f(x):
    return x.index[x.ne(0)].droplevel(0).tolist()

df = c.groupby(['year']).agg([np.count_nonzero,f,'sum', 'mean']).T
df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0)
df.loc[pd.IndexSlice[:,'mean'],:] =  df11.rename({'sum':'mean'}).astype(np.float64).round(1)

d = {'count_nonzero':'Number of unique states with at least 1 non-null:', 
     'sum':'Number of respondents with non-null var:',
     'f':'Unique states with at least 1 non-null',
     'mean':'Average:'}
df = df.rename(d)
print (df)
year                                                        2018          2019
var1 Number of unique states with at least 1 non-null:         2             3
     Unique states with at least 1 non-null             [AL, TX]  [AK, AL, TX]
     Number of respondents with non-null var:                  2             3
     Average:                                                1.0           1.0
var2 Number of unique states with at least 1 non-null:         2             2
     Unique states with at least 1 non-null             [AL, CA]      [AL, TX]
     Number of respondents with non-null var:                  3             2
     Average:                                                1.5           1.0

edited Oct 3, 2022 at 11:33

answered Oct 3, 2022 at 8:16

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

miqcrom26 Over a year ago

I attempted to use your function on my full dataset and I'm getting the following error: ZeroDivisionError: division by zero

jezrael Over a year ago

@miqcrom26 - What code raisse error? df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0) ? What is your pandas version?

miqcrom26 Over a year ago

Yes, I believe that's the part that's giving me problems. I currently using pandas version 1.3.5

jezrael Over a year ago

@miqcrom26 - Can you try df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1).replace({ 0 : np.nan }), level=0) instaed df11 = df.xs('sum', level=1, drop_level=False).div(df.xs('count_nonzero', level=1), level=0) ?

miqcrom26 Over a year ago

Perfect! One last thing, I'm trying to round off the 'Average' value to 1 decimal place using df = df.round(1) on the entire dataframe, but it's not working. Do you know what I'm doing wrong?

|

Collectives™ on Stack Overflow

How to add a list of unique values from a column to a dataframe?

1 Answer 1

10 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Related