
I have a DataFrame with 3 columns. Each column contains yes, no, or NaN. I am trying to find the frequency of each column's values grouped by column a, which I was able to do with describe().

import pandas as pd, numpy as np

df2 = pd.DataFrame({'a':['yes','yes','no','yes','no','yes'],
                        'b':['no','yes','no','yes','no','no'],
                        'c':['yes','yes','yes','no','no', np.nan]})

df2.groupby('a').describe().transpose()

a    no                   yes                 
  count unique top freq count unique  top freq
b     2      1  no    2     4      2   no    2
c     2      2  no    1     3      2  yes    2

I am having trouble selecting the describe() columns I want. Below is an example of how I would like it to look. The freq/total_count column is the freq divided by the total count of the row. For example, for b and no it is 2/6.

a    no                                      yes                
  count top freq freq/total_count   count top freq freq/total_count
b     2  no    2     33%             4    no    2     33% 
c     2  no    1     20%             3   yes    2     40%

Please let me know if more information is needed.

2 Comments
  • Sorry, but why aren't the expected values 50% 50% and 0.333 0.666? The first row total is 2+2=4 and the last row is 1+2=3. Commented Feb 16, 2016 at 15:50
  • I want to divide by 2+4=6 and 2+3=5 because the denominator should be the total number of observations in the row. Commented Feb 16, 2016 at 15:59

1 Answer


You're on the right track. The df2.groupby('a').describe().transpose() command gives a DataFrame with MultiIndex columns. To select or manipulate individual pieces of the DataFrame, you first select the 'no' or 'yes' top-level key, then the column key beneath it.
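For instance (a small illustration of the selection syntax only; data here refers to the transposed describe() result built below):

data['no']['freq']             # chained selection: top-level key, then stat
data['no', 'freq']             # equivalent tuple selection on the MultiIndex columns
data.loc['b', ('no', 'freq')]  # a single cell: row 'b', column ('no', 'freq')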

import pandas as pd, numpy as np

df2 = pd.DataFrame({'a':['yes','yes','no','yes','no','yes'],
                    'b':['no','yes','no','yes','no','no'],
                    'c':['yes','yes','yes','no','no', np.nan]})

data = df2.groupby('a').describe().transpose()

# Add empty percentage columns under each top-level key
data['no', 'freq/total_count'] = np.nan
data['yes', 'freq/total_count'] = np.nan

# For each row, divide freq by the row's total number of observations
# (the 'no' count plus the 'yes' count); use .loc to avoid chained assignment
for ind in data.index:
    total = data.loc[ind, ('no', 'count')] + data.loc[ind, ('yes', 'count')]
    data.loc[ind, ('no', 'freq/total_count')] = data.loc[ind, ('no', 'freq')] / total * 100
    data.loc[ind, ('yes', 'freq/total_count')] = data.loc[ind, ('yes', 'freq')] / total * 100

# Format the percentages as strings
data['no', 'freq/total_count'] = data['no', 'freq/total_count'].map('{0:.0f}%'.format)
data['yes', 'freq/total_count'] = data['yes', 'freq/total_count'].map('{0:.0f}%'.format)

The output is

a   no                          yes                           no                 yes
    count  unique  top   freq   count   unique   top   freq   freq/total_count   freq/total_count
b   2      1       no    2      4       2        no    2      33%                33%
c   2      2       no    1      3       2        yes   2      20%                40%
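As an aside (a sketch, not part of the original answer), the per-row loop can be replaced with column-wise arithmetic, since the denominator is just the sum of the two count columns; this assumes data is the transposed describe() result from above:

# Per-row total observations = 'no' count + 'yes' count
totals = data['no', 'count'].astype(float) + data['yes', 'count'].astype(float)

for key in ['no', 'yes']:
    pct = data[key, 'freq'].astype(float) / totals * 100
    data[key, 'freq/total_count'] = pct.map('{0:.0f}%'.format)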

To pretty-print this, we want to remove the 'unique' columns and then regroup the 'no' section and the 'yes' section.

del data['no','unique']
del data['yes','unique']
pd.concat([data['no'],data['yes']],axis=1,keys=['no','yes'])

Giving the final output:

a   no                                     yes
    count  top   freq   freq/total_count   count   top   freq   freq/total_count
b   2      no    2      33%                4       no    2      33%
c   2      no    1      20%                3       yes   2      40%
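Note that pd.concat returns a new DataFrame rather than modifying data in place, so assign the result if you want to keep working with it; for example (pretty is just an illustrative name):

pretty = pd.concat([data['no'], data['yes']], axis=1, keys=['no', 'yes'])
print(pretty)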

2 Comments

Thanks, but how do you get rid of unique for all columns?
The command del data['no', 'unique'] will delete the 'unique' column in the 'no' section. Do the same for 'yes'. (A one-line alternative that drops 'unique' from every section is sketched below.)
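As an addition (not from the original comments), when there are many top-level groups, the inner 'unique' level can be dropped from all of them in a single call:

# Drop the 'unique' column under every top-level key ('no', 'yes', ...)
data = data.drop('unique', axis=1, level=1)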
