
I have a Pandas DataFrame that has a row for each user. Each user took a survey that captured scores from 0 to 5 for a series of survey questions. It looks something like this:

import pandas as pd

df1 = pd.DataFrame({'User': ['user_a', 'user_b', 'user_c'], 'Cat1_score': [0, 5, 3], 'Cat2_score': [0, 2, 5], 'Cat3_score': [4, 5, 1]})

I want to count across each row, not down each column, and I just can't wrap my head around which method(s) to call.

If I use:

df1.count(axis='columns')

That just tells me the number of cells in each row that are non-null.
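For what it's worth, on the df1 above every cell is non-null (the User string included), so that call just returns 4 for each row rather than any per-score frequency:

print(df1.count(axis='columns'))
# 0    4
# 1    4
# 2    4
# dtype: int64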

The following is closer to what I want, but I have 100 different columns to evaluate for each row, and I don't want to have to type each one out manually.

df1.value_counts('column_name')

What I would really like is to end up with a data frame that looks something like this:

df2 = pd.DataFrame({'User': ['user_a', 'user_b', 'user_c'], 'zero': [2, 0, 0], 'one': [0, 0, 1], 'two': [0, 1, 0], 'three': [0, 0, 1], 'four': [1, 0, 0], 'five': [0, 2, 1]})

I want to count the frequency of how many of each user's responses = 0, = 1, = 5, etc. This might be a case of Friday-afternoon-at-work-lack-of-creative-thinking-brain if the answer is obvious.

UPDATE: The suggested answer found in this thread doesn't produce the best output for my needs. The code below produces a very clean data frame that I can use to join with other user tables I have and then save the resulting table to Excel.
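For context, here is a minimal sketch of that join-and-export step (the other_users table, its Team column, and the file name are all hypothetical, and out is the per-user counts frame produced by the answers below):

other_users = pd.DataFrame({'User': ['user_a', 'user_b', 'user_c'], 'Team': ['red', 'blue', 'red']})  # hypothetical user table

merged = other_users.merge(out, on='User', how='left')  # 'out' from an answer below
merged.to_excel('survey_score_counts.xlsx', index=False)  # hypothetical file name; requires an Excel writer such as openpyxl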


2 Answers


Using a crosstab:

# long format: one row per (User, question) pair; the values are the scores
s = df1.set_index('User').stack()

# cross-tabulate users against score values to get per-user frequencies
out = (pd.crosstab(s.index.get_level_values('User'), s)
         .rename_axis(index='User', columns=None).reset_index()
      )

Variant:

# same idea with melt: unpivot the score columns into a long 'variable'/'value' format
tmp = df1.melt('User')

out = (pd.crosstab(tmp['User'], tmp['value'])
         .rename_axis(columns=None).reset_index()
      )

Output:

     User  0  1  2  3  4  5
0  user_a  2  0  0  0  1  0
1  user_b  0  0  1  0  0  2
2  user_c  0  1  0  1  0  1
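If you prefer the word column names from the desired df2 ('zero' through 'five'), one small follow-up, assuming the scores are exactly 0-5, is to rename the numeric columns afterwards:

names = {0: 'zero', 1: 'one', 2: 'two', 3: 'three', 4: 'four', 5: 'five'}
out = out.rename(columns=names)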

1 Comment

When using the melt + crosstab approach, they offered out = pd.crosstab(tmp.User, tmp.value.fillna('NaN')) elsewhere, which keeps missing answers as their own 'NaN' column instead of silently dropping them.

You can do this in one line using Series.value_counts per row via DataFrame.apply:

out = df1.set_index('User').apply(lambda x: x.value_counts(), axis=1).fillna(0).astype(int)

print(out)
#        0  1  2  3  4  5
#User                    
#user_a  2  0  0  0  1  0
#user_b  0  0  1  0  0  2
#user_c  0  1  0  1  0  1
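One caveat: with value_counts a score column only appears if at least one user actually gave that score. If you need all six columns 0-5 regardless, a small variant, assuming scores always fall in that range, is to reindex inside the lambda:

# guarantee a column for every possible score 0-5, even if nobody used one
out = df1.set_index('User').apply(lambda x: x.value_counts().reindex(range(6), fill_value=0), axis=1)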

Alternatively using DataFrame.melt with DataFrame.pivot_table:

out = df1.melt('User').pivot_table(index='User', columns='value', aggfunc='size', fill_value=0)

print(out)
#value   0  1  2  3  4  5
#User                    
#user_a  2  0  0  0  1  0
#user_b  0  0  1  0  0  2
#user_c  0  1  0  1  0  1
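A note on the design choice: aggfunc='size' simply counts how many melted rows land in each (User, value) cell, which is exactly the per-user frequency the question asks for, and fill_value=0 fills the combinations that never occur.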

This answer is inspired by @jezrael; Ref.

~~Based on [this investigation](https://stackoverflow.com/a/76002061/10452700), it seems value_counts() is the better way to go in terms of efficiency!~~

2 Comments

Using apply on axis=1 is inefficient. crosstab is the equivalent of value_counts when you have groups. The scenario for the timings you referenced is different: it has no groups, and in that case value_counts is indeed the most efficient.
Oh, thanks, I was not aware of the with/without-groups scenario. I edited my answer. Thanks for this input.
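If you want to verify the grouped case yourself, a minimal benchmark sketch, with a made-up data shape and made-up user/column names, could look like this:

import numpy as np
import pandas as pd
from timeit import timeit

# hypothetical wide survey: 10,000 users x 100 score columns
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(0, 6, size=(10_000, 100)), columns=[f'Cat{i}_score' for i in range(100)])
big.insert(0, 'User', [f'user_{i}' for i in range(10_000)])

def with_crosstab(df):
    # stack to long format, then cross-tabulate users against scores
    s = df.set_index('User').stack()
    return pd.crosstab(s.index.get_level_values('User'), s)

def with_apply(df):
    # row-wise value_counts via apply, as in this answer
    return df.set_index('User').apply(lambda x: x.value_counts(), axis=1).fillna(0).astype(int)

print('crosstab:', timeit(lambda: with_crosstab(big), number=3))
print('apply:   ', timeit(lambda: with_apply(big), number=3))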
