Count columns with multiple values

Question

I have this dataframe:

import pandas as pd

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"],
}, index=["1", "2", "3"])

It looks like this:

    col1 col2
1   Kev  Red; Purple
2   Kev  Yellow; Purple; Red
3   Fr   Red; Yellow

I want to count all the items in col2 according to col1. In this case the final df will be like this:

    col1 col2   count
1   Kev  Red    2
2   Kev  Purple 2
3   Kev  Yellow 1
4   Fr   Red    1
5   Fr   Yellow 1

I tried using explode:

df2 = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split(';'))
      .explode('col1')
      .explode('col2'))

but that only gives me col1 and col2 of my desired dataframe, not the count. If I use crosstab on df2, I'll get a very different result.

I managed to get the desired output with two nested for loops, but my dataframe is so big that the function takes almost a minute to run, so I want to avoid that solution.

Rodalm · Accepted Answer · 2022-05-29 19:58:16Z

2

According to your example, you just need to explode col2 after splitting the strings.

Here is a simpler way using DataFrame.value_counts:

import pandas as pd

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"],
}, index=["1", "2", "3"])

df2 = (
    df.assign(col2=df['col2'].str.split('; '))
      .explode('col2')
      .value_counts()
      .rename('count')
      .reset_index()
)

Output:

>>> df2 

  col1    col2  count
0  Kev  Purple      2
1  Kev     Red      2
2   Fr     Red      1
3   Fr  Yellow      1
4  Kev  Yellow      1
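
If the row order of the desired output in the question matters (Kev's colors first, in order of first appearance), one possible variant is to count with groupby and sort=False instead of value_counts. This is only a sketch under that assumption, not part of the original answer; the names exploded and ordered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"],
}, index=["1", "2", "3"])

# One row per (col1, single color) pair.
exploded = df.assign(col2=df["col2"].str.split("; ")).explode("col2")

# sort=False keeps the groups in order of first appearance
# instead of sorting them by key or by frequency.
ordered = (
    exploded.groupby(["col1", "col2"], sort=False)
            .size()
            .reset_index(name="count")
)
print(ordered)
```

The counts are the same as with value_counts; only the row order differs.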


3 Comments

Drakax Over a year ago

If I were picky, I would add one more call to your chain: ".sort_values(by='col1', ascending=False)" ;)

Rodalm Over a year ago

@Drakax value_counts already sorts by default in descending order ;)

Rodalm Over a year ago

Oh, never mind... it sorts by frequency! I see that you mean by col1, sorry. I don't know if the OP wants that, but I'm glad to add your suggestion if they do :)
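
The sorting discussed in these comments could look roughly like the sketch below. It assumes the goal is to group the rows by col1 (descending, so Kev comes before Fr) while keeping the higher counts first within each group; sorting on both columns is a choice made here for illustration, not something the commenters specified.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"],
}, index=["1", "2", "3"])

counts = (
    df.assign(col2=df["col2"].str.split("; "))
      .explode("col2")
      .value_counts()      # sorted by frequency, descending
      .rename("count")
      .reset_index()
)

# Sort by col1 first, then by count, both descending.
by_col1 = (
    counts.sort_values(by=["col1", "count"], ascending=[False, False])
          .reset_index(drop=True)
)
print(by_col1)
```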

Ynjxsjmh · Accepted Answer · 2022-05-29 19:34:05Z

1

After pd.crosstab, you can use melt to reshape the table into long format:

df2 = (df.explode(df.columns.tolist())
      .apply(lambda col: col.str.split('; ')) # <-- space here
      .explode('col1')
      .explode('col2'))


out = (pd.crosstab(df2['col1'], df2['col2'])
       .melt(value_name='count', ignore_index=False)
       .reset_index())

print(out)

  col1    col2  count
0   Fr  Purple      0
1  Kev  Purple      2
2   Fr     Red      1
3  Kev     Red      2
4   Fr  Yellow      1
5  Kev  Yellow      1
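
The space in the separator matters because splitting on ';' alone leaves a leading space on every item after the first, so ' Red' and 'Red' are counted as different values. A small sketch of the effect, and of stripping whitespace as one possible alternative to splitting on '; ' (the names naive and clean are illustrative):

```python
import pandas as pd

col2 = pd.Series(["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"])

# Splitting on ';' only: items after the first keep their leading space.
naive = col2.str.split(";").explode()

# Stripping whitespace afterwards normalizes the items.
clean = naive.str.strip()

print(naive.nunique())  # distinct values, with and without leading spaces
print(clean.nunique())  # distinct colors only
```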


1 Comment

Drakax Over a year ago

Be aware that your solution adds a new row "Fr:Purple:0" :)
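
If the zero-count combinations are not wanted, one way to drop them is to filter the melted result on count. This is a minimal sketch assuming the same sample data; the names long_df and nonzero are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["Kev", "Kev", "Fr"],
    "col2": ["Red; Purple", "Yellow; Purple; Red", "Red; Yellow"],
}, index=["1", "2", "3"])

# One row per (col1, single color) pair.
long_df = df.assign(col2=df["col2"].str.split("; ")).explode("col2")

out = (
    pd.crosstab(long_df["col1"], long_df["col2"])
      .melt(value_name="count", ignore_index=False)
      .reset_index()
)

# crosstab fills every col1/col2 combination, so keep only observed pairs.
nonzero = out[out["count"] > 0].reset_index(drop=True)
print(nonzero)
```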
