Count custom values from a list in a Python DataFrame per column

Question

I have a list:

x = ['hi', 'hello', '-', '01.01.9999']

I also have a DataFrame with a lot of columns. I want to loop over all columns and count how often each value from my list occurs in each column.

As a result, I want something like this:

column_1, 'hi', 23
column_1, 'hello', 3
column_1, '-', 5
column_1, '01.01.9999', 0
...
column_n, 'hi', 0
column_n, 'hello', 35
column_n, '-', 15
column_n, '01.01.9999', 54

I already have this:

user_selected_features['dummy_key_words'] = ['hi', 'hello', '-', '01.01.9999']

for x in user_selected_features['dummy_key_words']:
    for column in _tmp_df:

I tried a lot of things in the loop, but nothing seems to return the correct result.

count = _tmp_df[_tmp_df[column] == x].count()
count = _tmp_df[column].str.count(x)
count = [_tmp_df[column] == x].count

How can I count the occurrence of a custom value per column in a DataFrame?

Can you include samples of user_selected_features and _tmp_df?
– not_speshal, Commented Aug 19, 2021 at 12:39
Just a "standard" dataframe with headers, columns and different data in it
– JohnDole, Commented Aug 19, 2021 at 12:40
Thank you all, @AnuragDabas's answer looked like the easiest and most straightforward solution to me
– JohnDole, Commented Aug 19, 2021 at 13:35
@AnuragDabas no offense taken ;) Yes I have tested on x = ['hi', 'hello', '-', '01.01.9999'];import string;np.random.seed(0);df = pd.DataFrame(np.random.choice(x+list(string.ascii_letters), size=100000).reshape(-1, 500),columns=[chr(i) for i in range(500)]). But as I said, it's possible one answer is better in one case and not in another one. This was the only point of my comment. Your answer is perfectly fine!
– mozway, Commented Aug 19, 2021 at 18:23

Anurag Dabas · Accepted Answer · 2021-08-19 13:12:51Z

2

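Another approach you can try, using concat() with a list comprehension:

out = pd.concat([df.loc[df[y].isin(x), y].value_counts() for y in df], axis=1)

Or, without passing the axis parameter to concat(), which stacks the counts in a long format (note that this stacked result does not record which column each count came from):

out = pd.concat([df.loc[df[y].isin(x), y].value_counts() for y in df]).reset_index()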
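If the long format should also say which column each count belongs to, one option is to pass the per-column Series to concat() as a dict, so that the column names become an outer index level. This is only a sketch under assumptions: the sample data and the output names `column`, `word` and `count` are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
df = pd.DataFrame({
    "column_1": ["hi", "hello", "hello", "bye"],
    "column_2": ["hi", "-", "-", "nothing"],
})
x = ["hi", "hello", "-", "01.01.9999"]

# One filtered value_counts() Series per column; the dict keys become
# the outer level of the resulting MultiIndex
counts = pd.concat(
    {col: df.loc[df[col].isin(x), col].value_counts() for col in df}
)

# Name the two index levels and the values, then flatten to a regular table
out = counts.rename_axis(["column", "word"]).rename("count").reset_index()
print(out)
```

Keywords that never occur in a column are simply missing from this result rather than listed with a count of 0.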

edited Aug 19, 2021 at 13:12

answered Aug 19, 2021 at 13:07

Anurag Dabas


2 Comments

JohnDole Over a year ago

Works like a charm, with one exception. It doesn't return values if the keyword is a number and the pandas column is "int64". Any idea..? Thank you for your help! edit: never mind, it works with applymap(str)

Anurag Dabas Over a year ago

@JohnDole sir I tested it and it is working if you have int64 column and an integer value in x...sir can you pls recheck :)

tlentali · Answer · 2021-08-19 12:43:19Z

1

I do not have your DataFrame as an example, so this assumes a column 'text' holding the words and a numeric column 'value' (both are placeholder names), but you can try this:

(df[df['text'].isin(x)]
    .groupby('text', as_index=False)['value']
    .sum()
    .sort_values('value', ascending=False))

answered Aug 19, 2021 at 12:43

tlentali


Alejandro A · Answer · 2021-08-19 12:44:56Z

1

A very simple suggestion, given a df:

import pandas as pd
data = pd.DataFrame({'Col_1':['hi','hello'],'Col_2':['-','not_imp']})
keywords_check=['hi','hello','-']

   Col_1    Col_2
0     hi        -
1  hello  not_imp

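You can loop over the columns and use value_counts:

list_values=[]
for col in data.columns:
    # name each frame after its column so the result is the same across pandas versions
    col_count = data[col].value_counts().to_frame(name=col)
    list_values.append(col_count)

And then:

# axis=1 aligns the frames on the words, so a word found in several columns gets a single entry
pd.concat(list_values, axis=1).T[keywords_check]

This returns one column per keyword and one row per DataFrame column: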

        hi  hello    -
Col_1  1.0    1.0  NaN
Col_2  NaN    NaN  1.0
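Selecting with `[keywords_check]` raises a KeyError as soon as one keyword occurs in no column at all, and missing combinations show up as NaN. A possible variant, sketched here under the assumption that zeros are preferred over NaN, uses reindex() instead; the extra keyword '01.01.9999' is added only to demonstrate a keyword that never occurs.

```python
import pandas as pd

data = pd.DataFrame({"Col_1": ["hi", "hello"], "Col_2": ["-", "not_imp"]})
# '01.01.9999' is added for illustration: it does not occur in the data
keywords_check = ["hi", "hello", "-", "01.01.9999"]

# One single-column frame of counts per DataFrame column
list_values = [data[col].value_counts().to_frame(name=col) for col in data.columns]

result = (
    pd.concat(list_values, axis=1)      # words as index, original columns as columns
    .T                                  # original columns as rows
    .reindex(columns=keywords_check)    # absent keywords become all-NaN columns
    .fillna(0)
    .astype(int)
)
print(result)
```

reindex() also fixes the column order to the order of the keyword list.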

answered Aug 19, 2021 at 12:44

Alejandro A


Corralien · Answer · 2021-08-19 13:11:12Z

Try:

# Sample
>>> df

    A           B      C
0  hi       hello   word
1  in  01.01.9999  maybe

# Create a multiindex to have all possible combinations at the end
mi = pd.MultiIndex.from_product([df.columns, x], names=['column', 'word'])

# Output
>>> df.apply(lambda w: w[w.isin(x)].value_counts()) \
      .rename_axis(index='word', columns='column') \
      .unstack().rename('count').dropna().astype(int) \
      .reindex(mi, fill_value=0).reset_index()

   column        word  count
0       A          hi      1
1       A       hello      0
2       A           -      0
3       A  01.01.9999      0
4       B          hi      0
5       B       hello      1
6       B           -      0
7       B  01.01.9999      1
8       C          hi      0
9       C       hello      0
10      C           -      0
11      C  01.01.9999      0

not_speshal · Answer · 2021-08-19 13:11:59Z

You can try with apply and value_counts to get the counts. Then use stack() and swaplevel() to match your required output format.

Code:

# pd.Series.value_counts is used because the top-level pd.value_counts is deprecated in recent pandas versions
counter = df.apply(pd.Series.value_counts).reindex(x).fillna(0)
output = counter.astype(int).stack().swaplevel()

Example:

df = pd.DataFrame({"column_1": ["hi", "hello", "hello", "bye", "nothing", "01.01.9999"],
                   "column_2": ["hi", "hi", "hi", "-", "-", "nothing"],
                   "column_3": ["hi", "hi", "hello", "-", "-", "nothing"]
                   })
x = ['hi', 'hello', '-', '01.01.9999']
counter = df.apply(pd.Series.value_counts).reindex(x).fillna(0)
output = counter.astype(int).stack().swaplevel()

>>> output
column_1  hi            1
column_2  hi            3
column_3  hi            2
column_1  hello         2
column_2  hello         0
column_3  hello         1
column_1  -             0
column_2  -             2
column_3  -             2
column_1  01.01.9999    1
column_2  01.01.9999    0
column_3  01.01.9999    0
dtype: int32

Swier · Answer · 2021-08-19 13:14:24Z

1

You can compute the value counts for a single column as follows:

df['col1'].value_counts()

To count the values for all columns, you can do the following:

df.apply(pd.Series.value_counts).fillna(0)

This will give you a DataFrame indexed by the distinct values, with the same column names as the original, where each cell holds the number of occurrences of that value in that column of the original DataFrame.

You can get the counts per column for specific values by selecting only those rows from the resulting dataframe.

As an example:

df = pd.DataFrame(
    {
        "col1": ["a", "b", 1, "a"],
        "col2": ["a", "a", "c", "c"],
        "col3": ["a", 1, 1, "d"],
    }
)

counts = df.apply(pd.Series.value_counts).fillna(0)
counts.loc[["a", 1]]

Will give:

    col1    col2    col3
"a" 2.0     2.0     1.0
1   1.0     0.0     2.0
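One situation where this lookup finds nothing is a type mismatch: the string "1" is not equal to the integer 1, so string keywords never match an int64 column. A minimal sketch of one workaround, assuming it is acceptable to compare everything as text, is to cast the whole DataFrame to str before counting. The data and the names `num`, `txt` and `keywords` are invented for this illustration.

```python
import pandas as pd

# Hypothetical data: one integer column and one string column
df = pd.DataFrame({"num": [1, 2, 1, 3], "txt": ["1", "a", "a", "b"]})
keywords = ["1", "a"]  # keywords given as strings

counts = (
    df.astype(str)                       # compare everything as text
      .apply(pd.Series.value_counts)     # counts per column, values as index
      .reindex(keywords)                 # keep only the keywords, in list order
      .fillna(0)
      .astype(int)
)
print(counts)
```

Be aware that the cast also turns missing values into the literal string "nan", which only matters if "nan" is one of the keywords.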

edited Aug 19, 2021 at 13:14

answered Aug 19, 2021 at 12:43

Swier


4 Comments

JohnDole Over a year ago

I am looking for a value counts for specific words, not for the values in the DataFrame itself

Swier Over a year ago

@JohnDole Does the example I added solve your problem?

JohnDole Over a year ago

Hi @Swier, I tried your code. It looks like it works, with 2 exceptions: The NA are not filled and it looks like it only works for STRINGS, not for numbers. Do you have an idea? Appreciate your help, thank you very much!

Swier Over a year ago

@JohnDole It should work with numbers as well (see updated example), and I can't tell why it doesn't fill in the NaNs without your code.

mozway · Answer · 2021-08-19 13:23:03Z

I am surprised no one proposed a simple answer using stack/unstack:

x = ['hi', 'hello', '-', '01.01.9999']
(df.stack()
   .groupby(level=1).value_counts()
   .unstack(level=0, fill_value=0).loc[x]
)

output:

            column_1  column_2  column_3
hi                 1         3         2
hello              2         0         1
-                  0         2         2
01.01.9999         1         0         0

input:

     column_1 column_2 column_3
0          hi       hi       hi
1       hello       hi       hi
2       hello       hi    hello
3         bye        -        -
4     nothing        -        -
5  01.01.9999  nothing  nothing

keep as long format:

(df.stack()
   .groupby(level=1).value_counts()
   .loc(axis=0)[pd.IndexSlice[:, x]]
)

output:

column_1  hi            1
column_2  hi            3
column_3  hi            2
column_1  hello         2
column_3  hello         1
column_2  -             2
column_3  -             2
column_1  01.01.9999    1

MDR · Answer · 2021-08-19 13:33:23Z

Try:

import pandas as pd
import numpy as np

df = pd.DataFrame({'strings': ['hi', 'hello', '-', '01.01.9999', 'hi', np.nan, '01.01.9999', 'hi'],
                   'stringsToo': ['hi', np.nan, '-', '01.01.9999', 'hello', '-', '01.01.9999', 'hi']})

x = ['hi', 'hello', '-', '01.01.9999']

ss = []

for col in df.columns:
    # one indicator column per distinct string, restricted to the keywords, then summed
    s = df[col].str.get_dummies().reindex(columns=x).sum()
    s = s.rename(col)
    ss.append(s)

df_counts = pd.concat(ss, axis=1, keys=[s.name for s in ss])


print(df, '\n')
print(df_counts)

      strings  stringsToo
0          hi          hi
1       hello         NaN
2           -           -
3  01.01.9999  01.01.9999
4          hi       hello
5         NaN           -
6  01.01.9999  01.01.9999
7          hi          hi 

            strings  stringsToo
hi                3           2
hello             1           1
-                 1           2
01.01.9999        2           2
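One caveat with str.get_dummies() is that it splits each cell on "|" by default, so a cell containing that character would be counted as several values. As a possible alternative, not part of the original answer, the same table can be sketched with value_counts() and reindex() per column, which treats each cell as a single value and skips NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "strings": ["hi", "hello", "-", "01.01.9999", "hi", np.nan, "01.01.9999", "hi"],
    "stringsToo": ["hi", np.nan, "-", "01.01.9999", "hello", "-", "01.01.9999", "hi"],
})
x = ["hi", "hello", "-", "01.01.9999"]

# Count every value per column, then keep only the keywords;
# keywords that do not occur get a count of 0
df_counts = pd.DataFrame(
    {col: df[col].value_counts().reindex(x, fill_value=0) for col in df.columns}
)
print(df_counts)
```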
