
I have a list:

x = ['hi', 'hello', '-', '01.01.9999']

I also have a DataFrame with many columns. I want to loop over all columns and count how often each value from my list occurs in each column.

As a result, I want something like this:

column_1, 'hi', 23
column_1, 'hello', 3
column_1, '-', 5
column_1, '01.01.9999', 0
...
column_n, 'hi', 0
column_n, 'hello', 35
column_n, '-', 15
column_n, '01.01.9999', 54

I already have this:

user_selected_features['dummy_key_words'] = ['hi', 'hello', '-', '01.01.9999']
for x in user_selected_features['dummy_key_words']:
    for column in _tmp_df:

I tried a lot of things in the loop, but nothing seems to return the correct result.

count = _tmp_df[_tmp_df[column] == x].count()
count = _tmp_df[column].str.count(x)
count = [_tmp_df[column] == x].count

How can I count the occurrence of a custom value per column in a DataFrame?

  • What does the original DataFrame look like? Commented Aug 19, 2021 at 12:37
  • Can you include samples of user_selected_features and _tmp_df? Commented Aug 19, 2021 at 12:39
  • Just a "standard" DataFrame with headers, columns, and different data in it. Commented Aug 19, 2021 at 12:40
  • Thank you all. @AnuragDabas's answer looked like the easiest and most straightforward solution to me. Commented Aug 19, 2021 at 13:35
  • @AnuragDabas no offense taken ;) Yes, I have tested on x = ['hi', 'hello', '-', '01.01.9999'];import string;np.random.seed(0);df = pd.DataFrame(np.random.choice(x+list(string.ascii_letters), size=100000).reshape(-1, 500),columns=[chr(i) for i in range(500)]). But as I said, it's possible that one answer is better in one case and not in another. That was the only point of my comment. Your answer is perfectly fine! Commented Aug 19, 2021 at 18:23

8 Answers

2

Another option is to combine pd.concat() with a list comprehension:

out=pd.concat([df.loc[df[y].isin(x),y].value_counts() for y in df],axis=1)

Or, without passing the axis parameter to concat(), which stacks the per-column counts into one long result:

out=pd.concat([df.loc[df[y].isin(x),y].value_counts() for y in df]).reset_index()
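
The stacked variant above does not record which column each count came from. A minimal sketch of one way to keep that label, assuming the keys argument of pd.concat() is acceptable here; the sample frame and the names per_column, "column", "word", and "count" are illustrative and not part of the original answer:

```python
import pandas as pd

# Hypothetical sample data; the names df and x mirror the answer above.
df = pd.DataFrame({
    "column_1": ["hi", "hello", "hello", "bye"],
    "column_2": ["hi", "hi", "-", "-"],
})
x = ["hi", "hello", "-", "01.01.9999"]

# One filtered value_counts() Series per column.
per_column = [df.loc[df[col].isin(x), col].value_counts() for col in df]

# keys= adds an outer index level holding the column label,
# so the label survives when the Series are stacked on top of each other.
out = pd.concat(per_column, keys=df.columns)
out.index.names = ["column", "word"]
out = out.reset_index(name="count")

print(out)
```

Keywords that never occur in a column are simply absent from this result rather than listed with a count of 0.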

2 Comments

Works like a charm, with one exception: it doesn't return values if the keyword is a number and the pandas column is "int64". Any idea? Thank you for your help! Edit: never mind, it works with applymap(str).
@JohnDole I tested it, and it works when you have an int64 column and an integer value in x. Could you please recheck? :)
1

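I do not have a DataFrame as an example, but you can try this (it assumes a long-format frame with a 'text' column holding the words and a numeric 'value' column to sum):

# Keep only the rows whose text is one of the keywords, then total 'value' per keyword
(df[df['text'].isin(x)]
    .groupby('text', as_index=False)['value']
    .sum()
    .sort_values('value', ascending=False))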
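
The snippet above presupposes a frame that is already in long format. A minimal sketch of how the wide frame from the question could be reshaped first, assuming DataFrame.melt() is an acceptable substitute for the 'text'/'value' layout; the sample data and the names long_df, full_index, "column", "word", and "count" are illustrative:

```python
import pandas as pd

# Hypothetical sample data standing in for the question's DataFrame.
df = pd.DataFrame({
    "column_1": ["hi", "hello", "hello", "bye"],
    "column_2": ["hi", "hi", "-", "-"],
})
x = ["hi", "hello", "-", "01.01.9999"]

# Reshape to long format: one row per (column, cell value) pair.
long_df = df.melt(var_name="column", value_name="word")

# Keep only the keywords, then count rows per (column, word) pair.
counts = (
    long_df[long_df["word"].isin(x)]
    .groupby(["column", "word"])
    .size()
)

# Reindex over every (column, keyword) combination so absent pairs show up as 0.
full_index = pd.MultiIndex.from_product([df.columns, x], names=["column", "word"])
result = counts.reindex(full_index, fill_value=0).reset_index(name="count")

print(result)
```

Here size() counts rows instead of summing a 'value' column, because the wide frame has no precomputed counts to add up.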


1

A very simple suggestion, given a df:

import pandas as pd
data = pd.DataFrame({'Col_1':['hi','hello'],'Col_2':['-','not_imp']})
keywords_check=['hi','hello','-']

   Col_1    Col_2
0     hi        -
1  hello  not_imp

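You can loop over the columns and use value_counts, naming each result after its column:

list_values=[]
for col in data.columns:
    col_count = data[col].value_counts().rename(col)
    list_values.append(col_count)

And then combine the results side by side, transpose, and select the keywords:

pd.concat(list_values, axis=1).T[keywords_check]

This returns one column per keyword and one row per DataFrame column: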

        hi  hello    -
Col_1  1.0    1.0  NaN
Col_2  NaN    NaN  1.0
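
Selecting with [keywords_check] raises a KeyError when a keyword occurs in no column at all, and the NaN entries force float counts. A minimal sketch of one way around both, assuming reindex() plus fillna(0) fits the intended output; the extra keyword "01.01.9999" is added here only to show a keyword that never occurs:

```python
import pandas as pd

data = pd.DataFrame({"Col_1": ["hi", "hello"], "Col_2": ["-", "not_imp"]})
# "01.01.9999" is added here as a keyword that never occurs in data.
keywords_check = ["hi", "hello", "-", "01.01.9999"]

# One value_counts() Series per column, named after the column.
list_values = [data[col].value_counts().rename(col) for col in data.columns]

# reindex() tolerates labels that are absent, unlike [...] selection;
# fillna(0) then turns the missing combinations into zero counts.
counts = (
    pd.concat(list_values, axis=1)
    .T
    .reindex(columns=keywords_check)
    .fillna(0)
    .astype(int)
)

print(counts)
```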


1

Try:

# Sample
>>> df

    A           B      C
0  hi       hello   word
1  in  01.01.9999  maybe

# Create a multiindex to have all possible combinations at the end
mi = pd.MultiIndex.from_product([df.columns, x], names=['column', 'word'])

# Output
>>> df.apply(lambda w: w[w.isin(x)].value_counts()) \
      .rename_axis(index='word', columns='column') \
      .unstack().rename('count').dropna().astype(int) \
      .reindex(mi, fill_value=0).reset_index()

   column        word  count
0       A          hi      1
1       A       hello      0
2       A           -      0
3       A  01.01.9999      0
4       B          hi      0
5       B       hello      1
6       B           -      0
7       B  01.01.9999      1
8       C          hi      0
9       C       hello      0
10      C           -      0
11      C  01.01.9999      0


1

You can try with apply and value_counts to get the counts. Then use stack() and swaplevel() to match your required output format.

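Code:

# pd.Series.value_counts is used because the top-level pd.value_counts is deprecated in recent pandas releases
counter = df.apply(pd.Series.value_counts).reindex(x).fillna(0)
output = counter.astype(int).stack().swaplevel()

Example:

import pandas as pd

df = pd.DataFrame({"column_1": ["hi", "hello", "hello", "bye", "nothing", "01.01.9999"],
                   "column_2": ["hi", "hi", "hi", "-", "-", "nothing"],
                   "column_3": ["hi", "hi", "hello", "-", "-", "nothing"]
                   })
x = ['hi', 'hello', '-', '01.01.9999']
# Count every value per column, keep only the keyword rows, and fill missing counts with 0
counter = df.apply(pd.Series.value_counts).reindex(x).fillna(0)
# Stack into a long Series and put the column label first
output = counter.astype(int).stack().swaplevel()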

>>> output
column_1  hi            1
column_2  hi            3
column_3  hi            2
column_1  hello         2
column_2  hello         0
column_3  hello         1
column_1  -             0
column_2  -             2
column_3  -             2
column_1  01.01.9999    1
column_2  01.01.9999    0
column_3  01.01.9999    0
dtype: int32
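
The output above is grouped by keyword, while the question lists all keywords for column_1 first, then column_2, and so on. A minimal sketch of one way to get that ordering as a three-column table, assuming a transpose before stack() is acceptable in place of swaplevel(); the names table, "column", "word", and "count" are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "column_1": ["hi", "hello", "hello", "bye", "nothing", "01.01.9999"],
    "column_2": ["hi", "hi", "hi", "-", "-", "nothing"],
    "column_3": ["hi", "hi", "hello", "-", "-", "nothing"],
})
x = ["hi", "hello", "-", "01.01.9999"]

counter = df.apply(pd.Series.value_counts).reindex(x).fillna(0).astype(int)

# Transposing before stack() puts the column label in the outer index level,
# so rows come out grouped by column rather than by keyword.
table = (
    counter.T.stack()
    .rename_axis(["column", "word"])
    .reset_index(name="count")
)

print(table)
```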


1

You can compute the value counts for a single column as follows:

df['col1'].value_counts()

To count the values for all columns, you can do the following:

df.apply(pd.Series.value_counts).fillna(0)

This gives you a DataFrame whose index holds the distinct values, whose columns match the original column names, and whose cells hold the number of occurrences of each value in that column.

You can get the counts per column for specific values by selecting only those rows from the resulting dataframe.

As an example:

df = pd.DataFrame(
    {
        "col1": ["a", "b", 1, "a"],
        "col2": ["a", "a", "c", "c"],
        "col3": ["a", 1, 1, "d"],
    }
)

counts = df.apply(pd.Series.value_counts).fillna(0)
counts.loc[["a", 1]]

Will give:

    col1    col2    col3
"a" 2.0     2.0     1.0
1   1.0     0.0     2.0
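
Counts can appear to be missing when the keywords are strings but a column holds integers, because the integer 1 and the string "1" are different labels. A minimal sketch of one workaround, assuming it is acceptable to compare everything as text; it uses astype(str) rather than the applymap(str) mentioned in the comments on another answer, and the sample frame is illustrative:

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
df = pd.DataFrame({"num": [1, 2, 1, 3], "txt": ["1", "hi", "hi", "2"]})
x = ["1", "hi"]  # keywords given as strings

# Without casting, the integer 1 and the string "1" are different index labels.
# Casting every cell to str first makes both columns comparable to the keywords.
counts = (
    df.astype(str)
    .apply(pd.Series.value_counts)
    .reindex(x)      # keep only the keyword rows, in keyword order
    .fillna(0)       # keywords absent from a column get a count of 0
    .astype(int)
)

print(counts)
```

Using reindex(x) instead of .loc[x] also avoids a KeyError for keywords that occur nowhere in the frame.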

4 Comments

I am looking for a value counts for specific words, not for the values in the DataFrame itself
@JohnDole Does the example I added solve your problem?
Hi @Swier, I tried your code. It looks like it works, with 2 exceptions: The NA are not filled and it looks like it only works for STRINGS, not for numbers. Do you have an idea? Appreciate your help, thank you very much!
@JohnDole It should work with numbers as well (see updated example), and I can't tell why it doesn't fill in the NaNs without your code.
1

I am surprised no one proposed a simple answer using stack/unstack:

x = ['hi', 'hello', '-', '01.01.9999']
(df.stack()
   .groupby(level=1).value_counts()
   .unstack(level=0, fill_value=0).loc[x]
)

output:

            column_1  column_2  column_3
hi                 1         3         2
hello              2         0         1
-                  0         2         2
01.01.9999         1         0         0

input:

     column_1 column_2 column_3
0          hi       hi       hi
1       hello       hi       hi
2       hello       hi    hello
3         bye        -        -
4     nothing        -        -
5  01.01.9999  nothing  nothing

To keep the result in long format instead:

(df.stack()
   .groupby(level=1).value_counts()
   .loc(axis=0)[pd.IndexSlice[:, x]]
)

output:

column_1  hi            1
column_2  hi            3
column_3  hi            2
column_1  hello         2
column_3  hello         1
column_2  -             2
column_3  -             2
column_1  01.01.9999    1

Comments

1

Try:

import pandas as pd
import numpy as np

df = pd.DataFrame({'strings': ['hi', 'hello', '-', '01.01.9999', 'hi', np.nan, '01.01.9999', 'hi'],
                   'stringsToo': ['hi', np.nan, '-', '01.01.9999', 'hello', '-', '01.01.9999', 'hi']})

x = ['hi', 'hello', '-', '01.01.9999']

ss = []

for col in df.columns:
    # One indicator column per distinct value; keep only the keywords and sum them
    s = df[col].str.get_dummies().reindex(columns=x).sum()
    s = s.rename(col)
    ss.append(s)

df_counts = pd.concat(ss, axis=1, keys=[s.name for s in ss])


print(df, '\n')
print(df_counts)

      strings  stringsToo
0          hi          hi
1       hello         NaN
2           -           -
3  01.01.9999  01.01.9999
4          hi       hello
5         NaN           -
6  01.01.9999  01.01.9999
7          hi          hi 

            strings  stringsToo
hi                3           2
hello             1           1
-                 1           2
01.01.9999        2           2
