
I have a list of 'words' I want to count, below:

word_list = ['one','two','three']

And I have a column within a pandas DataFrame with the text below.

TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."

The desired output is below: I want to count the number of times each substring defined in word_list appears across the strings in the rows of the DataFrame.

Word  | Count
one   | 5
two   | 3
three | 2

Is there a way to do this in Python 2.7?

3 Answers


I would do this with vanilla Python: first join the strings:

In [11]: long_string = "".join(df[0]).lower()

In [12]: long_string[:50]  # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"

In [13]: for w in word_list:
     ...:     print(w, long_string.count(w))
     ...:
one 5
two 3
three 2

If you want to return a Series, you could use a dict comprehension:

In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one      5
three    2
two      3
dtype: int64

5 Comments

@SunnysinhSolanki ah yes, it depends on the column name of your strings; I guess it should be " ".join(df["TEXT"]).lower()
I think this line should be long_string = " ".join(df[0]).lower() (added a space in the join), because it might cause an issue if we get a list like ['text on','ending with']: joined without a separator it becomes 'text onending with', which would match 'one' when it should not.
@SunnysinhSolanki yes, good spot! I guess we could also join on "\x00" if we wanted to be clever (assuming you're not matching strings that contain null).
@cᴏʟᴅsᴘᴇᴇᴅ what is n? It depends on how many substrings you have (if few, this will be faster); I guess it's O(n*m) where n is the length of long_string and m is the number of substrings.
@cᴏʟᴅsᴘᴇᴇᴅ which is to say, if m is large yours will be faster (but if m is small this might beat you).
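
To illustrate the separator pitfall raised in the comments, a minimal sketch (the two rows here are the hypothetical example from the comment thread, not from the question's data):

```python
# Joining rows with no separator glues the end of one row to the start
# of the next, which can create spurious substring matches.
rows = ['text on', 'ending with']

glued = "".join(rows).lower()   # 'text onending with'
safe = " ".join(rows).lower()   # 'text on ending with'

print(glued.count('one'))  # 1 -- false positive across the row boundary
print(safe.count('one'))   # 0 -- the separator prevents the cross-row match
```

Any single-character separator that cannot appear inside a match (a space, or "\x00" as suggested above) fixes this.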

Use str.extractall + value_counts:

df

                                         text
0         "Perhaps she'll be the one for me."
1                         "Is it two or one?"
2           "Mayhaps it be three afterall..."
3             "Three times and it's a charm."
4  "One fish, two fish, red fish, blue fish."
5          "There's only one cat in the hat."
6     "One does not simply code into pandas."
7                       "Two nights later..."
8             "Quoth the Raven... nevermore."

rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()

one      5
two      3
three    2
Name: 0, dtype: int64

Details

rgx
'(one|two|three)'

df.text.str.lower().str.extractall(rgx).iloc[:, 0]

   match
0  0          one
1  0          two
   1          one
2  0        three
3  0        three
4  0          one
   1          two
5  0          one
6  0          one
7  0          two
Name: 0, dtype: object
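
Note that the alternation pattern above matches substrings anywhere, so 'one' would also match inside a word like 'money'. If you want whole-word matches only, a hedged variant of the same extractall approach with \b word boundaries (the two-row frame here is hypothetical, built to show the trap):

```python
import pandas as pd

word_list = ['one', 'two', 'three']
# Hypothetical frame: 'money' contains 'one' as a substring.
df = pd.DataFrame({'text': ["No one has money", "two or three"]})

# \b anchors keep 'one' from matching inside 'money'.
rgx = r'\b({})\b'.format('|'.join(word_list))
counts = df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
print(counts)  # one, two, three each counted once; 'money' is not matched
```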

Performance

Small

# Zero's code 
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
     long_string.count(w)

10000 loops, best of 3: 132 µs per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop

Large

df = pd.concat([df] * 100000)
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)

10 loops, best of 3: 151 ms per loop
%%timeit 
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop

5 Comments

I am intrigued by which of these will be the most performant; I claim it's either this one or mine. It's unclear how much indirection Python adds here (I don't think the DataFrame helps any). Is the .iloc[:, 0] because extractall returns a DataFrame rather than a Series?
@AndyHayden Yes. extractall returns a DataFrame with a MultiIndex and a single column, which is why I grab the first column and call value_counts. I'll add some timings for this data :-)
@AndyHayden The winner is your answer by miles. Wow!
@AndyHayden Though I haven't increased the number of words; that'd make a huge difference, I'd assume.
yup! Make it 50-odd substrings and it's not looking so good; basically the playoff is O(n * m). My answer relies on Python string manipulation (which is very fast, if we only do it a few times) with no indirection, basically just C. I suspect Zero's answer will also be hit by m similarly to mine.

Use

In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one      5
three    2
two      3
dtype: int64

Or, to count multiple occurrences within each row:

In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one      5
three    2
two      3
dtype: int64
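
The difference between the two only shows up when a word appears more than once in a single row; none of the question's rows repeat a word, which is why both give the same totals. A minimal sketch with a hypothetical repeated-word row:

```python
import re
import pandas as pd

word_list = ['one', 'two', 'three']
# Hypothetical row repeating 'one', to show where the two versions differ.
df = pd.DataFrame({'TEXT': ["One and one is two"]})

# contains: number of rows in which the word appears at least once
per_row = pd.Series({w: df.TEXT.str.contains(w, case=False).sum()
                     for w in word_list})

# count: total occurrences across all rows
per_hit = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum()
                     for w in word_list})

print(per_row['one'])  # 1 -- one row contains 'one'
print(per_hit['one'])  # 2 -- 'one' occurs twice in that row
```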

Use sort_values

In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})

In [56]: s.sort_values(ascending=False)
Out[56]:
one      5
two      3
three    2
dtype: int64

3 Comments

Excellent! This worked wonderfully. By chance, is there also a way to order the list from most frequent to least frequent?
@Leggerless Yes, you call sort_values.
Awesome. Thanks a lot for the quick response.
