
I have a list of 'words' I want to count, below:

word_list = ['one','two','three']

And I have a column within a pandas DataFrame with the text below.

TEXT
-----
"Perhaps she'll be the one for me."
"Is it two or one?"
"Mayhaps it be three afterall..."
"Three times and it's a charm."
"One fish, two fish, red fish, blue fish."
"There's only one cat in the hat."
"One does not simply code into pandas."
"Two nights later..."
"Quoth the Raven... nevermore."

The desired output is below: I want to count the number of times each substring defined in word_list appears across the strings in the rows of the DataFrame.

Word  | Count
one   | 5
two   | 3
three | 2

Is there a way to do this in Python 2.7?

3 Answers


I would do this with vanilla Python: first join the strings:

In [11]: long_string = "".join(df[0]).lower()

In [12]: long_string[:50]  # all the words glued up
Out[12]: "perhaps she'll be the one for me.is it two or one?"

In [13]: for w in word_list:
     ...:     print(w, long_string.count(w))
     ...:
one 5
two 3
three 2

If you want to return a Series, you could use a dict comprehension:

In [14]: pd.Series({w: long_string.count(w) for w in word_list})
Out[14]:
one      5
three    2
two      3
dtype: int64

5 Comments

@SunnysinhSolanki ah yes, it depends on the column name of your strings; I guess it should be " ".join(df["TEXT"]).lower()
I think this line should be long_string = " ".join(df[0]).lower() (added a space in the join), because it might cause an issue if we get a list like ['text on','ending with']: joined without a separator it becomes 'text onending with', which would match 'one' when it should not.
@SunnysinhSolanki yes, good spot! I guess we could also join on "\x00" if we wanted to be clever (assuming you're not matching strings that contain null).
@cᴏʟᴅsᴘᴇᴇᴅ what is n? It depends on how many substrings you have (if few, this will be faster); I guess it's O(n*m) where n is the length of long_string and m is the number of substrings.
@cᴏʟᴅsᴘᴇᴇᴅ which is to say, if m is large yours will be faster (but if m is small this might beat you).
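
To illustrate the separator pitfall raised in the comments, a minimal sketch (the two rows here are the hypothetical example from the comment thread, not from the question's data):

```python
# Joining rows with no separator glues the end of one row to the start
# of the next, which can create spurious substring matches.
rows = ['text on', 'ending with']

glued = "".join(rows).lower()   # 'text onending with'
safe = " ".join(rows).lower()   # 'text on ending with'

print(glued.count('one'))  # 1 -- false positive across the row boundary
print(safe.count('one'))   # 0 -- the separator prevents the cross-row match
```

Any single-character separator that cannot appear inside a match (a space, or "\x00" as suggested above) fixes this.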

Use str.extractall + value_counts:

df

                                         text
0         "Perhaps she'll be the one for me."
1                         "Is it two or one?"
2           "Mayhaps it be three afterall..."
3             "Three times and it's a charm."
4  "One fish, two fish, red fish, blue fish."
5          "There's only one cat in the hat."
6     "One does not simply code into pandas."
7                       "Two nights later..."
8             "Quoth the Raven... nevermore."

rgx = '({})'.format('|'.join(word_list))
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()

one      5
two      3
three    2
Name: 0, dtype: int64

Details

rgx
'(one|two|three)'

df.text.str.lower().str.extractall(rgx).iloc[:, 0]

   match
0  0          one
1  0          two
   1          one
2  0        three
3  0        three
4  0          one
   1          two
5  0          one
6  0          one
7  0          two
Name: 0, dtype: object
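
Note that the alternation pattern above matches substrings anywhere, so 'one' would also match inside a word like 'money'. If you want whole-word matches only, a hedged variant of the same extractall approach with \b word boundaries (the two-row frame here is hypothetical, built to show the trap):

```python
import pandas as pd

word_list = ['one', 'two', 'three']
# Hypothetical frame: 'money' contains 'one' as a substring.
df = pd.DataFrame({'text': ["No one has money", "two or three"]})

# \b anchors keep 'one' from matching inside 'money'.
rgx = r'\b({})\b'.format('|'.join(word_list))
counts = df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
print(counts)  # one, two, three each counted once; 'money' is not matched
```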

Performance

Small

# Zero's code 
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1000 loops, best of 3: 1.55 ms per loop
# Andy's code
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
     long_string.count(w)

10000 loops, best of 3: 132 µs per loop
%%timeit
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
100 loops, best of 3: 2.53 ms per loop

Large

df = pd.concat([df] * 100000)
%%timeit 
pd.Series({w: df.text.str.count(w, flags=re.IGNORECASE).sum() for w in word_list}).sort_values(ascending=False)
1 loop, best of 3: 4.34 s per loop
%%timeit
long_string = "".join(df.iloc[:, 0]).lower()
for w in word_list:
    long_string.count(w)

10 loops, best of 3: 151 ms per loop
%%timeit 
df['text'].str.lower().str.extractall(rgx).iloc[:, 0].value_counts()
1 loop, best of 3: 4.12 s per loop

5 Comments

I am intrigued by which of these will be the most performant; I claim it's either this one or mine. It's unclear how much indirection Python adds here (I don't think the DataFrame helps any). Is the .iloc[:, 0] because extractall returns a DataFrame rather than a Series?
@AndyHayden Yes. extractall returns a DataFrame with a MultiIndex and a single column, which is why I grab the first column and call value_counts. I'll add some timings for this data :-)
@AndyHayden The winner is your answer by miles. Wow!
@AndyHayden Though I haven't increased the number of words; that'd make a huge difference, I'd assume.
yup! Make it 50-odd substrings and it's not looking so good; basically the playoff is O(n * m). My answer relies on Python string manipulation (which is very fast, if we only do it a few times) with no indirection, basically just C. I suspect Zero's answer will also be hit by m similarly to mine.

Use

In [52]: pd.Series({w: df.TEXT.str.contains(w, case=False).sum() for w in word_list})
Out[52]:
one      5
three    2
two      3
dtype: int64

Or, to count multiple occurrences within each row:

In [53]: pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})
Out[53]:
one      5
three    2
two      3
dtype: int64
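
The difference between the two only shows up when a word appears more than once in a single row; none of the question's rows repeat a word, which is why both give the same totals. A minimal sketch with a hypothetical repeated-word row:

```python
import re
import pandas as pd

word_list = ['one', 'two', 'three']
# Hypothetical row repeating 'one', to show where the two versions differ.
df = pd.DataFrame({'TEXT': ["One and one is two"]})

# contains: number of rows in which the word appears at least once
per_row = pd.Series({w: df.TEXT.str.contains(w, case=False).sum()
                     for w in word_list})

# count: total occurrences across all rows
per_hit = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum()
                     for w in word_list})

print(per_row['one'])  # 1 -- one row contains 'one'
print(per_hit['one'])  # 2 -- 'one' occurs twice in that row
```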

Use sort_values

In [55]: s = pd.Series({w: df.TEXT.str.count(w, flags=re.IGNORECASE).sum() for w in word_list})

In [56]: s.sort_values(ascending=False)
Out[56]:
one      5
two      3
three    2
dtype: int64

3 Comments

Excellent! This worked wonderfully. By chance, is there also a way to order the list from most frequent to least frequent?
@Leggerless Yes, you call sort_values.
Awesome. Thanks a lot for the quick response.
