Regex in Python dataframe: count occurrences of pattern

Question

I want to count how often a regular expression (the preceding and following characters are needed to identify the pattern) occurs across multiple dataframe columns. I found a solution, but it seems a little slow. Is there a more sophisticated way?

column_A	column_B	column_C
Test • test abc	winter • sun	snow rain blank
blabla • summer abc	break • Data	test letter • stop.

So far I created a solution which is slow:

print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())

If you want a solution that improves the timings, you need to provide better test data. How did you come to the conclusion that it is slow?
– Dani Mesejo, Commented Jul 22, 2022 at 7:27
The regexps can be a bit faster if you replace the lookbehinds with a consuming pattern, i.e. (?<=[A-Za-z]) becomes [A-Za-z]
– Wiktor Stribiżew, Commented Jul 22, 2022 at 7:33
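
To illustrate the suggestion in the comment above, here is a minimal sketch that compares the original lookbehind pattern with a consuming variant on a few strings. The sample strings beyond those in the question and the helper name count_matches are assumptions made for illustration. It only checks that both patterns give the same counts on these inputs; it says nothing about speed.

```python
import re

# Original pattern: lookbehind and lookahead around " • ".
LOOKAROUND = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")
# Variant from the comment: the leading letter is consumed instead of looked behind.
CONSUMING = re.compile(r"[A-Za-z] • (?=[A-Za-z])")

# First four strings come from the question; the last two are made-up edge cases.
samples = [
    "Test • test abc",
    "winter • sun",
    "snow rain blank",
    "test letter • stop.",
    "a • b • c",
    "1 • x",
]


def count_matches(pattern, text):
    """Return the number of non-overlapping matches of pattern in text."""
    return len(pattern.findall(text))


lookaround_counts = [count_matches(LOOKAROUND, s) for s in samples]
consuming_counts = [count_matches(CONSUMING, s) for s in samples]

print(lookaround_counts)
print(consuming_counts)
```

The lookahead is kept because it does not consume the following letter, so back-to-back separators such as "a • b • c" are still counted twice.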

score 1 · Accepted Answer · 2022-07-22 08:24:37Z

str.count can be applied to every column of the dataframe without hard-coding each column name. Try:

sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))

I tested this with 1000 × 1000 dataframes. Here is a benchmark for reference:

%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
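
As a sanity check, the following sketch rebuilds the sample table from the question and confirms that the apply-based version returns the same total as the hard-coded version. The way df is constructed here is an assumption about how the data is loaded, and the stack() variant is an alternative not taken from the answer; it is shown only as one possible way to flatten all columns first.

```python
import pandas as pd

PATTERN = r"(?<=[A-Za-z]) • (?=[A-Za-z])"

# Sample data copied from the question's table.
df = pd.DataFrame(
    {
        "column_A": ["Test • test abc", "blabla • summer abc"],
        "column_B": ["winter • sun", "break • Data"],
        "column_C": ["snow rain blank", "test letter • stop."],
    }
)

# Hard-coded version from the question.
total_hardcoded = int(
    df["column_A"].str.count(PATTERN).sum()
    + df["column_B"].str.count(PATTERN).sum()
    + df["column_C"].str.count(PATTERN).sum()
)

# apply-based version: one count per column, then summed.
total_apply = int(df.apply(lambda col: col.str.count(PATTERN).sum()).sum())

# Alternative (assumption): flatten all columns into one Series first.
total_stack = int(df.stack().str.count(PATTERN).sum())

print(total_hardcoded, total_apply, total_stack)
```

Whether any of these variants is faster depends on the size and content of the real data, so they are best timed on a representative dataframe.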

edited Jul 22, 2022 at 8:24

answered Jul 22, 2022 at 7:41

user16836078


Mahdi F. · Accepted Answer · 2022-07-22 07:48:04Z

0

You can use a generator expression with re.search, which reduces the time from 938 µs to 26.7 µs. (Make sure to use a generator rather than building a list.)

import re

# For each column, count the cells that contain at least one match.
res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
       for col in ['column_A', 'column_B','column_C'])
print(res)
# 5

Benchmark:

%%timeit 
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit 
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# --------------------------------------------------------------------#
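
One caveat about the approach above: re.search stops at the first match, so it counts cells that contain the pattern, whereas str.count counts every occurrence. On the question's sample data both give 5 because no cell contains the pattern twice. The sketch below, using made-up strings, shows where the two differ and how findall could be used if total occurrences are needed.

```python
import re

PATTERN = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")

# Made-up cells; the first one contains the pattern twice.
cells = ["a • b • c", "winter • sun", "snow rain blank"]

# Number of cells with at least one match (what re.search measures).
cells_with_match = sum(1 for cell in cells if PATTERN.search(cell))

# Total number of matches across all cells (what str.count measures).
total_occurrences = sum(len(PATTERN.findall(cell)) for cell in cells)

print(cells_with_match, total_occurrences)
```

If each cell can contain the pattern at most once, the re.search version is sufficient; otherwise the findall-based count matches the original str.count semantics.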

edited Jul 22, 2022 at 7:48

answered Jul 22, 2022 at 7:35

Mahdi F.


3 Comments

Mahdi F. Over a year ago

@KevinChoonLiangYew, add it to your answer, then I will check.

user16836078 Over a year ago

No offense, your answer is better. I'm fine with you doing the timing for me :D

Mahdi F. Over a year ago

@KevinChoonLiangYew, should I time your answer?
