I want to count how often a regular expression occurs across multiple DataFrame columns (the preceding and following characters are needed to identify the pattern). I found a solution, but it seems a little slow. Is there a more sophisticated way?

column_A column_B column_C
Test • test abc winter • sun snow rain blank
blabla • summer abc break • Data test letter • stop.

So far I have come up with this solution, which is slow:

pattern = r"(?<=[A-Za-z]) • (?=[A-Za-z])"
print(df["column_A"].str.count(pattern).sum()
      + df["column_B"].str.count(pattern).sum()
      + df["column_C"].str.count(pattern).sum())
  • If you want a solution that improves the timings, you need to provide better test data. How did you come to the conclusion that it is slow? Commented Jul 22, 2022 at 7:27
  • The regexps can be a bit faster if you replace the lookbehind with a consuming pattern, i.e. (?<=[A-Za-z]) becomes [A-Za-z]. Commented Jul 22, 2022 at 7:33
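
A minimal sketch of what the second comment suggests, using only the standard re module. The sample strings are made up for illustration and are not the question's data. The consuming variant swaps the lookbehind for a plain character class and keeps the lookahead, so back-to-back matches such as "a • b • c" are still both counted. Whether it is actually faster on real data is worth timing rather than assuming.

```python
import re

# Pattern from the question: only " • " is consumed, the letters are lookarounds.
lookaround = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")
# Variant from the comment: the leading letter is consumed, the lookahead stays.
consuming = re.compile(r"[A-Za-z] • (?=[A-Za-z])")

# Hypothetical sample strings, chosen to include an adjacent-match case
# and a case where the preceding character is not a letter.
samples = ["Test • test abc", "a • b • c", "1 • x", "letter • stop.", "no bullet here"]

lookaround_counts = [len(lookaround.findall(s)) for s in samples]
consuming_counts = [len(consuming.findall(s)) for s in samples]

# Both lists should agree if the two patterns are interchangeable for counting.
print(lookaround_counts, consuming_counts)
```

If the lookahead were also replaced by a consuming class, the trailing letter would be used up and "a • b • c" would count only once, so only the lookbehind is swapped here.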

2 Answers


str.count can be applied to every column of the DataFrame without hard-coding the column names. Try:

sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))

I tried this on a 1000 × 1000 DataFrame. Here is a benchmark for reference:

%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
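
For anyone who wants to run this approach end to end, here is a small self-contained sketch. The DataFrame is rebuilt from the question's sample, and the way the cells are split across the three columns is an assumption. The stack() variant is an alternative added for comparison and is not part of the answer above; no timing claim is made for it.

```python
import pandas as pd

pattern = r"(?<=[A-Za-z]) • (?=[A-Za-z])"

# Sample data from the question; the cell boundaries are assumed.
df = pd.DataFrame({
    "column_A": ["Test • test abc", "blabla • summer abc"],
    "column_B": ["winter • sun", "break • Data test"],
    "column_C": ["snow rain blank", "letter • stop."],
})

# Approach from the answer: count matches per column, then add the column totals.
total_apply = sum(df.apply(lambda col: col.str.count(pattern).sum()))

# Alternative (not from the answer): flatten all cells into one Series first.
total_stack = df.stack().str.count(pattern).sum()

print(total_apply, total_stack)
```

Both expressions avoid naming the columns, so they keep working if columns are added or renamed, provided every column holds strings.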

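You can use a generator expression with re.search. On the sample data this reduces the time from 938 µs to about 26 µs. (Make sure to use a generator rather than building a list.) Note that re.search counts the cells that contain at least one match, which equals the number of occurrences only when no cell holds more than one match.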

import re

res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
          for col in ['column_A', 'column_B', 'column_C'])
print(res)
# 5

Benchmark:

%%timeit 
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit 
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
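
One thing worth checking before adopting the re.search version: it counts cells with at least one match, not individual occurrences. The sketch below uses a hypothetical list of cells in which one cell holds two matches, and shows a findall-based generator that counts occurrences instead. Precompiling the pattern is an addition for illustration and was not part of the benchmark above.

```python
import re

pattern = re.compile(r"(?<=[A-Za-z]) • (?=[A-Za-z])")

# Hypothetical cells; the first one contains two matches.
cells = ["a • b • c", "winter • sun", "snow rain blank"]

# re.search-style count: number of cells with at least one match.
cells_with_match = sum(1 for cell in cells if pattern.search(cell))

# Occurrence count: every non-overlapping match in every cell.
occurrences = sum(len(pattern.findall(cell)) for cell in cells)

print(cells_with_match, occurrences)
```

On the question's sample data the two numbers happen to coincide, so the difference only shows up once a cell contains the pattern more than once.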
# --------------------------------------------------------------------#

3 Comments

@KevinChoonLiangYew, add it to your answer, then I will check.
No offense, your answer is better. I'm fine with you doing the timing for me :D
@KevinChoonLiangYew, should I time your answer?
