1

I need to implement an algorithm for multiple extended string matching in text.

Extended means the presence of wildcards (any number of characters instead of a star), for example:

abc*def //matches abcdef, abcpppppdef etc.

Multiple means that the search is going on simultaneously for multiple string patterns (not a separate search for each pattern), for example:

abc*def
abc
whatever
some*string

QUESTION:

What is the fast algorithm that can do multiple extended string matching?

Preferably, optimized for SIMD instructions and multicore implementation. Open source implementation (C/C++/Python) would be great as well.

Thank you

1 Answer 1

2

I think that it might make sense to start by reading the following Wikipedia article's section: http://en.wikipedia.org/wiki/Regular_expression#Implementations_and_running_times. You can then perform a literature review on algorithms, implementing regular expression pattern matching.

In terms of practical implementation, there is a large variety of regular expression (regex) engines in a form of libraries, focused on one or more programming languages. Most likely, the best and most popular option is the C/C++ PCRE library, with its newest version PCRE2, released in 2015. Another C++ regex library, which is quite popular at Google, is RE2. I recommend you to read this paper, along with the two other, linked within the article, for details on algorithms, implementation and benchmarks. Just recently, Google has released RE2/J - a linear time version of RE2 for Java: see this blog post for details. Finally, I ran across an interesting pure C regex library TRE, which offers way too many cool features to list here. However, you can read about them all on this page.

P.S. If the above is not enough for you, feel free to visit this Wikipedia page for details of many more regex engines/libraries and their comparison across several criteria. Hope my answer helps.

Sign up to request clarification or add additional context in comments.

4 Comments

Aleksandr Blekh, thank you for the great reply! I'm looking for the speed processing of at least 10 Gbps on a single core of a modern CPU. Also, I don't need regular expression processing, just multiple extended string matching. From the paper you mentioned it seems that speeds wasn't bad.
By the way, do you have any practical experience with any of these engines? What can you recommend from your experience?
I would like to emphasize my interest in algorithms/implementations that rely heavily on SIMD instructions in modern CPUs.
@Konstantin, you're very welcome! I'm glad you liked my answer. Frankly, I don't have direct experience with libraries I've referred to, other than reading about them - sorry I can't be of more help in that regard. If I'll have additional thoughts or information, related to your questions and comments, I will certainly let you know. Good luck!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.