3

I'm working on a database server software product (see my profile) and we see the need to implement free- text searching in our software. The query language standard we are using only supports free-text search using a BT type Regex. The only way we can use our free-text database indexes together with Regex seems to be to implement our own. My questions to SO is:

  • Where can I find papers/examples/patterns on how to implement a BT style Regex?

  • Is it worth looking into taking one of the open source C/C++ Regex libraries and altering the code to fit our needs?

3
  • 1
    This is a huge undertaking. I think it would be crazy not to start with an existing library. Hopefully there is one that closely matches your standard. If so, this seems like a relatively straightforward task. Commented Oct 1, 2012 at 11:26
  • @SingerOfTheFall: I assume BT stands for backtracking Commented Oct 1, 2012 at 11:53
  • @SingerOfTheFall There are two Regex algorithm types, BT is Perl-like syntax and FSA. See lh3lh3.users.sourceforge.net/reb.shtml Commented Oct 1, 2012 at 12:12

2 Answers 2

2

If I'm not wrong SPARQL uses the XPath/XQuery regular expression syntax which is based on PERL regular expressions (At least that is what the W3C docs say)

If this is indeed the case then you can use PCRE from http://www.pcre.org/

It is licensed as BSD so you will be able to use it in a commercial product

If your syntax is slightly modified you can probably write a small routine to normalize it to the PERL syntax used by PCRE

Sign up to request clarification or add additional context in comments.

5 Comments

It's not finding a Regex library that is the issue, we already use a Regex library. The problem is finding a Regex (or how to implement one) that we can tie together with our database indexes, this is useful since we then can use our own free-text indexes which is much faster than using Regex filtering on each result in a query.
Well. I'm not a database backend expert but when you use PCRE you pre-compile the regular expression and then you can use it against the textual fields in the record set you fetch according to other query criteria. I fail to see the problem, Maybe you can clarify your question? But if you are set on implementing your own you can still take PCRE as a reference - It's not a big library.
What you are suggesting is what is usually called serial scanning, what I'm wanti is a two-stage design pattern called indexing+searching.
I can only guess what is the structure of your full text indices - most full text indices just index words and locations in the string. Maybe you can combine the two with a small analysis of the regular expression so you can build a derived expression you can check against individual index entries and only those which pass the preliminary check are scanned in full. But probably only a small subset of the possible regular expressions can fit this scheme. However I'm not sure that implementing your own will do you any good in that respect - BUT I don't know your engine.
Yeah, I suspected as much. I have found two papers on the subject on REGEX indexing online; one from Bell Labs and one from UCLA/IBM.
1

There are two papers I have found on the subject on REGEX indexing online; one from Bell Labs and one from UCLA/IBM. I'm still not sure if to use an existing Regex library and modify it or write one from scratch.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.