6

I need to filter a collection of strings based on a rather complex query - in its "raw" form it looks like this:

nano* AND (regulat* OR *toxic* OR ((risk OR hazard) AND (exposure OR release)) )

An example of one of the strings to match against:

Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels

So, I need to match using AND, OR, and wildcard characters - so I presume I'll need to use a regex in JavaScript.

I have it all looping correctly, filtering and generally working, but I'm 100% sure my regex is wrong - and some results are being omitted wrongly - here it is:

/(nano[a-zA-Z])?(regulat[a-zA-Z]|[a-zA-Z]toxic[a-zA-Z]|((risk|hazard)*(exposure|release)))/i

Any help would be greatly appreciated - I really can't abstract my mind correctly to understand this syntax!

UPDATE:

A few people have pointed out the importance of the order in which the regex is constructed. However, I have no control over the text strings that will be searched, so I need a solution that works regardless of the order in which the terms appear.

UPDATE:

Eventually used a PHP solution, due to deprecation of twitter API 1.0, see pastebin for example function ( I know it's better to paste code here, but there's a lot... ):

function: http://pastebin.com/MpWSGtHK usage: http://pastebin.com/pP2AHEvk

Thanks for all help

7
  • You might want to try a live RegExp testing tool. Commented Feb 26, 2013 at 13:59
  • In your example string, 'nano' comes after 'regulatory', but in your regex, it's the other way round. Is there any expected pattern in this such that one will always come before the other? A few more examples would help explain your requirement. Commented Feb 26, 2013 at 14:02
  • @Barney - good advice, that's how I got this far Commented Feb 26, 2013 at 14:48
  • @Chirag64 - the strings I'm matching against were initially tweets, from this feed: twitter.com/nanoTOES - so, there is no order, we're just trying to reduce the number and increase the relevancy. Commented Feb 26, 2013 at 14:50
  • @QL Studio: I'm afraid you'll have to use multiple if conditions with AND & OR instead of trying to fit everything in a single regex in that case. Commented Feb 26, 2013 at 14:59

2 Answers

24

A single regex is not the right tool for this, IMO:

/^(?=.*\bnano)(?=(?:.*\bregulat|.*toxic|(?=.*(?:\brisk\b|\bhazard\b))(?=.*(?:\bexposure\b|\brelease\b))))/i.test(subject)

would return true if the string fulfills the criteria you set forth, but I find nested lookaheads quite incomprehensible. If JavaScript supported commented regexes, it would look like this:

^                 # Anchor search to start of string
(?=.*\bnano)      # Assert that the string contains a word that starts with nano
(?=               # AND assert that the string contains...
 (?:              #  either
  .*\bregulat     #   a word starting with regulat
 |                #  OR
  .*toxic         #   any word containing toxic
 |                #  OR
  (?=             #   assert that the string contains
   .*             #    any string
   (?:            #    followed by
    \brisk\b      #    the word risk
   |              #    OR
    \bhazard\b    #    the word hazard
   )              #    (end of inner OR alternation)
  )               #   (end of first AND condition)
  (?=             #   AND assert that the string contains
   .*             #    any string
   (?:            #    followed by
    \bexposure\b  #    the word exposure
   |              #    OR
    \brelease\b   #    the word release
   )              #    (end of inner OR alternation)
  )               #   (end of second AND condition)
 )                #  (end of outer OR alternation)
)                 # (end of lookahead assertion)

Note that the entire regex is composed of lookahead assertions, so the match result itself will always be the empty string.
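
As a quick check of that last point, here is a minimal sketch that runs the lookahead-only pattern against the example string from the question. The variable names are only for illustration. Because lookaheads consume no characters, the match succeeds but the matched text is empty.

```javascript
// The lookahead-only pattern from above, stored in a variable (name is illustrative).
const lookaheadQuery = /^(?=.*\bnano)(?=(?:.*\bregulat|.*toxic|(?=.*(?:\brisk\b|\bhazard\b))(?=.*(?:\bexposure\b|\brelease\b))))/i;

// Example string taken from the question.
const subject = "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels";

const result = lookaheadQuery.exec(subject);

// exec() returns null when nothing matches; otherwise result[0] is the matched text.
const matchedText = result === null ? null : result[0];   // "" (empty string)
const matchedIndex = result === null ? -1 : result.index; // 0, because of the ^ anchor

console.log(JSON.stringify(matchedText), matchedIndex);
```

So the pattern is fine for a yes/no test, but not for extracting the words that matched.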

Instead, you could combine several simple regexes:

if (/\bnano/i.test(str) &&
    (
        /\bregulat|toxic/i.test(str) ||
        (
            /\b(?:risk|hazard)\b/i.test(str) &&
            /\b(?:exposure|release)\b/i.test(str)
        )
    )
) {
    /* all tests pass */
}
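
Since the question is about filtering a whole collection, the same tests could be wrapped in a predicate and passed to Array.prototype.filter. This is only a sketch: the function name matchesQuery and all sample strings except the first are made up for illustration.

```javascript
// Hypothetical helper: true when str satisfies
// nano* AND (regulat* OR *toxic* OR ((risk OR hazard) AND (exposure OR release)))
function matchesQuery(str) {
    return /\bnano/i.test(str) &&
        (
            /\bregulat|toxic/i.test(str) ||
            (
                /\b(?:risk|hazard)\b/i.test(str) &&
                /\b(?:exposure|release)\b/i.test(str)
            )
        );
}

// The first string is from the question; the others are invented test data.
const titles = [
    "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels",
    "Nanoparticle release and the risk to workers",
    "Nanotube synthesis at room temperature",
    "Hazard and exposure assessment for solvents"
];

// Keeps the first two titles; the third lacks a second term, the fourth lacks "nano".
const relevant = titles.filter(matchesQuery);
console.log(relevant);
```

The order of the terms in each string does not matter here, because every regex scans the whole string independently.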

3 Comments

please could you explain the [\b] - I read that "\b is a backspace character" but I'm not sure how that's relevant?
@QLStudio: In a normal string, "\b" is indeed a backspace character. In a regex, /\b/ (equivalent to new RegExp("\\b")) is a word boundary anchor. This anchor matches at the start or end of an alphanumeric word. Therefore /\brisk\b/ only matches "risk" or "There is a risk!", but not "brisk" or "risky".
thanks for the explanation - I've moved away from JavaScript, because version 1.0 of the API is shutting down, but the regexes should work almost as is in PHP I think - I'll post a complete answer when I've got it all fixed up.
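
The word-boundary behaviour described in the comments above can be checked directly. A small sketch, using the same four sample strings mentioned in the explanation (variable names are illustrative):

```javascript
// \b inside a regex literal is a word-boundary anchor.
const riskWord = /\brisk\b/;

const samples = ["risk", "There is a risk!", "brisk", "risky"];
const results = samples.map(function (s) {
    return riskWord.test(s);
});
// results is [true, true, false, false]

// "\b" inside an ordinary string literal is the backspace character (char code 8).
const backspaceCode = "\b".charCodeAt(0);

console.log(results, backspaceCode);
```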
2

Regular expressions have to move through the string in order. You have "nano" before "regulat" in the pattern, but they are swapped in the test string. Instead of using regexen to do this, I'd stick with plain old string parsing:

// indexOf is case-sensitive, so compare against a lowercased copy
var s = str.toLowerCase();
if (s.indexOf('nano') > -1) {
    if (s.indexOf('regulat') > -1 || s.indexOf('toxic') > -1
        || ((s.indexOf('risk') > -1 || s.indexOf('hazard') > -1)
        && (s.indexOf('exposure') > -1 || s.indexOf('release') > -1)
    )) {
        /* all tests pass */
    }
}

If you want to actually capture the words (e.g. get "Regulatory" from where "regulat" is), I would split the sentence by word breaks and inspect individual words.
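
One way that splitting approach might look is sketched below. The helper name wordsStartingWith is hypothetical, and splitting on /\W+/ assumes plain ASCII words; text with accented or non-Latin characters would need a different separator.

```javascript
// Hypothetical helper: returns every word in str that starts with prefix,
// compared case-insensitively. Words are runs of [A-Za-z0-9_].
function wordsStartingWith(str, prefix) {
    const lowerPrefix = prefix.toLowerCase();
    return str.split(/\W+/).filter(function (word) {
        return word.toLowerCase().indexOf(lowerPrefix) === 0;
    });
}

// Example string taken from the question.
const title = "Workshop on the Second Regulatory Review on Nanomaterials, 30 January 2013, Brussels";

const regulatWords = wordsStartingWith(title, "regulat"); // ["Regulatory"]
const nanoWords = wordsStartingWith(title, "nano");       // ["Nanomaterials"]

console.log(regulatWords, nanoWords);
```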

7 Comments

@EP - please see my comment above, the order of the string I'm matching against is as random as its content.. I'm just trying to "filter" over a large collection of tweets based on the regex - perhaps this is the wrong approach?
@QLStudio is my suggestion inappropriate for that?
@EP - yes, sorry - your solution solves the order problem.. but can I still use wildcard ( * ) characters in a normal JS search?
I need to match nano* ( eg. nanotechnology ) and regulat* ( eg. regulation )
indexOf matches substrings, not whole words .. so "nanotechnology".indexOf('nano') returns 0 (which is greater than -1)
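
If the wildcard terms from the original query need to be kept as written, one option is to translate each term into a small regex. The sketch below is an assumption about what the wildcards mean, not something stated in the thread: it treats * as "zero or more word characters" and requires word boundaries wherever no * is present. The helper name wildcardToRegExp is made up.

```javascript
// Hypothetical helper: converts a term such as "nano*" or "*toxic*" into a
// case-insensitive RegExp. "*" is assumed to mean zero or more word characters.
function wildcardToRegExp(term) {
    // Escape regex metacharacters, leaving "*" alone so it can be translated next.
    const escaped = term.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp("\\b" + escaped.replace(/\*/g, "\\w*") + "\\b", "i");
}

const nano = wildcardToRegExp("nano*");    // same as /\bnano\w*\b/i
const toxic = wildcardToRegExp("*toxic*"); // same as /\b\w*toxic\w*\b/i
const risk = wildcardToRegExp("risk");     // same as /\brisk\b/i

console.log(nano.test("Nanomaterials"));   // true
console.log(toxic.test("ecotoxicology"));  // true
console.log(risk.test("brisk"));           // false
```

The resulting regexes can then be combined with && and || in the same way as the hand-written tests in the answers.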