5

I have been banging my head against this for some time now: I want to capture all [a-z]+[0-9]? character sequences excluding strings such as sin|cos|tan etc. So having done my regex homework the following regex should work:

(?:(?!(sin|cos|tan)))\b[a-z]+[0-9]?

As you can see, I am using a negative lookahead with alternation - the \b after the closing parenthesis of the non-capturing group is critical to avoid matching the "in" of "sin", etc. The regex makes sense, and I have in fact tried it in RegexBuddy with Java as the target implementation and got the desired result, but it doesn't work with Java's Matcher and Pattern objects! Any thoughts?

cheers

6
  • Note: I don't think you need ?: when you use ?!. Commented Feb 3, 2010 at 10:26
  • The ?: is there so the group isn't captured for backreferences; it's for performance and shouldn't cause trouble. But I have tried without it, to no avail. Commented Feb 3, 2010 at 10:30
  • If you posted some sample inputs and what you expect from the output in each case, I think more people would be in a position to help. Commented Feb 3, 2010 at 10:33
  • @nvrs: regarding the ?: - zero-width assertions are not captured by default. As far as the regex engine is concerned, (?:(?!(sin|cos|tan))) is a complex way of saying (?!sin|cos|tan). Commented Feb 3, 2010 at 10:36
  • @ninesided: You are right. I am actually trying to parse a mathematical equation and extract the variables. A variable can be any string of characters [a-z] followed by an optional single digit, e.g. x1 + yvar2. However, I want to exclude some strings such as log, sin, etc., since they are reserved for functions implemented by my lib. Commented Feb 3, 2010 at 10:42

3 Answers

6

The \b is in the wrong place. It would be looking for a word boundary that didn't have sin/cos/tan before it. But a boundary just after any of those would have a letter at the end, so it would have to be an end-of-word boundary, which it can't be if the next character is a-z.

Also, the negative lookahead would (if it worked) exclude strings like cost, which I'm not sure you want if you're just filtering out keywords.

I suggest:

\b(?!sin\b|cos\b|tan\b)[a-z]+[0-9]?\b

Or, more simply, you could just match \b[a-z]+[0-9]?\b and filter out the strings in the keyword list afterwards. You don't always have to do everything in regex.
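As a rough illustration of both suggestions in Java, here is a minimal sketch. The class name, method names and keyword set are made up for the example, and the backslashes are doubled because the patterns are written as Java string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableExtractor {
    // Hypothetical keyword list; extend it with whatever function names are reserved.
    private static final Set<String> KEYWORDS = Set.of("sin", "cos", "tan");

    // "\\b" in Java source reaches the regex engine as \b (word boundary).
    private static final Pattern WITH_LOOKAHEAD =
            Pattern.compile("\\b(?!sin\\b|cos\\b|tan\\b)[a-z]+[0-9]?\\b");
    private static final Pattern PLAIN =
            Pattern.compile("\\b[a-z]+[0-9]?\\b");

    // Option 1: let the negative lookahead reject whole-word keywords.
    static List<String> viaLookahead(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WITH_LOOKAHEAD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Option 2: match every candidate, then drop the keywords in plain Java.
    static List<String> viaFilter(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PLAIN.matcher(input);
        while (m.find()) {
            if (!KEYWORDS.contains(m.group())) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String equation = "x1 + sin(yvar2) * cost";
        System.out.println(viaLookahead(equation)); // [x1, yvar2, cost]
        System.out.println(viaFilter(equation));    // [x1, yvar2, cost]
    }
}
```

Both variants keep cost (and cos1), since only the exact words sin, cos and tan are excluded. The filter version is probably easier to maintain as the keyword list grows.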


7 Comments

Matches cos1 but it should not (if I understood the requirement correctly).
@Tomalak: No, the negative lookahead is meant to match full words, not prefixes. If there were a trig function called cos1, it would be listed as such: (?!(?:sin|cos1?|tan)\b)
Yeah, the requirements aren't wholly clear, but that was my guess.
@bobince: Thanks, you were right about the positioning of \b. Of course, the original regex would have matched most of what I wanted (although not completely correctly according to the requirements I described) if I hadn't forgotten to escape the \b for Java, i.e. \\b. Now I think how ridiculous \\\\ will look when you want to include a literal \ in the regex...
Yeah, backslashes easily get out of hand in nested escaping contexts! It's a pity Java doesn't have the ‘raw strings’ some languages use to get around the problem. (Or regex literals like in JS, though I personally find that a bit ugly.)
1

So you want [a-z]+[0-9]? (a sequence of at least one letter, optionally followed by a digit), unless that letter sequence resembles one of sin cos tan?

\b(?!(sin|cos|tan)(?=\d|\b))[a-z]+\d?\b

results:

cos   - no match
cosy  - full match
cos1  - no match
cosy1 - full match
bla9  - full match
bla99 - no match
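For anyone who wants to reproduce this table in Java, a small sketch follows. The class and method names are invented for the example, "full match" is interpreted here as Matcher.matches() on the whole input, and the backslashes are doubled for the Java string literal.

```java
import java.util.regex.Pattern;

class LookaheadTable {
    // Same regex as above, written as a Java string literal.
    static final Pattern P =
            Pattern.compile("\\b(?!(sin|cos|tan)(?=\\d|\\b))[a-z]+\\d?\\b");

    // True only if the pattern consumes the entire input.
    static boolean fullMatch(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"cos", "cosy", "cos1", "cosy1", "bla9", "bla99"};
        for (String s : samples) {
            System.out.printf("%-5s - %s%n", s, fullMatch(s) ? "full match" : "no match");
        }
    }
}
```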

2 Comments

Hi, thanks for replying, but I still don't get any matches. I see that, based on what I said, you added matches such as cosy etc., which is correct, but using: Pattern p = Pattern.compile("\b(?!(sin|cos|tan)(?=[^a-z]|\b))[a-z]+[0-9]?\b"); Matcher m = p.matcher(stringToMatch); I get no matches at all!
In Java strings backslashes need to be escaped. I have shown the pure regex. Of course you need to adapt it to the string escaping rules of your programming language yourself.
0

I forgot to escape the \b for Java, so \b should be \\b, and it now works. Cheers
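To make the failure mode concrete: inside a Java string literal, "\b" is the backspace character (U+0008), so the regex engine never sees a word-boundary assertion. A minimal sketch, with a made-up class and helper name:

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: does the regex match anywhere in the input?
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\b" is a single backspace character, code point 8.
        System.out.println((int) "\b".charAt(0));           // 8

        // Looks for backspace + letters + backspace, which the input lacks.
        System.out.println(finds("\b[a-z]+\b", "x + y"));   // false

        // "\\b" is passed to the regex engine as \b, a word boundary.
        System.out.println(finds("\\b[a-z]+\\b", "x + y")); // true
    }
}
```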

2 Comments

When posting regex questions, it's a good idea to include the regex exactly as it appears in your source code; \bfoo\b looks fine, but "\bfoo\b" is likely to raise questions, even from people who don't speak Java and aren't sure how its string literals work.
Also, did you try having RegexBuddy generate the Java source code? (That's the "Use" tab, in case you don't know.) I've never liked auto-generated source code, but I sometimes use "Use" to remind myself about the escaping rules for languages I'm not fluent in.
