2

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:

re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)

What is the correct way to complement a group of characters?

2
  • Do you have to check for a word boundary, or is the value of s exactly the word in question? Commented Nov 13, 2013 at 9:47
  • You might want to consider whitespace characters as well. As you state that you want to check if a word ends with a certain character sequence, you probably do not want to match word combinations to which these conditions apply. So make sure to include \s in your pattern. Commented Nov 13, 2013 at 9:55

4 Answers

3

Non-regex solution using str.endswith:

>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
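The tuple above only tells you whether a word ends in one of the excluded sequences plus s, and it leaves out the vowels. A minimal sketch of the full check from the question, assuming the string is exactly the word; the names FORBIDDEN and ends_in_plain_s are hypothetical and not part of the answer:

```python
# Sequences that must NOT come right before the final "s" (list taken from the question).
FORBIDDEN = ('s', 'x', 'y', 'z', 'ch', 'sh', 'a', 'e', 'i', 'o', 'u')

def ends_in_plain_s(word):
    """True if word ends in "s" and the part before it ends in none of FORBIDDEN."""
    if not word.endswith('s'):
        return False
    stem = word[:-1]  # drop the trailing "s"
    return not stem.endswith(FORBIDDEN)

print(ends_in_plain_s('bots'))    # True
print(ends_in_plain_s('foochs'))  # False
print(ends_in_plain_s('boxs'))    # False
```

str.endswith accepts a tuple of suffixes, so no itertools is needed for this variant.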

1 Comment

+1 for a non-regex solution for a not-necessarily regex problem :)
2
[^ s|x|y|z|ch|sh|a|e|i|o|u]

This is an inverted character class. Character classes match single characters, so in your case it will match any single character except one of these: acehiosuxyz |. Note that it does not respect compound groups like ch and sh, and the | are interpreted as literal pipe characters that simply appear multiple times in the character class (duplicates are ignored).

So this is actually equivalent to the following character class:

[^acehiosuxyz |]
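A quick way to convince yourself of the equivalence is to compare both classes character by character. This is only a sketch; the variable names and the sample word paths are my own choices for illustration:

```python
import re

original = re.compile(r"[^ s|x|y|z|ch|sh|a|e|i|o|u]")
simplified = re.compile(r"[^acehiosuxyz |]")

# Compare the two classes on every printable ASCII character.
printable = [chr(code) for code in range(32, 127)]
classes_agree = all(
    bool(original.fullmatch(ch)) == bool(simplified.fullmatch(ch))
    for ch in printable
)
print(classes_agree)  # True

# "paths" ends in "ths", which the rule allows, but "h" on its own is in the
# class, so the pattern from the question rejects the word.
attempt = re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s", "paths")
print(attempt)  # None
```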

Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:

.*(?<!.[ sxyzaeiou]|ch|sh)s

This one has the problem that it mishandles two-character words such as as. In Python's re module a look behind needs to have a fixed size, so to include both the single characters and the two-character groups in one look behind, I had to add another character (the .) to the single character matches. When there is no character in front of the vowel, that alternative cannot match and the word is wrongly accepted. You can however use two separate look behinds instead:

.*(?<![ sxyzaeiou])(?<!ch|sh)s

As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
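As a rough sketch of the word-boundary variant, assuming you want to pull matching words out of a longer text: the names PLAIN_S_WORD and find_plain_s_words are hypothetical, and the space from the character class is dropped because \w never matches a space anyway.

```python
import re

# \b on both sides: the match has to start and end at a word boundary.
# Two separate look behinds, each with a fixed width (1 and 2 characters).
PLAIN_S_WORD = re.compile(r"\b\w*(?<![sxyzaeiou])(?<!ch|sh)s\b")

def find_plain_s_words(text):
    """Return all words in text that end in an allowed sequence followed by "s"."""
    return PLAIN_S_WORD.findall(text)

print(find_plain_s_words("bots boxes catsup paths wishs as"))  # ['bots', 'paths']
```

catsup is skipped because its s is not at the end of the word, and as is skipped because the single-character look behind sees the a.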

6 Comments

The latter will fail on words like as.
This will also fail when the regexp matches something that's not the end of the string, as in catsup, won't it? So you need a $ on the end. Also note @thg435's point that variable-length lookbehind is not allowed.
@LarsH The original regexp didn’t consider this case either, so I have left it out; still a good point! (The look-behind did not have a variable length though)
@poke: True about the original, and I made the same mistake, but since the question specified the end of the word, I don't think we're excused regarding that point. Regarding variable length, I see what you mean about the extra ., which as thg pointed out, had other incorrect consequences. Good explanation, BTW.
@LarsH Further mentioned the boundary case. And thanks but sometimes it seems that detailed explanations aren’t favored over short solution-only answers… but I’m glad that at least OP found this useful.
1

It looks like you need a negative lookbehind here:

import re
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'

print(re.search(rx, 'bots'))  # match object
print(re.search(rx, 'boxs'))  # None

Note that re doesn't support variable-width lookbehinds, therefore you need two of them.

Comments

0

How about

re.search("([^sxyzaeiouh]|[^cs]h)s$", s)

Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.

This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.

It also assumes that you don't need to match the "word" hs, even though it conforms literally to your rules. If you want to match that as well, you could add another alternative:

re.search("([^sxyzaeiouh]|[^cs]h|^h)s$", s)

But again, we're assuming that the beginning of the word is the beginning of the string.
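Here is a small comparison of the two variants, assuming the second alternative is [^cs]h in both and that each string is exactly one word. The names basic and with_hs and the sample words are made up for illustration:

```python
import re

basic = re.compile(r"([^sxyzaeiouh]|[^cs]h)s$")
with_hs = re.compile(r"([^sxyzaeiouh]|[^cs]h|^h)s$")

# Collect (basic result, with_hs result) for a few sample words.
results = {}
for word in ["bots", "paths", "techs", "wishs", "boxs", "hs"]:
    results[word] = (bool(basic.search(word)), bool(with_hs.search(word)))
    print(word, results[word])
```

Only hs should differ between the two: the first pattern needs some character before the h, while the ^h alternative accepts an h at the very start of the string.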

Note that the raw string notation, r"...", is unnecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.
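To illustrate when the raw string does matter, here is a short sketch using \b (the word boundary mentioned in another answer); the test word cats is arbitrary:

```python
import re

# Without a backslash, both notations produce the same pattern string.
same_pattern = "[^cs]h" == r"[^cs]h"
print(same_pattern)  # True

# With a backslash they differ: "\b" is a single backspace character,
# while r"\b" is a backslash followed by "b" (the regex word boundary).
print(len("\b"), len(r"\b"))  # 1 2

plain = re.search("s\b", "cats")   # looks for "s" followed by a backspace
raw = re.search(r"s\b", "cats")    # looks for "s" at a word boundary
print(plain)            # None
print(raw is not None)  # True
```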

2 Comments

This matches "wst", for example. Shouldn't you add an end-of-string or end-of-word symbol to the end of the regex?
@JoSo: You're right. I was just coming to that conclusion too.
