Regex: Complement a group of characters (Python)

Question

I want to write a regex to check if a word ends in anything except s,x,y,z,ch,sh or a vowel, followed by an s. Here's my failed attempt:

re.match(r".*[^ s|x|y|z|ch|sh|a|e|i|o|u]s",s)

What is the correct way to complement a group of characters?

Do you have to check for a word boundary, or is the value of s exactly the word in question?
– LarsH, Commented Nov 13, 2013 at 9:47
You might want to consider whitespace characters as well. Since you want to check whether a word ends with a certain character sequence, you probably do not want to match combinations of words to which these conditions apply. So make sure to include \s in your pattern.
– Xiphias, Commented Nov 13, 2013 at 9:55

Ashwini Chaudhary · Accepted Answer · 2013-11-13 09:52:25Z

3

Non-regex solution using str.endswith:

>>> from itertools import product
>>> tup = tuple(''.join(x) for x in product(('s','x','y','z','ch','sh'), 's'))
>>> 'foochf'.endswith(tup)
False
>>> 'foochs'.endswith(tup)
True
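
The snippet above only tests for the excluded consonant endings followed by s. A minimal sketch of the full rule from the question (vowels included, result negated) could look like this; the function name ends_in_plain_s and the constant names are hypothetical, chosen only for illustration:

```python
# Endings that must NOT come directly before the final "s" (taken from the question).
EXCLUDED = ('s', 'x', 'y', 'z', 'ch', 'sh', 'a', 'e', 'i', 'o', 'u')

# str.endswith accepts a tuple of suffixes, so build every "excluded + s" ending once.
EXCLUDED_ENDINGS = tuple(ending + 's' for ending in EXCLUDED)


def ends_in_plain_s(word):
    """Return True if word ends in "s" that does not follow an excluded ending."""
    return word.endswith('s') and not word.endswith(EXCLUDED_ENDINGS)


for word in ('bots', 'boxs', 'foochs', 'bus', 'maths'):
    print(word, ends_in_plain_s(word))
```

This assumes the string holds exactly one word, as in the original example.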

edited Nov 13, 2013 at 9:52

answered Nov 13, 2013 at 9:45

Ashwini Chaudhary



1 Comment

poke Over a year ago

+1 for a non-regex solution for a not-necessarily regex problem :)

poke · Accepted Answer · 2013-11-13 10:23:17Z

2

[^ s|x|y|z|ch|sh|a|e|i|o|u]

This is an inverted character class. Character classes match single characters, so in your case it will match any character except one of these: acehiosuxyz |. Note that it does not respect compound groups like ch and sh, and the | are interpreted as literal pipe characters that simply appear multiple times in the character class (where duplicates are ignored).

So this is actually equivalent to the following character class:

[^acehiosuxyz |]
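
One way to check this equivalence is to run both classes over every printable ASCII character. This is only a quick sketch; the variable names are made up for the example:

```python
import re
import string

original = re.compile(r"[^ s|x|y|z|ch|sh|a|e|i|o|u]")
simplified = re.compile(r"[^acehiosuxyz |]")

# Collect every printable ASCII character on which the two classes disagree.
differences = [
    ch for ch in string.printable
    if bool(original.fullmatch(ch)) != bool(simplified.fullmatch(ch))
]
print(differences)  # an empty list means the two classes agree

# "c" and "h" are excluded individually, not as the pair "ch".
print(original.fullmatch("c"), original.fullmatch("h"))
```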

Instead, you will have to use a negative look behind to make sure that a trailing s is not preceded by any of the character sequences:

.*(?<!.[ sxyzaeiou]|ch|sh)s

This one mishandles two-character words. In Python's re module a look-behind must have a fixed width, so to combine the single characters and the two-character groups in one look-behind I had to pad the single-character alternative with an extra ., making every alternative two characters wide. For a word like as there are not two characters before the s, so the look-behind cannot match and the word is wrongly accepted. You can avoid this by using two separate look-behinds instead:

.*(?<![ sxyzaeiou])(?<!ch|sh)s
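
A small comparison of the two patterns, as a sketch under the assumption that the standard re module is used (the names single and split are illustrative). In CPython's re, a negative look-behind that is wider than the text available before the current position simply succeeds, which is why the combined version lets as through:

```python
import re

# One look-behind, single characters padded to width 2 with a leading "."
single = re.compile(r".*(?<!.[ sxyzaeiou]|ch|sh)s")
# Two look-behinds, one per width
split = re.compile(r".*(?<![ sxyzaeiou])(?<!ch|sh)s")

for word in ("as", "bus", "cats", "foochs"):
    print(word, bool(single.match(word)), bool(split.match(word)))
```

Only as should differ between the two columns: the combined look-behind accepts it, the split version rejects it.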

As LarsH mentioned in the comments, if you really want to match words that end with this, you should add some kind of boundary at the end of the expression. If you want to match the end of the string/line, you should add a $, and otherwise you should at least add a word boundary \b to make sure that the word actually ends there.
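
To illustrate the effect of the boundary, here is a rough sketch with both variants; the test strings and variable names are invented for the example:

```python
import re

unanchored = re.compile(r".*(?<![ sxyzaeiou])(?<!ch|sh)s")
end_of_string = re.compile(r".*(?<![ sxyzaeiou])(?<!ch|sh)s$")
word_boundary = re.compile(r"(?<![ sxyzaeiou])(?<!ch|sh)s\b")

print(bool(unanchored.match("catsup")))     # matches, but only the "cats" part
print(bool(end_of_string.match("catsup")))  # the s is not at the end
print(bool(end_of_string.match("cats")))

# With \b and search(), a qualifying word can sit anywhere in a sentence.
print(bool(word_boundary.search("two cats sleep")))
print(bool(word_boundary.search("some catsup")))
```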

edited Nov 13, 2013 at 10:23

answered Nov 13, 2013 at 9:54

poke


6 Comments

georg Over a year ago

The latter will fail on words like as.

LarsH Over a year ago

This will also fail when the regexp matches something that's not the end of the string, as in catsup, won't it? So you need a $ on the end. Also note @thg435's point that variable-length lookbehind is not allowed.

poke Over a year ago

@LarsH The original regexp didn’t consider this case either, so I have left it out; still a good point! (The look-behind did not have a variable length though)

LarsH Over a year ago

@poke: True about the original, and I made the same mistake, but since the question specified the end of the word, I don't think we're excused regarding that point. Regarding variable length, I see what you mean about the extra ., which as thg pointed out, had other incorrect consequences. Good explanation, BTW.

poke Over a year ago

@LarsH Further mentioned the boundary case. And thanks but sometimes it seems that detailed explanations aren’t favored over short solution-only answers… but I’m glad that at least OP found this useful.


georg · Accepted Answer · 2013-11-13 09:53:44Z

1

It looks like you need a negative lookbehind here:

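import re

# Two fixed-width negative lookbehinds: one for the single characters, one for ch/sh.
rx = r'(?<![sxyzaeiou])(?<!ch|sh)s$'

print(re.search(rx, 'bots'))  # a match object
print(re.search(rx, 'boxs'))  # None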

Note that re doesn't support variable-width lookbehinds, so the one-character and two-character alternatives each need a lookbehind of their own.
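
As a quick sanity check, the same pattern can be run over a handful of sample words. The word list is made up for illustration:

```python
import re

rx = re.compile(r'(?<![sxyzaeiou])(?<!ch|sh)s$')

words = ['bots', 'boxs', 'churchs', 'bus', 'days', 'maths', 'hs']

# Keep only the words whose final "s" does not follow an excluded ending.
matching = [w for w in words if rx.search(w)]
print(matching)
```

Note that maths passes because its h is preceded by t rather than c or s.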

answered Nov 13, 2013 at 9:53

georg



LarsH · Accepted Answer · 2013-11-13 10:55:31Z

0

How about

re.search("([^sxyzaeiouh]|[^cs]h)s$", s)

Using search() instead of match() means the match doesn't have to begin at the beginning of the string, so we can eliminate the .*.
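
The difference between the two functions can be seen with a deliberately simplified, single-character version of the rule (the pattern below is an illustration, not the full solution):

```python
import re

pattern = r"[^sxyzaeiou]s$"  # simplified: ignores the ch/sh cases

print(re.match(pattern, "cats"))         # None: match() only tries position 0
print(re.match(".*" + pattern, "cats"))  # a match, thanks to the leading .*
print(re.search(pattern, "cats"))        # a match, found by scanning forward
```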

This is assuming that the end of the word is the end of the string; i.e. we don't have to check for a word boundary.

It also assumes that you don't need to match the "word" hs, even though it conforms literally to your rules. If you want to match that as well, you could add another alternative:

re.search("([^sxyzaeiouh]|[^cs]h|^h)s$", s)

But again, we're assuming that the beginning of the word is the beginning of the string.
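
A short sketch comparing the basic alternation with the extended one, assuming the extended pattern is meant to keep the [^cs]h alternative and add ^h next to it. The names basic and with_hs are illustrative:

```python
import re

basic = re.compile(r"([^sxyzaeiouh]|[^cs]h)s$")
with_hs = re.compile(r"([^sxyzaeiouh]|[^cs]h|^h)s$")

for word in ("cats", "maths", "churchs", "boxs", "hs"):
    print(word, bool(basic.search(word)), bool(with_hs.search(word)))
```

The two patterns should agree on every word here except hs, which only the extended pattern accepts.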

Note that the raw string notation, r"...", is unnecessary here (but harmless). It only helps when you have backslashes in the regexp, so that you don't have to escape them in the string notation.
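
For a case where the raw string does matter, consider a pattern containing \b. In an ordinary string literal "\b" is a single backspace character, so the regex engine never sees a word-boundary assertion. The sample text below is made up:

```python
import re

print(len("\b"), len(r"\b"))  # one backspace character vs. backslash + b

text = "two cats"
print(re.search("\bcats", text))    # None: looks for a literal backspace
print(re.search(r"\bcats", text))   # a match: \b is a word boundary
print(re.search("\\bcats", text))   # a match: same pattern, escaped by hand
```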

edited Nov 13, 2013 at 10:55

answered Nov 13, 2013 at 9:51

LarsH


2 Comments

Jo So Over a year ago

This matches "wst", for example. Shouldn't you add an end-of-string or end-of-word symbol to the end of the regex?

LarsH Over a year ago

@JoSo: You're right. I was just coming to that conclusion too.
