4

I have a regex that matches all three-character words in a string:

\b[^\s]{3}\b

When I use it with the string:

And the tiger attacked you.

this is the result:

regex = re.compile(r"\b[^\s]{3}\b")
regex.findall(string)
[u'And', u'the', u'you']

As you can see, it matches "you" as a three-character word, but I want the expression to treat "you." (including the ".") as a four-character word.

I have the same problem with ",", ";", ":", etc.

I'm pretty new to regex, but I guess this happens because those characters are treated as word boundaries.
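
That guess can be checked directly. The following is a minimal sketch, assuming Python 3 and the sample sentence above (the variable names are only illustrative). `\b` matches at any position between a `\w` character and a non-`\w` character, so there is a boundary between the "u" and the "." in "you.". The pattern is written as a raw string, because in an ordinary string literal Python reads `\b` as a backspace character.

```python
import re

text = "And the tiger attacked you."

# Raw string so that \b reaches the regex engine as a word-boundary assertion.
pattern = re.compile(r"\b[^\s]{3}\b")
matches = pattern.findall(text)
print(matches)

# \b sits between "u" (a \w character) and "." (not a \w character).
boundary_before_period = re.search(r"u\b\.", text) is not None
print(boundary_before_period)
```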

Is there a way of doing this?

Thanks in advance,

EDIT

Thanks to the answers of @BrenBarn and @Kendall Frey I managed to get to the regex I was looking for:

(?<!\w)[^\s]{3}(?=$|\s)
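
A quick check of this pattern against the sample sentence, as a sketch in Python 3 (the second input string is a made-up example, not taken from the question). Because the lookbehind is `(?<!\w)` rather than a test for whitespace, a three-character run that follows punctuation inside a token can still match. `(?<!\S)` is one possible stricter variant if that matters.

```python
import re

pattern = re.compile(r"(?<!\w)[^\s]{3}(?=$|\s)")

# "you." is four non-space characters, so it is skipped.
sample = pattern.findall("And the tiger attacked you.")
print(sample)

# Hypothetical input: "bcd" follows a comma inside a longer token.
tricky = "wait,bcd end"
loose = pattern.findall(tricky)
strict = re.findall(r"(?<!\S)[^\s]{3}(?=$|\s)", tricky)
print(loose, strict)
```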
6
  • 3
    It obviously won't match a 4-character anything if you tell it it must match exactly 3 characters. What exactly are the rules you want to use to decide if/when to match a fourth character? Commented May 2, 2013 at 19:18
  • 1
    I don't want it to match; I just want "you." to be treated as a 4-char word so it doesn't match the regex. Commented May 2, 2013 at 19:23
  • What characters do you want to count as word boundaries? Commented May 2, 2013 at 19:24
  • Just blank spaces and ends of line Commented May 2, 2013 at 19:27
  • Can you please accept an answer? Also, why are you using \Z and not $? I think they will do the same thing in this case, but $ is more recognizable. Commented May 2, 2013 at 21:00

3 Answers 3

3

If you want to make sure the word is preceded and followed by whitespace (and not by a period, as happens in your case), use lookarounds.

(?<=\s)\w{3}(?=\s)

If you need it to match punctuation as part of words (such as 'in.'), then \w won't be adequate, and you can use \S (anything but whitespace):

(?<=\s)\S{3}(?=\s)
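
A small sketch of the difference between the two patterns, assuming Python 3 and a made-up test sentence (not from the question). Both versions require whitespace on each side, so tokens at the very start or end of the string are not matched.

```python
import re

text = "I am in. the hut now"

# \w{3}: only letters, digits and underscore count, so "in." is not matched.
word_only = re.findall(r"(?<=\s)\w{3}(?=\s)", text)
print(word_only)

# \S{3}: any non-whitespace counts, so "in." is a three-character token.
non_space = re.findall(r"(?<=\s)\S{3}(?=\s)", text)
print(non_space)
```

In both results "now" is missing, because it is followed by the end of the string rather than by whitespace.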

5 Comments

He clarified in a comment that he doesn't want to match the punctuation; rather, he wants the period to be counted as a word character so it prevents the "word" you. from matching (because it is more than three characters).
Your example still won't work, because \w will not match periods.
Thanks guys!! I found the solution! I didn't know about lookarounds.
This regex requires words to always have whitespace round them, so it won't match your first and last word.
you could probably fix this by using (?<=\s|^) and (?=\s|$) for the lookarounds though
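
A note on that last suggestion, as a hedged sketch in Python 3: as far as I know, the standard `re` module only accepts fixed-width lookbehinds, so `(?<=\s|^)` is rejected at compile time (one branch is one character wide, the other zero). One possible alternative, which is my own assumption and not something proposed in the thread, is to use negative lookarounds on `\S`, which also succeed at the start and end of the string.

```python
import re

# The variable-width lookbehind is expected to fail to compile in `re`.
try:
    re.compile(r"(?<=\s|^)\S{3}(?=\s|$)")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)
print(lookbehind_error)

# "Not preceded by a non-space, not followed by a non-space".
edge_aware = re.compile(r"(?<!\S)\S{3}(?!\S)")
result = edge_aware.findall("And the tiger attacked you.")
print(result)
```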
1

As described in the documentation:

A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.

So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., restrict the match but not be included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
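
To illustrate why the zero-width version can matter, here is a minimal sketch in Python 3 with a made-up input string. When the surrounding spaces are part of the match they are consumed, so a word that shares a space with the previous match can be skipped. With lookarounds the spaces are only checked, not consumed.

```python
import re

text = "a the cat sat b"

# Spaces are part of each match; the space after "the" is used up,
# so "cat" has no leading space left to match against.
consuming = re.findall(r"\s[^\s]{3}\s", text)
print(consuming)

# Lookarounds only assert the spaces, so adjacent words all match.
zero_width = re.findall(r"(?<=\s)[^\s]{3}(?=\s)", text)
print(zero_width)
```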


1

This would be my approach. It also matches words that come right after punctuation.

import re

r = r'''
        \b                   # word boundary
        (                    # capturing parentheses
            [^\s]{3}         # anything but whitespace 3 times
            \b               # word boundary
            (?=[^\.,;:]|$)   # don't allow . or , or ; or : after word boundary but allow end of string
        |                    # OR
            [^\s]{2}         # anything but whitespace 2 times
            [\.,;:]          # a . or , or ; or :
        )
    '''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'

print(re.findall(r, s, re.X))

output:

['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']

