Python regex for multiple and single dots

Question

I'm currently trying to clean a 1-gram file. Some of the words are as follows:

word - basic word, classical case
word. - basic word but with a dot
w.s.f.w. - (word stands for word) - correct acronym
w.s.f.w - incorrect acronym (missing the last dot)

My current implementation uses two different RegExes because I haven't succeeded in combining them into one. The first RegEx recognises basic words:

find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)

The second one is used in order to recognise acronyms:

find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)

Let's say that I have an input_word as a sequence of characters. The output is obtained with:

"".join(re.findall(pattern, input_word))

Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1, where find_word_pattern returns word and find_acronym_pattern finds nothing to match.

Case no. 2 is problematic because my approach produces word. (with the dot) but I need it to return word (without the dot). Currently the case is decided in favour of find_acronym_pattern, which produces the longer sequence.

Case no. 3 works as expected.

Case no. 4: find_acronym_pattern misses the last character, meaning that it produces w.s.f. whereas find_word_pattern produces wsfw.
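The selection strategy described above can be sketched as follows. This is only an illustration of the approach as described: the helper name pick_longest is hypothetical, ties are resolved in favour of find_word_pattern (an assumption), and flags=re.UNICODE is left out because it is already the default for str patterns in Python 3.

```python
import re

find_word_pattern = re.compile(r'[A-Za-z]')
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))')


def pick_longest(input_word):
    """Join each pattern's matches and keep the longer string (hypothetical helper)."""
    candidates = [
        "".join(pattern.findall(input_word))
        for pattern in (find_word_pattern, find_acronym_pattern)
    ]
    # max() returns the first candidate on a tie, i.e. the plain-word result.
    return max(candidates, key=len)


for word in ("word", "word.", "w.s.f.w.", "w.s.f.w"):
    print(word, "->", pick_longest(word))
# word -> word
# word. -> word.
# w.s.f.w. -> w.s.f.w.
# w.s.f.w -> w.s.f.
```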

I'm looking for a RegEx (preferably one instead of two that are currently used) that:

given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.

[A-Za-z] is still only the 26 letters from A-Z; the UNICODE flag has no bearing on what character classes mean. The Unicode category \p{L} would be more appropriate if you mean "all letters".
– Tomalak, Commented Apr 27, 2019 at 17:11
That's my mistake, sorry. Since I'm dealing with Polish in my original problem, I'm using this regex: re.compile(r'([A-Za-zęóąśłżźćńĘÓĄŚŁŻŹĆŃ]+(?:\.))', flags=re.UNICODE)
– balkon16, Commented Apr 27, 2019 at 17:18
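One caveat on the comments above: Python's built-in re module does not accept \p{L} (the third-party regex package does). A rough standard-library alternative, shown here only as a sketch, is the class [^\W\d_], which matches word characters other than decimal digits and the underscore and therefore covers the Polish letters without listing them. The variable names and the sample string are made up for illustration.

```python
import re

ascii_letters = re.compile(r"[A-Za-z]+")
# Word characters minus decimal digits and the underscore; for str patterns
# in Python 3 this includes non-ASCII letters such as the Polish ones.
unicode_letters = re.compile(r"[^\W\d_]+")

sample = "żółć word2 m.in"
print(ascii_letters.findall(sample))    # ['word', 'm', 'in']
print(unicode_letters.findall(sample))  # ['żółć', 'word', 'm', 'in']
```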

alexis · Accepted Answer · 2019-04-27 17:17:31Z

2

A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:

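import re

# Match a word plus any number of ".part" suffixes; a trailing period is never captured.
found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
    found += "."  # embedded periods mean an abbreviation, so restore the final one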

As you can see, I match a word plus any number of ".part" suffixes. Like your version, this matches not only single-letter acronyms but also longer abbreviations like Ph.D., Prof.Dr., or whatever.
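For reference, the same idea can be wrapped in a function and run against the five inputs from the question. This is a sketch rather than part of the original answer: the name normalize is hypothetical, re.search stands in for findall(...)[0], and returning an empty string when nothing matches is an assumption about what the caller wants.

```python
import re

TOKEN = re.compile(r"\w+(?:\.\w+)*")


def normalize(input_word):
    """Drop a trailing period, then restore it if the token has inner periods."""
    match = TOKEN.search(input_word)
    if match is None:
        return ""  # assumption: empty string when there is nothing to match
    found = match.group(0)
    if "." in found:
        found += "."
    return found


for word in ("word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"):
    print(word, "->", normalize(word))
# word -> word
# word. -> word
# w.s.f.w. -> w.s.f.w.
# w.s.f.w -> w.s.f.w.
# m.in -> m.in.
```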

edited Apr 27, 2019 at 17:17

answered Apr 27, 2019 at 17:12

41686d6564 · Accepted Answer · 2019-04-27 18:13:59Z

1

If you want one regex, you can use something like this:

((?:[A-Za-z](\.))*[A-Za-z]+)\.?

And substitute with:

\1\2

Python 3 example:

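import re

# Group 1: the word or acronym without its final dot.
# Group 2: an inner dot, captured only when the token is an acronym.
regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = r"\1\2"

# An unmatched group 2 is substituted as an empty string (Python 3.5+).
# The pattern has no anchors, so re.MULTILINE is not needed.
result = re.sub(regex, subst, test_str)

if result:
    print(result)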

Output:

word
word
w.s.f.w.
w.s.f.w.
m.in.
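To see why the \1\2 substitution works, it may help to look at the two groups directly. The snippet below is an illustrative sketch (the helper name groups_for is made up): group 2 only participates in the match when the token contains a single letter followed by a dot, so it is None for plain words and "." for acronyms.

```python
import re

pattern = re.compile(r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?")


def groups_for(token):
    """Return (group 1, group 2) for a token that the pattern matches in full."""
    match = pattern.fullmatch(token)
    return match.group(1), match.group(2)


print(groups_for("word."))    # ('word', None)
print(groups_for("w.s.f.w"))  # ('w.s.f.w', '.')
print(groups_for("m.in"))     # ('m.in', '.')
```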

edited Apr 27, 2019 at 18:13

answered Apr 27, 2019 at 17:15

4 Comments

balkon16 Over a year ago

Great answer! :) However, it turns out that inputs such as word- are returned with a trailing hyphen. Is it possible to tweak the RegEx (for the sake of beauty, to be honest) and get rid of the trailing hyphen?

41686d6564 Over a year ago

@balkon16 Well, there are two things. 1) I actually made a typo in the regex (corrected now). 2) a hyphen is simply ignored as you never mentioned it in your question. To get rid of the trailing hyphen, you can simply add -? at the end of the fixed pattern above. Here's a demo. Or, if you want, you can add something like [^\w\s]? instead to remove any trailing non-word and non-whitespace character.

balkon16 Over a year ago

Your answer still works with hyphenated words :D If I use the -? suggestion I lose the ability to recognise hyphenated words (e.g. word-word is recognised as wordword). I guess I'm gonna use your answer as it is and check for the trailing hyphen with output[-1] == "-"

41686d6564 Over a year ago

@balkon16 That's because a hyphen is not a word character and you never mentioned it in your post :-) That being said, if you want to treat word-word as one word and yet get rid of the trailing hyphen, you may use ((?:[A-Za-z](\.))*[A-Za-z]+(?:-[A-Za-z]+)*)-?\.?. Here's a demo: regex101.com/r/oiDQRy/5
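A quick way to check the hyphen-aware pattern from the last comment is to run it through re.sub with the same \1\2 replacement. This is a sketch for illustration only; the names HYPHEN_AWARE and clean are hypothetical.

```python
import re

HYPHEN_AWARE = re.compile(
    r"((?:[A-Za-z](\.))*[A-Za-z]+(?:-[A-Za-z]+)*)-?\.?"
)


def clean(token):
    """Keep inner hyphens, drop a trailing hyphen, and fix the final dot."""
    return HYPHEN_AWARE.sub(r"\1\2", token)


for token in ("word-", "word-word", "word-word-", "word.", "w.s.f.w", "m.in"):
    print(token, "->", clean(token))
# word- -> word
# word-word -> word-word
# word-word- -> word-word
# word. -> word
# w.s.f.w -> w.s.f.w.
# m.in -> m.in.
```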
