
I'm currently trying to clean a 1-gram file. Some of the words are as follows:

  1. word - basic word, classical case
  2. word. - basic word but with a dot
  3. w.s.f.w. - (word stands for word) - correct acronym
  4. w.s.f.w - incorrect acronym (missing the last dot)

My current implementation uses two different RegExes because I haven't succeeded in combining them into one. The first RegEx recognises basic words:

find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)

The second one is used in order to recognise acronyms:

find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)

Let's say that I have an input_word as a sequence of characters. The output is obtained with:

"".join(re.findall(pattern, input_word))

Then I choose which output to use based on length: the longer the output, the better. My strategy works for case no. 1, where find_word_pattern returns word and find_acronym_pattern matches nothing because it requires a dot.

Case no. 2 is problematic because my approach produces word. (with the dot) but I need it to return word (without the dot). Currently the case is decided in favour of find_acronym_pattern, which produces the longer sequence.

Case no. 3 works as expected.

Case no. 4: find_acronym_pattern misses the last character, so it produces w.s.f. whereas find_word_pattern produces wsfw.
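
For reference, the two-pattern strategy described above can be reproduced with a short sketch. The patterns are copied from the question; the helper name longest_output is hypothetical and only stands in for the "pick the longer output" step.

```python
import re

# The two patterns from the question, unchanged.
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)


def longest_output(input_word):
    """Hypothetical helper: join each pattern's matches, keep the longer result."""
    candidates = [
        "".join(pattern.findall(input_word))
        for pattern in (find_word_pattern, find_acronym_pattern)
    ]
    # max() keeps the first candidate when lengths are equal.
    return max(candidates, key=len)


for word in ("word", "word.", "w.s.f.w.", "w.s.f.w"):
    print(repr(word), "->", repr(longest_output(word)))
```

Running it shows the two problem cases: word. keeps its dot, and w.s.f.w comes back as w.s.f. with the last letter lost.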

I'm looking for a RegEx (preferably one instead of the two currently used) that:

  1. given word returns word

  2. given word. returns word

  3. given w.s.f.w. returns w.s.f.w.

  4. given w.s.f.w returns w.s.f.w.

  5. given m.in returns m.in.

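  • [A-Za-z] is still only the 26 letters from A-Z in each case; the UNICODE flag has no bearing on what an explicit character class like this one matches. The Unicode category \p{L} would be more appropriate if you mean "all letters", although Python's built-in re module does not support it (the third-party regex module does). Commented Apr 27, 2019 at 17:11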
  • That's my mistake, sorry. Since I'm dealing with Polish in my original problem I'm using this regex re.compile(r'([A-Za-zęóąśłżźćńĘÓĄŚŁŻŹĆŃ]+(?:\.))', flags=re.UNICODE) Commented Apr 27, 2019 at 17:18
  • In this case the UNICODE flag is redundant. Commented Apr 27, 2019 at 17:19
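
As a side note to the comments above: the standard re module has no \p{L}, so one possible workaround is to subtract digits and the underscore from \w. The sketch below is only an approximation of "any letter" (a few numeric characters such as ² still pass \w), and the names LETTER and acronym_pattern are hypothetical.

```python
import re

# Hypothetical building block: \w minus decimal digits and underscore.
# In Python 3, str patterns are Unicode-aware by default, so this covers
# Polish letters without listing them and without the UNICODE flag.
LETTER = r"[^\W\d_]"
acronym_pattern = re.compile(rf"{LETTER}+\.")

# The ASCII-only class from the question, for comparison.
ascii_pattern = re.compile(r"[A-Za-z]+\.")

sample = "m.in. żółw. np."
print(acronym_pattern.findall(sample))
print(ascii_pattern.findall(sample))
```

The ASCII-only pattern finds just w. inside żółw., which is the limitation the first comment points out.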

2 Answers

2

A regular expression can only return text that is present in the input, so no pattern on its own can satisfy requirements 4 and 5, which need a dot that the input lacks. What you can do is always drop the final period and add it back if the result contains embedded periods. That gives you the result you want, and it's pretty straightforward:

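import re

# search() returns None when there is nothing to match, so guard against it
match = re.search(r"\w+(?:\.\w+)*", input_word)
found = match.group(0) if match else ""
if "." in found:
    found += "."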

As you can see, I match a word plus any number of ".part" suffixes. Like your version, this matches not only single-letter acronyms but also longer abbreviations like Ph.D., Prof.Dr., or whatever. Note that \w also matches digits and the underscore.
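
A minimal sketch of this approach wrapped in a function and run against the five inputs from the question. The names ABBREVIATION and normalise are hypothetical, and returning an empty string when nothing matches is an assumption, not something the question specifies.

```python
import re

# A word followed by any number of ".part" suffixes.
ABBREVIATION = re.compile(r"\w+(?:\.\w+)*")


def normalise(input_word):
    """Hypothetical helper: drop the trailing dot, restore it for abbreviations."""
    match = ABBREVIATION.search(input_word)
    if match is None:
        return ""  # assumption: input without word characters yields ""
    found = match.group(0)
    # An embedded dot marks an abbreviation, which must end with a dot.
    return found + "." if "." in found else found


for word in ("word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"):
    print(word, "->", normalise(word))
```

The trailing dot is never part of the match, so whether the result ends with one depends only on whether the matched text contains a dot.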

1

If you want one regex, you can use something like this:

((?:[A-Za-z](\.))*[A-Za-z]+)\.?

And substitute with:

\1\2

Python 3 example:

import re

regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"

# The pattern has no ^ or $ anchors, so no count or flags argument is needed.
result = re.sub(regex, subst, test_str)

if result:
    print(result)

Output:

word
word
w.s.f.w.
w.s.f.w.
m.in.

4 Comments

Great answer! :) However, it turns out that inputs such as word- are returned with a trailing hyphen. Is it possible to tweak the regex (for the sake of beauty, to be honest) to get rid of the trailing hyphen?
@balkon16 Well, there are two things. 1) I actually made a typo in the regex (corrected now). 2) The hyphen is simply ignored because you never mentioned it in your question. To get rid of the trailing hyphen, you can simply add -? at the end of the fixed pattern above. Or, if you want, you can add something like [^\w\s]? instead to remove any trailing non-word, non-whitespace character.
Your answer still works with hyphenated words :D If I use the -? suggestion I lose the ability to recognise hyphenated words (e.g. word-word is recognised as wordword). I guess I'm gonna use your answer as it is and check for the trailing hyphen with output[-1] == "-"
@balkon16 That's because a hyphen is not a word character and you never mentioned it in your post :-) That being said, if you want to treat word-word as one word and yet get rid of the trailing hyphen, you may use ((?:[A-Za-z](\.))*[A-Za-z]+(?:-[A-Za-z]+)*)-?\.?. Here's a demo: regex101.com/r/oiDQRy/5
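
A short sketch of how the hyphen-aware pattern from the last comment might be used. The pattern is quoted from the comment; the names HYPHEN_AWARE and clean are hypothetical. It relies on Python 3.5+ replacing an unmatched group in the substitution template with an empty string.

```python
import re

# Pattern quoted from the comment above. Group 1 is the cleaned token,
# group 2 captures a dot only when the token contains letter-dot pairs.
HYPHEN_AWARE = re.compile(
    r"((?:[A-Za-z](\.))*[A-Za-z]+(?:-[A-Za-z]+)*)-?\.?"
)


def clean(token):
    """Hypothetical helper: apply the group 1 + group 2 substitution."""
    return HYPHEN_AWARE.sub(r"\1\2", token)


for token in ("word-", "word-word", "word-word.", "w.s.f.w", "m.in"):
    print(token, "->", clean(token))
```

Inner hyphens survive because they sit inside group 1, while a trailing hyphen or dot falls outside it and is dropped.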
