I'm currently trying to clean a 1-gram file. Some of the words are as follows:
1. word - basic word, classical case
2. word. - basic word but with a dot
3. w.s.f.w. - (word stands for word) - correct acronym
4. w.s.f.w - incorrect acronym (missing the last dot)
My current implementation uses two different RegExes because I haven't managed to combine them into one. The first RegEx recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on length: the longer the output, the better. My strategy works well for case no. 1: find_acronym_pattern requires a dot and so matches nothing, which leaves the output of find_word_pattern, word, as the longer one.
Case no. 2 is problematic because my approach produces word. (with the dot) but I need it to return word (without the dot). Currently this case is decided in favour of find_acronym_pattern, which produces the longer sequence.
Case no. 3 works as expected.
In case no. 4, find_acronym_pattern misses the last character, so it produces w.s.f., whereas find_word_pattern produces wsfw.
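The behaviour described for the four cases can be reproduced with a short script. This is only a minimal sketch: the helper name clean_current is hypothetical, and picking the longer result with max(..., key=len) is an assumption about how "the longer the better" is applied.

```python
import re

# The two patterns exactly as given in the question.
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)


def clean_current(input_word):
    """Hypothetical helper: run both patterns and keep the longer output."""
    candidates = [
        "".join(re.findall(pattern, input_word))
        for pattern in (find_word_pattern, find_acronym_pattern)
    ]
    # Assumption: "the longer the better" means taking the longest candidate.
    return max(candidates, key=len)


for token in ["word", "word.", "w.s.f.w.", "w.s.f.w"]:
    print(repr(token), "->", repr(clean_current(token)))
# 'word'     -> 'word'
# 'word.'    -> 'word.'     (dot kept, not wanted)
# 'w.s.f.w.' -> 'w.s.f.w.'
# 'w.s.f.w'  -> 'w.s.f.'    (last letter lost)
```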
I'm looking for a RegEx (preferably one instead of two that are currently used) that:
- given word returns word
- given word. returns word
- given w.s.f.w. returns w.s.f.w.
- given w.s.f.w returns w.s.f.w.
- given m.in returns m.in.
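One possible way to get the mapping listed above with a single pattern is sketched below. It rests on an assumption that the question does not state outright: a token with a dot between letters is treated as an acronym or abbreviation and gets a trailing dot, while a token without an inner dot loses its trailing dot. The names TOKEN_PATTERN and normalize_token are made up for illustration, and the letter class is kept as [A-Za-z] to mirror the patterns in the question.

```python
import re

# Letters, optionally followed by ".letters" groups, then an optional final dot.
# Group 1 captures everything except that optional final dot.
TOKEN_PATTERN = re.compile(r'([A-Za-z]+(?:\.[A-Za-z]+)*)\.?')


def normalize_token(token):
    """Hypothetical helper implementing the rule assumed in the lead-in."""
    match = TOKEN_PATTERN.fullmatch(token)
    if match is None:
        # Assumption: tokens that do not fit the shape are returned unchanged.
        return token
    core = match.group(1)
    # An inner dot marks an acronym/abbreviation: make sure it ends with a dot.
    return core + "." if "." in core else core


for token in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
    print(repr(token), "->", repr(normalize_token(token)))
```

This uses one regular expression, but the decision about the final dot is still made in Python rather than inside the pattern itself.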
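[A-Za-z] is still only the 26 letters from A-Z in each case; the UNICODE flag has no bearing on what an explicit range like this matches. The Unicode category \p{L} would be more appropriate if you mean "all letters" (note that Python's built-in re module does not support \p{L}; the third-party regex module does).
re.compile(r'([A-Za-zęóąśłżźćńĘÓĄŚŁŻŹĆŃ]+(?:\.))', flags=re.UNICODE)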
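The point about character classes can be checked directly. The sketch below assumes Python 3 and only the standard re module; the variable names are illustrative. It uses [^\W\d_] as a rough stand-in for "any letter", which is an approximation and not an exact equivalent of \p{L}.

```python
import re

# The Polish word "żółć", written with escapes so the code points are unambiguous.
sample = "\u017c\u00f3\u0142\u0107"

# An explicit A-Z range stays ASCII-only, with or without re.UNICODE.
ascii_only = re.findall(r'[A-Za-z]+', sample, flags=re.UNICODE)

# Word characters minus digits and underscore: roughly "letters" for str patterns.
unicode_letters = re.findall(r'[^\W\d_]+', sample)

print(ascii_only)       # []
print(unicode_letters)  # ['żółć']
```

Listing the extra letters by hand, as in the pattern with ęóąśłżźćń above, also works when the alphabet is known in advance.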