How to 'skip' specific words using regex in python?

Question

I am writing a program which takes the explanation of a German idiom (from Wikipedia, for example) and extracts the idiom, its meaning and any additional information.

For example, the bolded text is supposed to be matched:

** Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Zur Etymologie des Wortes „Backfisch“ für unreife Mädchen siehe dort. (Sprichwort um 1900: „Mit 14 Jahr’n und sieben Wochen ist der Backfisch ausgekrochen.“[6]).

Basically, the phrase starts after the dash (–) and ends before the first full stop, i.e. it is only one sentence. However, I want to skip abbreviations such as bzw., z. B., u. A., etc., since they do not mark the end of the sentence.

I am unsure how to skip the word, but still match it. Also, as I said, I want to skip frequently used abbreviations in German such as the aforementioned ones in italics.

I already tried matching a structure beginning with – and ending with ., where the . must not be preceded by bzw. However, I did not succeed in doing that.

"since they do not mark the end of the sentence" - do they never mark the end of a sentence?
– iakobski, Commented Aug 18, 2019 at 18:33
@iakobski No, they don't. It's the same as in English, where i.e. or e.g. would never be the last word in a sentence.
– Daka, Commented Aug 18, 2019 at 19:24
@UnbearableLightness This is a trivial answer, but I just do, because I know German. I am not using any programming logic for that.
– Daka, Commented Aug 18, 2019 at 19:36
You can't parse English, German, French or any language with regex. It is utterly impossible.
– user557597, Commented Aug 18, 2019 at 21:31

vs97 · Accepted Answer · 2019-08-19 13:48:00Z

6

Use a non-capturing group. Take a look:

(?<=– )(?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^\.]+

Regex Demo: in the top-right panel you can see a description of the individual regex components.

(?<=– )                    lookbehind: start right after "– " (dash + whitespace) without including it in the match
(?:.+)?                    optionally match any text before the abbreviation (non-capturing group)
(?:bzw\.|z\. B\.|u\. a\.)  match one of the abbreviations (non-capturing group); the dots are escaped via \.
[^\.]+                     match everything up to the next full stop

Essentially, the idea is to start right after the – character and the whitespace, without matching them. The pattern then matches any text up to an abbreviation, the abbreviation itself, and everything after it up to the next dot. Because the abbreviation's dot is consumed as part of the abbreviation group (which is non-capturing, notice the ?:), we "skip" it and continue until the dot that ends the sentence. You can expand the abbreviations list by adding more abbreviations via the | symbol.
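As a minimal sketch of how this pattern could be used from Python (the sample string is a shortened version of the sentence in the question, and the abbreviation is assumed to be spelled "z. B."):

```python
import re

# Pattern from this answer; "z. B." is written with its usual capitalisation.
PATTERN = re.compile(r"(?<=– )(?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^.]+")

text = (
    "Sich wie ein Backfisch benehmen – albern bzw. unreif sein. "
    "Zur Etymologie des Wortes „Backfisch“ für unreife Mädchen siehe dort."
)

match = PATTERN.search(text)
meaning = match.group(0) if match else None
print(meaning)  # albern bzw. unreif sein

# The pattern requires at least one listed abbreviation, so a sentence
# without any of them does not match at all.
no_abbreviation = "Sich wie ein Backfisch benehmen – albern oder unreif sein."
print(PATTERN.search(no_abbreviation))  # None
```

Because .+ is greedy, the match extends to the last listed abbreviation on the line, so this works best when each idiom entry is processed on its own.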

Bonus:

If you are anticipating that you will not always start with the – sequence, you can do the following:

(?:– |: )((?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^\.]+)

This allows the regex to also work with a : character instead of –, for example, but you will need to retrieve the result as group 1.
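A short sketch of retrieving group 1 in Python; the helper name extract_meaning and the second sample line are made up for illustration:

```python
import re

# Variant without lookbehind: the separator is matched, the phrase is group 1.
PATTERN = re.compile(r"(?:– |: )((?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^.]+)")

def extract_meaning(line):
    """Return the phrase after '– ' or ': ', or None if nothing matches."""
    m = PATTERN.search(line)
    return m.group(1) if m else None

dash_line = "Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Siehe dort."
colon_line = "Bedeutung: albern bzw. unreif sein. Siehe dort."

print(extract_meaning(dash_line))   # albern bzw. unreif sein
print(extract_meaning(colon_line))  # albern bzw. unreif sein
```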

edited Aug 19, 2019 at 13:48

answered Aug 18, 2019 at 20:02

vs97


2 Comments

Daka Over a year ago

Thank you! To further complicate the situation, let's say the sentence is "albern oder unreif sein" (replaced the bzw. with oder). I tried modifying the regex, so that it captures the sentence, but did not succeed, is it possible to do that?

vs97 Over a year ago

@Daka sure, just add oder using | -> ...(?:oder|bzw\.|z\. B\.|u\. a\.)... regex101.com/r/9BrekC/1

marc_s · Accepted Answer · 2019-10-02 05:25:01Z

1

Abbreviations are a real problem in German; I ran into it when working on German texts, too. Did you try using a German parser to split your text into phrases/sentences? Try one, it may help. In Python you have NLTK and also the Stanford parser, for example.

In English or French one may say that the end of a sentence is marked by a full stop followed by a space and a capital letter. However, this will not work for German, as nouns are capitalized.
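To illustrate the point, here is a small sketch of that naive heuristic applied to a made-up German sentence; the pattern name NAIVE_BOUNDARY is just an assumption for this example:

```python
import re

# Naive rule: a sentence ends at a full stop followed by whitespace
# and an uppercase letter.
NAIVE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-ZÄÖÜ])")

text = "Er kauft Brot bzw. Brötchen. Danach geht er heim."
parts = NAIVE_BOUNDARY.split(text)

# The capitalized noun "Brötchen" after "bzw." triggers a false split.
print(parts)  # ['Er kauft Brot bzw.', 'Brötchen.', 'Danach geht er heim.']
```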

On the other hand, as you mention "frequently used abbreviations": if they are so frequent, why not collect them in a dictionary and use it to skip them in the text?
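One possible way to sketch that idea in Python is to build the pattern from a list of abbreviations. The list below is a hypothetical, incomplete sample, and first_sentence is an illustrative helper name, not part of any library:

```python
import re

# Hypothetical, incomplete list of abbreviations; extend as needed.
ABBREVIATIONS = ["bzw.", "z. B.", "u. a.", "d. h.", "usw."]

# Escape each entry and try longer abbreviations first.
_abbr = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)

# After "– ", repeatedly consume either a whole abbreviation
# or any single character that is not a full stop.
SENTENCE = re.compile(rf"(?<=– )(?:{_abbr}|[^.])+")

def first_sentence(line):
    """Return the text between '– ' and the first sentence-ending dot."""
    m = SENTENCE.search(line)
    return m.group(0).strip() if m else None

print(first_sentence("Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Siehe dort."))
# albern bzw. unreif sein
print(first_sentence("Sich wie ein Backfisch benehmen – albern oder unreif sein. Siehe dort."))
# albern oder unreif sein
```

Unlike a pattern that requires an abbreviation, this one also matches sentences that contain none. It still assumes a listed abbreviation never ends a sentence, as stated in the question's comments.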

edited Oct 2, 2019 at 5:25

marc_s

answered Aug 18, 2019 at 18:43

Catalina Chircu

2 Comments

Daka Over a year ago

Thank you for the idea! I tried using NLTK package and tokenizing the text and thereby splitting it into sentences. However, the tokenizer falsely recognises the . after bzw as the end of the sentence.

Catalina Chircu Over a year ago

Try Stanford parser too.
