I am writing a program that takes the explanation of a German idiom (from Wikipedia, for example) and extracts the idiom, its meaning, and any additional information.

Example (the part that should be matched is the phrase after the dash, "albern bzw. unreif sein"):

** Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Zur Etymologie des Wortes „Backfisch“ für unreife Mädchen siehe dort. (Sprichwort um 1900: „Mit 14 Jahr’n und sieben Wochen ist der Backfisch ausgekrochen.“[6]).

Basically, the phrase starts after the dash (–) and ends before the first full stop, i.e. it is only one sentence. However, I want to skip abbreviations such as bzw., z. B., u. a., etc., since they do not mark the end of the sentence.

I am unsure how to skip over such an abbreviation while still including it in the match. As I said, I want to skip frequently used German abbreviations such as the ones mentioned above.

I already tried matching a structure that begins with – and ends with ., where the . must not be preceded by bzw. However, I did not succeed in doing that.

  • How would one determine what is an abbreviation or not? Commented Aug 18, 2019 at 18:23
  • "since they do not mark the end of the sentence" - do they never mark the end of a sentence? Commented Aug 18, 2019 at 18:33
  • @iakobski No, they don't. It's the same as in English, where i.e. or e.g. would never be the last word in a sentence. Commented Aug 18, 2019 at 19:24
  • @UnbearableLightness This is a trivial answer, but I just do, because I know German. I am not using any programming logic for that. Commented Aug 18, 2019 at 19:36
  • You can't parse English, German, French or any language with regex. It is utterly impossible. Commented Aug 18, 2019 at 21:31

2 Answers

Use a non-capturing group. Take a look:

(?<=– )(?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^\.]+

Regex Demo (in the top right you can see a description of the individual regex components).

(?<=– )                    lookbehind: start right after the – character and a space, without including them in the match
(?:.+)?                    optionally match any text before the abbreviation (non-capturing group)
(?:bzw\.|z\. B\.|u\. a\.)  match one of the abbreviations (non-capturing group); the dots are escaped via \.
[^\.]+                     match everything up to the next full stop

Essentially, the idea is to start right after the – character and the following space, without including them in the match. Then match any following text, the abbreviation, and everything up to the next dot. Because the abbreviation's dot is consumed explicitly by the abbreviation group, it does not end the match, so we "skip" it and continue until the dot that ends the sentence. Note that ?: only keeps these groups from being stored as numbered groups; their text is still part of the overall match. You can extend the abbreviation list by adding more alternatives with the | symbol.
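As a rough illustration, the pattern can be tried in Python with the standard re module. This is only a sketch: the variable names are made up for the example, the sample text is the one from the question (shortened), and z. B. is spelled the way the question writes it.

```python
import re

# Sample text from the question (shortened after the second sentence).
text = (
    "Sich wie ein Backfisch benehmen – albern bzw. unreif sein. "
    "Zur Etymologie des Wortes „Backfisch“ für unreife Mädchen siehe dort."
)

# Lookbehind for "– ", optional leading text, one of the listed abbreviations,
# then everything up to the next full stop.
pattern = re.compile(r"(?<=– )(?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^.]+")

match = pattern.search(text)
meaning = match.group(0) if match else None
print(meaning)  # albern bzw. unreif sein
```

Two caveats follow from how the pattern is built: it only matches when one of the listed abbreviations is present, and because .+ is greedy, a line that contains several listed abbreviations is matched up to the last one, which may span more than one sentence.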

Bonus:

If you anticipate that the phrase will not always be introduced by the – sequence, you can do the following:

(?:– |: )((?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^\.]+)

This allows the regex to also work with the : character instead of –, for example, but you will need to retrieve the result as group 1.
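A small Python sketch of reading the result from group 1 might look like this. The second sample, with a colon instead of the dash, is a made-up variant of the question's text, and the variable names are assumptions for illustration.

```python
import re

# Either "– " or ": " introduces the phrase; the phrase itself is group 1.
pattern = re.compile(r"(?:– |: )((?:.+)?(?:bzw\.|z\. B\.|u\. a\.)[^.]+)")

samples = [
    "Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Zur Etymologie siehe dort.",
    # Hypothetical variant that uses a colon instead of the dash.
    "Sich wie ein Backfisch benehmen: albern bzw. unreif sein. Zur Etymologie siehe dort.",
]

results = []
for sample in samples:
    match = pattern.search(sample)
    # group(0) would include the introducing "– " or ": ", so use group(1).
    results.append(match.group(1) if match else None)

print(results)  # ['albern bzw. unreif sein', 'albern bzw. unreif sein']
```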

Regex Demo

2 Comments

Thank you! To further complicate the situation, let's say the sentence is "albern oder unreif sein" (replaced the bzw. with oder). I tried modifying the regex, so that it captures the sentence, but did not succeed, is it possible to do that?
@Daka Sure, just add oder using |: ...(?:oder|bzw\.|z\. B\.|u\. a\.)... regex101.com/r/9BrekC/1

Abbreviations are a known problem in German; I ran into it too when working on German texts. Did you try using a German parser to split your text into phrases/sentences? Try one, it may help. In Python you have NLTK and the Stanford parser, for example.

In English or French, one could say that the end of a sentence is marked by a full stop followed by a space and a capital letter. However, this will not work for German, because nouns are capitalized.
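To make the problem concrete, here is a minimal Python sketch of that naive rule. The sample sentence is invented for the illustration, and the regex is just one possible way to write the heuristic.

```python
import re

# Naive rule: a sentence ends at a full stop that is followed by
# whitespace and a capital letter.
naive_boundary = re.compile(r"(?<=\.)\s+(?=[A-ZÄÖÜ])")

# Invented sample: the abbreviation "z. B." is followed by a capitalized noun.
text = "Er mag Obst, z. B. Äpfel und Birnen. Sie mag Gemüse."

pieces = naive_boundary.split(text)
print(pieces)
# ['Er mag Obst, z.', 'B.', 'Äpfel und Birnen.', 'Sie mag Gemüse.']
```

The text contains two sentences, but the rule produces four pieces, because "z. B." followed by the noun "Äpfel" looks exactly like a sentence boundary.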

On the other hand, since you mention "frequently used abbreviations": if they are so frequent, why not collect them in a dictionary and use it to skip them in the text?
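One way this idea could be sketched in Python is to build the regex from a list of abbreviations, so that each step consumes either a whole abbreviation or a single character that is not a dot. The list contents and the function names below are assumptions for illustration, not an established API, and the sketch relies on the premise from the question that these abbreviations never end a sentence.

```python
import re

# Assumed starter list; extend it with further abbreviations as needed.
ABBREVIATIONS = ["bzw.", "z. B.", "u. a."]

def build_meaning_pattern(abbreviations):
    """Build a regex for the sentence after "– " that steps over listed abbreviations."""
    # Longest first, so a longer abbreviation wins over a shorter one that is its prefix.
    escaped = [re.escape(a) for a in sorted(abbreviations, key=len, reverse=True)]
    # Each repetition consumes a whole abbreviation or one non-dot character.
    return re.compile(r"(?<=– )(?:" + "|".join(escaped) + r"|[^.])+")

def extract_meaning(text, pattern):
    """Return the first matched phrase, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(0) if match else None

pattern = build_meaning_pattern(ABBREVIATIONS)

with_abbreviation = extract_meaning(
    "Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Zur Etymologie siehe dort.",
    pattern,
)
without_abbreviation = extract_meaning(
    "Sich wie ein Backfisch benehmen – albern oder unreif sein. Zur Etymologie siehe dort.",
    pattern,
)

print(with_abbreviation)     # albern bzw. unreif sein
print(without_abbreviation)  # albern oder unreif sein
```

Unlike a pattern that requires an abbreviation to be present, this variant also handles sentences without any abbreviation, and it stops at the first full stop that is not part of a listed abbreviation.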

2 Comments

Thank you for the idea! I tried the NLTK package, tokenizing the text to split it into sentences. However, the tokenizer wrongly treats the . after bzw as the end of the sentence.
Try Stanford parser too.
