I am writing a program that takes the explanation of a German idiom (from Wikipedia, for example) and extracts the idiom, its meaning, and any additional information.
For example, in the following entry the meaning ("albern bzw. unreif sein") is the part that is supposed to be matched:
** Sich wie ein Backfisch benehmen – albern bzw. unreif sein. Zur Etymologie des Wortes „Backfisch“ für unreife Mädchen siehe dort. (Sprichwort um 1900: „Mit 14 Jahr’n und sieben Wochen ist der Backfisch ausgekrochen.“[6]).
Basically, the phrase starts after the dash (–) and ends before the first full stop, i.e. it is exactly one sentence. However, I want to skip abbreviations such as bzw., z. B., u. A., etc., since their dots do not mark the end of the sentence.
I am unsure how to skip over such an abbreviation while still including it in the match. The same should apply to other frequently used German abbreviations like the ones mentioned above.
I already tried matching a structure that begins with – and ends with ., where the . must not be preceded by bzw. However, I did not succeed in doing that.
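To make the goal concrete, here is a minimal sketch of the kind of pattern I have in mind, assuming Python's built-in re module. The abbreviation list, the group names (idiom, meaning, extra) and the helper parse_entry are placeholders I made up for illustration. Instead of forbidding bzw. before the dot with a lookbehind, the sketch lets the meaning consist of either a whole abbreviation or any single non-dot character, so the first dot that is not part of an abbreviation ends the match.

```python
import re

# Abbreviations whose dots must not end the sentence.
# Assumed, incomplete list; extend it as needed.
ABBREVIATIONS = [r"bzw\.", r"z\.\s?B\.", r"u\.\s?[Aa]\."]

ENTRY = re.compile(
    r"""
    ^\W*                              # skip leading list markers or spaces
    (?P<idiom>.+?)                    # the idiom itself
    \s+[–-]\s+                        # separating dash (en dash or hyphen)
    (?P<meaning>(?:\b(?:%s)|[^.])+?)  # whole abbreviations or non-dot characters
    \.                                # first full stop outside an abbreviation
    \s*(?P<extra>.*)$                 # everything else is additional information
    """ % "|".join(ABBREVIATIONS),
    re.VERBOSE | re.DOTALL,
)


def parse_entry(line):
    """Return a dict with 'idiom', 'meaning' and 'extra', or None if no match."""
    match = ENTRY.match(line)
    return match.groupdict() if match else None


sample = (
    "** Sich wie ein Backfisch benehmen – albern bzw. unreif sein. "
    "Zur Etymologie des Wortes „Backfisch“ für unreife Mädchen siehe dort."
)
entry = parse_entry(sample)
print(entry["idiom"])    # Sich wie ein Backfisch benehmen
print(entry["meaning"])  # albern bzw. unreif sein
```

The \b in front of the abbreviations is there so that a sentence ending in a word like "kurz." is not mistaken for the start of "z. B.". One limitation I can see: a sentence that genuinely ends with one of the listed abbreviations would not be terminated at that dot.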
