I'm trying to extract the titles of some tables from plain text with a regular expression in Python.
The plain text was exported from some PDF files and contains a lot of \ns. I tried to stop the match before the first appearance of the pattern \n \n\n, but the regex always returned some extra characters.
Here's an example.
The string was:
contents = '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44 \n \n\n \n\nKJK TechCen Rep # 5243 \n \n\n \n\n95 \n\nTable 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'
The regex I used was:
re.findall(r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+ [^ \n \n\n ]', contents)
I wanted the resulting string to start from 'Table XXX' and end right before the first ' \n \n\n ', like this:
'Table 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF '
But the actual string I got was:
'Table 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V'
So how could I modify the regex to get rid of the annoying '\n \n\n PressRel V'?
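The trailing [^ \n \n\n ] in your pattern is a negated character class: it matches exactly one character that is neither a space nor a newline, not the sequence ' \n \n\n '. That is why the match runs on until it finds a space followed by a non-whitespace character, which here is the ' V' after 'PressRel'. To stop before the sequence without consuming it, use a positive lookahead instead:
Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )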
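As a quick check, here is a minimal sketch that runs the lookahead pattern against the string from the question. The variable names (pattern_greedy, pattern_lazy, pattern_keep_space) are just illustrative, and the lazy and keep-space variants are my own additions, not something the question asked for.

```python
import re

contents = '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44 \n \n\n \n\nKJK TechCen Rep # 5243 \n \n\n \n\n95 \n\nTable 3.1: Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'

# Lookahead version: (?= \n \n\n ) asserts that the delimiter follows
# the match, but does not include it in the result.
pattern_greedy = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )'
titles = re.findall(pattern_greedy, contents)

# Variant (assumption: you want the FIRST delimiter if it can occur more
# than once inside the allowed characters): make the class quantifier lazy.
pattern_lazy = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+?(?= \n \n\n )'
titles_lazy = re.findall(pattern_lazy, contents)

# Variant that keeps the trailing space after 'oF': drop the leading
# space from the lookahead so the space stays inside the match.
pattern_keep_space = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?=\n \n\n )'
titles_keep_space = re.findall(pattern_keep_space, contents)

print(titles)
print(titles_keep_space)
```

Note that with the lookahead written as (?= \n \n\n ) the match ends at 'oF' with no trailing space, because the leading space belongs to the lookahead. If you need the result exactly as in the question ('... \n\noF '), the third variant should give that. On this sample the greedy and lazy versions return the same thing, since the delimiter occurs only once within the characters the class allows.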