How to stop the matching of regex at one string with a certain pattern?

Question

I'm trying to extract the titles of some tables from plain text with a regular expression in Python.

The plain text was exported from some PDF files and contains a lot of \n characters. I tried to stop the match before the first occurrence of the pattern \n \n\n, but the regex always returned some extra characters.

Here's an example.

The string was:

contents = '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44   \n \n\n \n\nKJK TechCen    Rep # 5243 \n \n\n \n\n95 \n\nTable 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'

The regex I used was:

re.findall(r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+ [^ \n \n\n ]', contents)

I wanted the resulting string to start from 'Table XXX' and end right before the first ' \n \n\n ', like this:

'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF '

But the actual string I got was:

'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V'

So how could I modify the regex to get rid of the annoying '\n \n\n PressRel V'?

Then use a lookahead or a capturing group: Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )
– Wiktor Stribiżew, Commented Mar 27, 2019 at 13:57

The fourth bird · Accepted Answer · 2019-03-27 13:57:53Z

1

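Instead of ending the pattern with a negated character class, which only matches a single character, you can use a positive lookahead (?=...) to assert that the terminating sequence is directly to the right of the current position without consuming it.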

Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )
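As a rough illustration, the lookahead version could be tried in Python like this. The contents string here is a shortened excerpt of the one in the question, and the variable names are only chosen for this sketch.

```python
import re

# Shortened excerpt of the string from the question (an assumption for brevity).
contents = ('Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF'
            ' \n \n\n PressRel V \n% \n\nLiq/To')

# (?= \n \n\n ) only checks that the terminator follows; it is not part of the match.
pattern = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )'

titles = re.findall(pattern, contents)
print(titles)
# ['Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF']
```

Because the terminator is only asserted, the returned string ends at 'oF' and needs no trimming afterwards.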


Or you could capture the title in a group and match the terminating sequence after it:

(Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+) \n \n\n
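A small sketch of the capturing-group variant, again on a shortened excerpt of the question's string (the excerpt and names are assumptions for this example). When a pattern contains exactly one group, re.findall returns the group contents rather than the whole match.

```python
import re

# Shortened excerpt of the string from the question (an assumption for brevity).
contents = ('Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF'
            ' \n \n\n PressRel V \n% \n\nLiq/To')

# The title is captured in group 1; the terminator is matched but left outside the group.
pattern = r'(Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+) \n \n\n'

titles = re.findall(pattern, contents)   # list of group-1 strings
match = re.search(pattern, contents)     # same pattern, match-object view

print(titles)
# ['Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF']
print(match.group(1) == titles[0])
# True
```

Unlike the lookahead, this version consumes the terminator, which matters only if a later match needs to start inside it.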



mossymountain · Accepted Answer · 2019-03-27 14:20:50Z

1

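You need a non-greedy +? instead of +, since every character of the end sequence (space and \n) is also allowed by the bracketed character class. A greedy + therefore runs past the end sequence and only backtracks as far as it has to.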

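import re

# Not a raw string: end holds the 6 actual characters, so len(end) is the matched length.
end = ' \n \n\n '
# +? is lazy, so the match stops at the first place where end can follow.
result = re.findall(r'Table[^:]*:[a-zA-Z0-9 :&–=\n%@,()°-]+?' + end, contents)
# result == ['Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n ']

# to chop off the end, if needed:
result = [x[:-len(end)] for x in result]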
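To see why the lazy quantifier matters, here is a small sketch on a made-up string (not from the question) in which the end sequence occurs twice. It also combines +? with a lookahead, which is one way to avoid the slicing step; the simplified character class is an assumption for this example.

```python
import re

# Hypothetical text with two occurrences of the end sequence ' \n \n\n '.
text = 'Table 1: first \n \n\n second \n \n\n third'

# Greedy +: consumes as much as possible, so it backs off only to the LAST terminator.
greedy = re.findall(r'Table[^:]*:[a-zA-Z0-9 \n]+(?= \n \n\n )', text)

# Lazy +?: grows one character at a time, so it stops at the FIRST terminator.
lazy = re.findall(r'Table[^:]*:[a-zA-Z0-9 \n]+?(?= \n \n\n )', text)

print(greedy)
# ['Table 1: first \n \n\n second']
print(lazy)
# ['Table 1: first']
```

On the string in the question both forms happen to give the same result, because the end sequence appears only once within the run of allowed characters.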

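The [^ \n \n\n ] part in your example is equivalent to [^ \n]: "a single character that is not a newline or a space". Repeating characters inside a character class has no effect, so it cannot express a sequence.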
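A quick check of that equivalence, as a sketch; the sample characters are arbitrary choices for this example.

```python
import re

original_class = re.compile(r'[^ \n \n\n ]')  # the class from the question
simplified_class = re.compile(r'[^ \n]')      # the same set without duplicates

samples = [' ', '\n', 'P', '%', '\t']

# Each class matches exactly one character; compare the verdicts per sample.
original_hits = [bool(original_class.fullmatch(ch)) for ch in samples]
simplified_hits = [bool(simplified_class.fullmatch(ch)) for ch in samples]

print(original_hits)
# [False, False, True, True, True]
print(original_hits == simplified_hits)
# True
```

This is why the original pattern kept going until it found a space followed by a visible character such as the 'V' in 'PressRel V'.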

edited Mar 27, 2019 at 14:20

answered Mar 27, 2019 at 14:09


1 Comment

Yujian Over a year ago

This method also worked well. Thank you for the explanation of [^ \n \n\n ].
