0

I'm trying to extract the titles of some tables from plain text with a regular expression in Python.

The plain text was exported from some PDF files and contains a lot of \ns. I tried to stop the match before the first occurrence of the pattern \n \n\n, but the regex always returned some extra characters.

Here's an example.

The string was:

contents = '\n\n\n\n\n\n\n\nClient: ABC area: Location Mc\nHole: 33-44   \n \n\n \n\nKJK TechCen    Rep # 5243 \n \n\n \n\n95 \n\nTable 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'

The regex I used was:

re.findall(r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+ [^ \n \n\n ]', contents)

I wanted the resulting string to start from 'Table XXX' and end right before the first ' \n \n\n ', like this:

'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF '

But the actual string I got was:

'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n PressRel V'

So how could I modify the regex to get rid of the annoying '\n \n\n PressRel V'?

1
  • Then use a lookahead or a capturing group, for example Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n ). Commented Mar 27, 2019 at 13:57

2 Answers

1

Instead of ending the pattern with a negated character class, you can use a positive lookahead (?=...) to assert that the terminator follows directly to the right of the match, without consuming it.

Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+(?= \n \n\n )


Alternatively, capture the title in a group and match the terminator after it; re.findall then returns only the captured group:

(Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+) \n \n\n 
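A minimal Python sketch of both variants, run against a shortened copy of the sample string from the question. The names body, with_lookahead and with_group are only illustrative. Because the terminator itself starts with a space, both variants should stop at 'oF', without the trailing space shown in the question's expected output.

```python
import re

# Shortened copy of the sample text from the question.
contents = (
    'KJK TechCen    Rep # 5243 \n \n\n \n\n95 \n\n'
    'Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF '
    '\n \n\n PressRel V \n% \n\nLiq/To \n% \n\nLiq/Sat \nBu \n\nDenCom'
)

# Title part of the pattern, shared by both variants.
body = r'Table *\d.+:* *[a-zA-Z0-9 :&–=\n%@,()°-]+'

# Variant 1: the lookahead checks for the terminator but does not consume it.
with_lookahead = re.findall(body + r'(?= \n \n\n )', contents)

# Variant 2: the terminator is matched, but findall returns only the group.
with_group = re.findall('(' + body + r') \n \n\n ', contents)

print(with_lookahead)
print(with_group)
```

Note that the + after the character class is still greedy here, so if the class can reach several terminators in a row, both variants would stop at the last one rather than the first.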


1

You need a non-greedy +? instead of +, because every character of the end sequence (space and \n) is also in the bracketed character class, so a greedy + runs straight past the terminator.

import re

# Non-raw string, so len(end) equals the length of the matched terminator (6 characters)
end = ' \n \n\n '
result = re.findall(r'Table[^:]*:[a-zA-Z0-9 :&–=\n%@,()°-]+?' + end, contents)
# result == ['Table 3.1:  Joined Liq L1 (P = 40 \n@ 12), Test With 2 % \n\noF \n \n\n ']

# to chop off the end, if needed:
result = [x[:-len(end)] for x in result]

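The [^ \n \n\n ] part in your example is equivalent to [^ \n]: a character class is a set, so repeating characters inside it changes nothing. It means "a single character that is neither a space nor a newline", not "anything other than the sequence \n \n\n".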
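A quick way to check that equivalence is to compare the two classes character by character. This is only a sketch; the names original and simplified are illustrative.

```python
import re

original = re.compile(r'[^ \n \n\n ]')   # class as written in the question
simplified = re.compile(r'[^ \n]')       # same set of excluded characters

# Both classes should accept or reject every single character identically.
for ch in [' ', '\n', '\t', 'P', '%', '3']:
    assert bool(original.fullmatch(ch)) == bool(simplified.fullmatch(ch))

# A class always matches exactly one character, never a sequence.
print(original.fullmatch('P') is not None)   # a letter is accepted
print(original.fullmatch(' ') is None)       # a space is rejected
print(original.fullmatch('PV') is None)      # two characters never match
```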

1 Comment

This method also worked well. Thank you for the explanation of [^ \n \n\n ].
