Other people have offered up regex solutions, which are good but may behave unexpectedly at times.
If the pages are exactly as shown in your example, that is:
- No other HTML tags are present, only the <html> and <pre> tags
- The number of lines is always consistent
- The spacing between lines is always consistent
Then a simple approach like this will do:

    my_text = """<html>
    <pre>


    A Short Study of Notation Efficiency

    CACM August, 1960

    Smith Jr., H. J.

    CA600802 JB March 20, 1978 9:02 PM

    205 4 164
    210 4 164
    214 4 164
    642 4 164
    1 5 164
    </pre>
    </html>"""

    # Split into lines; the blank lines count, which is why the indexes skip by two.
    lines = my_text.split("\n")
    title = lines[4]
    journal = lines[6]
    author = lines[8]
    date = lines[10]

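Fixed indexes fail quietly if a page's layout shifts, so it may be worth checking what you pulled out. Here is a minimal sketch, assuming the layout the indexes above imply (two blank lines after `<pre>`, one blank line between fields). The names `FIELD_INDEXES` and `extract_by_position` are hypothetical, made up for illustration.

```python
# Hypothetical mapping of field name -> line index, taken from the indexes above.
FIELD_INDEXES = {"title": 4, "journal": 6, "author": 8, "date": 10}


def extract_by_position(text):
    """Pick fields by fixed line index, raising if the layout does not match."""
    lines = text.split("\n")
    needed = max(FIELD_INDEXES.values())
    if len(lines) <= needed:
        raise ValueError(f"expected more than {needed} lines, got {len(lines)}")
    fields = {name: lines[i].strip() for name, i in FIELD_INDEXES.items()}
    # An empty field means a blank line sat where content was expected.
    empty = [name for name, value in fields.items() if not value]
    if empty:
        raise ValueError(f"layout mismatch, empty fields: {empty}")
    return fields


# Assumed sample: same content as the example, with the blank lines written out.
sample = (
    "<html>\n<pre>\n\n\n"
    "A Short Study of Notation Efficiency\n\n"
    "CACM August, 1960\n\n"
    "Smith Jr., H. J.\n\n"
    "CA600802 JB March 20, 1978 9:02 PM\n\n"
    "205 4 164\n</pre>\n</html>"
)
record = extract_by_position(sample)
print(record["title"])
```

An exception on a malformed page is usually easier to notice than a title that silently contains the author's name.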
If you can't guarantee the spacing between lines, but you can guarantee that you only want the first four non-blank lines inside the <html><pre>, then:
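
    import pprint

    max_extracted_lines = 4
    extracted_lines = []

    # `lines` is the result of my_text.split("\n") from the first snippet.
    for line in lines:
        stripped = line.strip()
        # Skip the opening tags.
        if stripped == "<html>" or stripped == "<pre>":
            continue
        # Keep only lines that contain something other than whitespace.
        if stripped:
            extracted_lines.append(stripped)
        if len(extracted_lines) >= max_extracted_lines:
            break

    pprint.pprint(extracted_lines)
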
Giving output:

    ['A Short Study of Notation Efficiency',
     'CACM August, 1960',
     'Smith Jr., H. J.',
     'CA600802 JB March 20, 1978 9:02 PM']

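If you need this for many pages, the same loop can be wrapped in a function. This is only a sketch: the function name `first_content_lines`, its parameters, and the irregular spacing in the sample string are my own assumptions for illustration.

```python
def first_content_lines(text, max_lines=4, skip=("<html>", "<pre>")):
    """Return the first `max_lines` non-blank lines, ignoring the tags in `skip`."""
    found = []
    for line in text.splitlines():
        stripped = line.strip()
        # Ignore blank or whitespace-only lines and the opening tags.
        if not stripped or stripped in skip:
            continue
        found.append(stripped)
        if len(found) >= max_lines:
            break
    return found


# Assumed sample with deliberately uneven spacing between the lines.
page = (
    "<html>\n<pre>\n\n\n"
    "A Short Study of Notation Efficiency\n"
    "CACM August, 1960\n\n\n"
    "Smith Jr., H. J.\n   \n"
    "CA600802 JB March 20, 1978 9:02 PM\n"
    "205 4 164\n</pre>\n</html>"
)
title, journal, author, date = first_content_lines(page)
print(title)
print(author)
```

Note that the tuple unpacking raises a `ValueError` if a page has fewer than four content lines, so check the length of the returned list first if that can happen.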
Don't use regex where simple string operations will do.