Complicated parsing in python

Question

I have a weird parsing problem with Python. I need to parse the following text.

I need only the section between (not including) the "pre" tag and the column of numbers (starting with 205 4 164). I have several pages in this format.

<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>

What parts are you trying to parse? What result format are you seeking?
– sblom, Commented Apr 9, 2012 at 23:04
I just want this part: A Short Study of Notation Efficiency CACM August, 1960 Smith Jr., H. J. CA600802 JB March 20, 1978 9:02 PM
– Quazi Farhan, Commented Apr 9, 2012 at 23:07
The part between <pre> and column of numbers. I am good with a string. From there I can work. Thanks.
– Quazi Farhan, Commented Apr 9, 2012 at 23:08

Ry- · Accepted Answer · 2012-04-09 23:53:45Z

3

Quazi, this calls for a regex, specifically <pre>(.+?)(?:\d+\s+){3} with the DOTALL flag enabled.

You can find out about how to use regex in Python at http://docs.python.org/library/re.html and if you do a lot of this sort of string extraction, you'll be very glad you did. Going over my provided regex piece-by-piece:

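<pre> directly matches the opening pre tag
(.+?) lazily matches and captures any characters (including newlines, because of DOTALL), stopping as soon as the rest of the pattern can match
(?:\d+\s+){3} matches some digits followed by some whitespace, three times in a row, without capturing them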
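
A minimal sketch of the pattern above in use, assuming the page has already been read into a string. The names `page`, `HEADER_RE` and `header` are illustrative and not part of the answer.

```python
import re

# Sample page from the question; in practice this would be read from a file.
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

# re.DOTALL lets "." match newlines, so the lazy group can span several lines.
HEADER_RE = re.compile(r"<pre>(.+?)(?:\d+\s+){3}", re.DOTALL)

match = HEADER_RE.search(page)
# group(1) is the captured text between <pre> and the first row of numbers.
header = match.group(1).strip() if match else None
print(header)
```

Note that this relies on no header line containing three whitespace-separated numbers in a row; if one did, the lazy group would stop there instead of at the numeric column.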

edited Apr 9, 2012 at 23:53

Ry-♦

answered Apr 9, 2012 at 23:21

DSimon

1 Comment

DSimon Over a year ago

@minitech, thanks for the correction! I hadn't noticed that SO had gobbled my pre tag.

Ry- · Accepted Answer · 2012-04-10 00:37:50Z

2

Here's a regular expression to do that:

import re

# Raw string keeps \d and \s intact; re.S lets "." match newlines
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)

# ...

result = findData.search(data).group(0).strip()

Here's a demo.
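
As a rough illustration, here is the lookaround pattern above run against the sample page from the question. The variable names `page`, `find_data` and `result` are assumptions made for this sketch.

```python
import re

# Sample page from the question.
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

# Lookbehind anchors the match just after <pre>; the lookahead requires that
# only digits and whitespace remain before </pre>. Neither is part of the match.
find_data = re.compile(r"(?<=<pre>).+?(?=[\d\s]*</pre>)", re.S)

found = find_data.search(page)
result = found.group(0).strip() if found else None
print(result)
```

One caveat with this approach: because the lookahead accepts any run of digits and whitespace, a final header line that itself ended in digits (a bare year, say) would lose those trailing digits.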

edited Apr 10, 2012 at 0:37

answered Apr 9, 2012 at 23:25

Ry-♦

3 Comments

Li-aung Yip Over a year ago

Not exactly what the OP wants - based on the comments to the OP he only wants the four text lines before the columns of numbers.

Ry- Over a year ago

@Li-aungYip: And that's exactly what this code does. Just not as a list, like yours. Is that the problem?

Li-aung Yip Over a year ago

Your regex group(0) includes the columns of numbers. See the output.

user1277476 · Accepted Answer · 2012-04-09 23:42:06Z

2

I'd probably use lxml or BeautifulSoup. IMO, regexes are heavily overused, especially for parsing HTML.
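
The answer names lxml and BeautifulSoup but gives no code. As a hedged sketch of the same idea, the example below swaps in the standard library's `html.parser` so it runs without third-party packages. The class `PreTextExtractor` and the function `header_text` are hypothetical names, and the rule "stop at the first line made up only of numbers" is an assumption based on the sample page.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text that appears inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def header_text(page):
    """Return the non-empty <pre> lines that precede the numeric column."""
    parser = PreTextExtractor()
    parser.feed(page)
    parser.close()
    kept = []
    for line in "".join(parser.chunks).splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if all(token.isdigit() for token in tokens):
            break  # first row consisting only of numbers: header is over
        kept.append(line.strip())
    return "\n".join(kept)


page = (
    "<html>\n<pre>\n\n\nA Short Study of Notation Efficiency\n\n"
    "CACM August, 1960\n\nSmith Jr., H. J.\n\n"
    "CA600802 JB March 20, 1978  9:02 PM\n\n"
    "205 4   164\n210 4   164\n\n</pre>\n</html>"
)
print(header_text(page))
```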

answered Apr 9, 2012 at 23:42

user1277476

Li-aung Yip · Accepted Answer · 2012-04-10 00:10:21Z

1

Other people have offered up regex solutions, which are good but may behave unexpectedly at times.

If the pages are exactly as shown in your example, that is:

No other HTML tags are present - only the <html> and <pre> tags
The number of lines is always consistent
The spacing between lines is always consistent

Then a simple approach like this will do:

my_text = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

lines = my_text.split("\n")

title   = lines[4]
journal = lines[6]
author  = lines[8]
date    = lines[10]

If you can't guarantee the spacing between lines, but you can guarantee that you only want the first four non-empty lines inside the <html><pre>, then:

import pprint

max_extracted_lines = 4
extracted_lines = []
for line in lines:
    if line == "<html>" or line == "<pre>":
        continue
    if line:
        extracted_lines.append(line)
    if len(extracted_lines) >= max_extracted_lines:
        break

pprint.pprint(extracted_lines)

Giving output:

['A Short Study of Notation Efficiency',
 'CACM August, 1960',
 'Smith Jr., H. J.',
 'CA600802 JB March 20, 1978  9:02 PM']

Don't use regex where simple string operations will do.
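
If the number of header lines cannot be guaranteed either, one possible variation is to keep collecting lines until the first one made up only of numbers. This is a sketch in the same string-operations spirit, not part of the original answer; the function name `header_lines` and the stopping rule are assumptions based on the sample page.

```python
def header_lines(page_text):
    """Collect non-empty lines until the first all-numeric row."""
    kept = []
    for line in page_text.split("\n"):
        stripped = line.strip()
        # Skip blank lines and the opening tags.
        if not stripped or stripped in ("<html>", "<pre>"):
            continue
        # A row such as "205 4   164" splits into tokens that are all digits.
        if all(token.isdigit() for token in stripped.split()):
            break
        kept.append(stripped)
    return kept


sample = (
    "<html>\n<pre>\n\n\nA Short Study of Notation Efficiency\n\n"
    "CACM August, 1960\n\nSmith Jr., H. J.\n\n"
    "CA600802 JB March 20, 1978  9:02 PM\n\n"
    "205 4   164\n210 4   164\n\n</pre>\n</html>"
)
for entry in header_lines(sample):
    print(entry)
```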

answered Apr 10, 2012 at 0:10

Li-aung Yip

3 Comments

DSimon Over a year ago

I couldn't disagree more; regexes do not "behave unexpectedly at times", they follow very straightforward rules. On the other hand, making unnecessary assumptions about minor details of the format of the input data is likely to backfire.

Quazi Farhan Over a year ago

Thanks for the alternate approach. Unfortunately, there really is not a way to ensure that all the factors will be consistent. But other people's regexes have worked well. Thank you for your time.

Li-aung Yip Over a year ago

@QuaziFarhan: no worries. As with all things, you should use the simplest approach that works - but no simpler. This approach is evidently a little too simplistic. ;)
