2

I have a weird parsing problem with Python. I need to parse the following text.

I need only the section between (not including) the "pre" tag and the column of numbers (starting with 205 4 164). I have several pages in this format.

<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>
3
  • What parts are you trying to parse? What result format are you seeking? Commented Apr 9, 2012 at 23:04
  • I just want this part: A Short Study of Notation Efficiency CACM August, 1960 Smith Jr., H. J. CA600802 JB March 20, 1978 9:02 PM Commented Apr 9, 2012 at 23:07
  • The part between <pre> and column of numbers. I am good with a string. From there I can work. Thanks. Commented Apr 9, 2012 at 23:08

4 Answers 4

3

Quazi, this calls out for a regex, specifically <pre>(.+?)(?:\d+\s+){3} with the DOTALL flag enabled.

You can find out about how to use regex in Python at http://docs.python.org/library/re.html and if you do a lot of this sort of string extraction, you'll be very glad you did. Going over my provided regex piece-by-piece:

<pre> just directly matches the pre tag
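(.+?) lazily matches and captures any characters (including newlines, thanks to DOTALL), stopping as soon as the rest of the pattern can match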
(?:\d+\s+){3} matches against some numbers followed by some whitespace, three times in a row
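
Put together, a minimal sketch of the pattern in use might look like this. The `page` variable and its shortened contents are stand-ins for one of your pages, and `header_re` is just a name picked for illustration:

```python
import re

# Stand-in for one downloaded page (shortened copy of the sample in the question).
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

# DOTALL lets "." match newlines, so the lazy group can span several lines.
header_re = re.compile(r"<pre>(.+?)(?:\d+\s+){3}", re.DOTALL)

match = header_re.search(page)
# Guard against pages that do not fit the pattern instead of crashing on None.
header = match.group(1).strip() if match else None
print(header)
```

The captured group still contains the blank lines between the four header lines; `header.split("\n")` plus a filter on empty strings would turn it into a list if that is easier to work with.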


1 Comment

@minitech, thanks for the correction! I hadn't noticed that SO had gobbled my pre tag.
2

Here's a regular expression to do that:

import re

# Raw string so that \d and \s reach the regex engine unchanged.
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)

# ...

result = findData.search(data).group(0).strip()

Here's a demo.
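
A self-contained run could look like the sketch below. It assumes `data` holds one page as a string; the contents here are a shortened copy of the sample from the question, and the `None` check is an addition for pages that do not match:

```python
import re

# Assumed input: the sample page from the question, shortened to two number rows.
data = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

# Lookbehind anchors the match just after <pre>; the lookahead requires that
# only digits and whitespace remain before </pre>.
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)

match = findData.search(data)
result = match.group(0).strip() if match else None
print(result)
```

On this sample the lazy match stops right after `9:02 PM`, because everything between that point and `</pre>` is digits and whitespace. One caveat: if the last header line ended in digits, those digits would satisfy the lookahead too and be cut off.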

3 Comments

Not exactly what the OP wants - based on the comments to the OP he only wants the four text lines before the columns of numbers.
@Li-aungYip: And that's exactly what this code does. Just not as a list, like yours. Is that the problem?
Your regex group(0) includes the columns of numbers. See the output.
2

I'd probably use lxml or BeautifulSoup. IMO, regexes are heavily overused, especially for parsing HTML.
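
As a rough sketch of the parser route, here is a version built on the standard library's `html.parser` rather than lxml or BeautifulSoup, so it runs without installing anything. The class name `PreText`, the helpers `is_number_row` and `header_lines`, and the shortened `page` are made up for illustration, and the rule "a number row contains only digits and whitespace" is an assumption about the pages:

```python
from html.parser import HTMLParser
from itertools import takewhile


class PreText(HTMLParser):
    """Collects the text found inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than zero while inside a <pre>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def is_number_row(line):
    """True for rows such as '205 4   164' (only digits and whitespace)."""
    return all(token.isdigit() for token in line.split())


def header_lines(html):
    parser = PreText()
    parser.feed(html)
    parser.close()
    lines = [ln.strip() for ln in "".join(parser.chunks).splitlines()]
    non_blank = [ln for ln in lines if ln]
    # Keep everything up to the first row made only of numbers.
    return list(takewhile(lambda ln: not is_number_row(ln), non_blank))


page = """<html>
<pre>

A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

print(header_lines(page))
```

The parser only locates the `<pre>` text; trimming the numeric column is still ordinary string work, which keeps the two concerns separate.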


1

Other people have offered up regex solutions, which are good but may behave unexpectedly at times.

If the pages are exactly as shown in your example, that is:

  • No other HTML tags are present - only the <html> and <pre> tags
  • The number of lines is always consistent
  • The spacing between lines is always consistent

Then a simple approach like this will do:

my_text = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

lines = my_text.split("\n")

title   = lines[4]
journal = lines[6]
author  = lines[8]
date    = lines[10]

If you can't guarantee the spacing between lines, but you can guarantee that you only want the first four non-whitespace lines inside the <html><pre>, skip the tags and the blank lines and collect the first four lines that remain:

import pprint

max_extracted_lines = 4
extracted_lines = []
for line in lines:
    if line == "<html>" or line == "<pre>":
        continue
    if line:
        extracted_lines.append(line)
    if len(extracted_lines) >= max_extracted_lines:
        break

pprint.pprint(extracted_lines)

Giving output:

['A Short Study of Notation Efficiency',
 'CACM August, 1960',
 'Smith Jr., H. J.',
 'CA600802 JB March 20, 1978  9:02 PM']

Don't use regex where simple string operations will do.
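
If the number of header lines varies from page to page, the same string-only idea can stop at the first row of numbers instead of counting to four. A minimal sketch, assuming every page contains a `<pre>` tag and that number rows hold nothing but digits and whitespace (`extract_header` and `sample` are made-up names):

```python
def extract_header(page_text):
    """Return the non-blank lines between <pre> and the first all-numeric row."""
    # Everything after the first "<pre>"; an empty string if the tag is missing.
    _, _, body = page_text.partition("<pre>")
    header = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue                     # skip blank lines
        if line == "</pre>":
            break                        # no number rows on this page
        if all(token.isdigit() for token in line.split()):
            break                        # reached the column of numbers
        header.append(line)
    return header


sample = """<html>
<pre>

A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

print(extract_header(sample))
```

This drops the fixed line indices and the fixed count of four, at the price of the assumption about what a number row looks like.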

3 Comments

I couldn't disagree more; regexes do not "behave unexpectedly at times", they follow very straightforward rules. On the other hand, making unnecessary assumptions about minor details of the format of the input data is likely to backfire.
Thanks for the alternate approach. Unfortunately, there really is no way to ensure that all the factors will be consistent. But other people's regex has worked well. Thank you for your time.
@QuaziFarhan: no worries. As with all things, you should use the simplest approach that works - but no simpler. This approach is evidently a little too simplistic. ;)
