Extracting data from HTML-files with BeautifulSoup and Python

Question

I need to extract data from HTML-files. The files in question are, most likely, automatically generated. I have uploaded the code of one of these files to Pastebin: http://pastebin.com/9Nj2Edfv. This is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3Acs&pos=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%2C&hwords=&action=GO&visu=%23texte

The data I need to extract is found under the different headings.

This is what I have so far:

from BeautifulSoup import BeautifulSoup
ecj_data = open(r"data\ecj_1.html", 'r').read()  # raw string so the backslash is never read as an escape

soup = BeautifulSoup(ecj_data)

celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li

print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure

I have all the data stored locally which is the reason it opens the file ecj_1.html.

The Celex number and the authentic language work reasonably well.

celex returns

"Celex number: 
61977J0059"

auth_lang returns "Authentic language: <li>French</li>"

I need just the contents of the h1 tag (not the break at the end).

[Also, I need auth_lang to return just "French", without the <li> tags.] This is no longer a problem: I realized I could just append ".text" to "auth_lang".

Procedure on the other hand returns this:

    Type of procedure: <li>
    <strong>Type of procedure:</strong>
    <br />
    Reference for a preliminary ruling
    </li>

which is not what I want: I only need it to return "Reference for a preliminary ruling".

Is there any way I can achieve this?

Second edit: I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0] and added .text to celex in the print statement.

fraxel · Accepted Answer · 2012-03-20 14:42:31Z

4

Each of the tags you found has a .contents list of child nodes; for the first two, that list has length 1. procedure.contents, however, is 5 elements long, and the entry you are after (in this case) is at index 4, i.e. the fifth and last element. I've also used splitlines() to get rid of the newlines.

print "Celex number:", celex.contents[0].splitlines()[1]
print "Authentic language:", auth_lang.contents[0].splitlines()[0]
print "Type of procedure:", procedure.contents[4].splitlines()[1]

output:

Celex number: 61977J0059
Authentic language: French
Type of procedure: Reference for a preliminary ruling
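
To see why the index after splitlines() is [1] rather than [0], here is a minimal sketch that needs no HTML file. The list procedure_contents is a hypothetical stand-in for procedure.contents, with the tags written as plain strings and the whitespace assumed from the output shown in the question; the real nodes may contain extra indentation.

```python
# Hypothetical stand-in for procedure.contents: the five child nodes of the
# <li>, with tags shown as plain strings for illustration only.
procedure_contents = [
    "\n",
    "<strong>Type of procedure:</strong>",
    "\n",
    "<br />",
    "\nReference for a preliminary ruling\n",
]

text_node = procedure_contents[4]  # index 4 is the fifth and last child

# The node starts with a newline, so splitlines() yields an empty first item
# and the wanted text ends up at index 1.
lines = text_node.splitlines()
print(lines)     # ['', 'Reference for a preliminary ruling']
print(lines[1])  # Reference for a preliminary ruling

# strip() returns the same text without depending on which line it sits on.
print(text_node.strip())
```

If the position of the text within the node may vary between files, strip() is probably the less fragile choice of the two, since it does not rely on the text being on the second line.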


1 Comment

A2D2 Over a year ago

Fraxel: Thank you very much! It works like a charm. The idea is to somehow transfer the output of this file to a database. I believe you may have solved a future problem when you showed me how to get rid of the newlines as they are likely to screw something up later on. Thanks again!
