I need to extract data from HTML files. The files in question are most likely automatically generated. I have uploaded the source of one of these files to Pastebin: http://pastebin.com/9Nj2Edfv. This is the link to the actual page: http://eur-lex.europa.eu/Notice.do?checktexts=checkbox&val=60504%3Acs&pos=1&page=1&lang=en&pgs=10&nbl=1&list=60504%3Acs%2C&hwords=&action=GO&visu=%23texte
The data I need to extract is found under the different headings.
This is what I have so far:
from BeautifulSoup import BeautifulSoup
ecj_data = open(r"data\ecj_1.html", 'r').read()  # raw string so the backslash is never read as an escape
soup = BeautifulSoup(ecj_data)
celex = soup.find('h1')
auth_lang = soup('ul', limit=14)[13].li
procedure = soup('ul', limit=20)[17].li
print "Celex number:", celex.renderContents(),
print "Authentic language:", auth_lang
print "Type of procedure:", procedure
I have all the data stored locally, which is why the script opens the file ecj_1.html.
The Celex number and the authentic language work reasonably well.
celex returns
"Celex number:
61977J0059"
auth_lang returns "Authentic language: <li>French</li>"
I need just the contents of the h1 tag (not the break at the end).
[Also, I need auth_lang to return just "French", and not the <li>-tags.]
This is not a problem anymore. I realized I could just add ".text" to the end of "auth_lang".
procedure, on the other hand, returns this:
Type of procedure: <li>
<strong>Type of procedure:</strong>
<br />
Reference for a preliminary ruling
</li>
which is quite wrong as I just need it to return "Reference for a preliminary ruling".
Is there any way I can achieve this?
Second edit:
I replaced celex = soup.find('h1') with celex = soup('h1', limit=2)[0] and added .text to celex in the print statement.
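For the "Type of procedure" case, one possible approach is to take only the last non-empty piece of text inside the <li>, which skips the <strong> label and the <br />. The sketch below is not the BeautifulSoup code from above: it is a stand-in that uses only Python 3's standard html.parser module, and the names TextCollector and last_text are made up for illustration. It assumes the wanted value is always the final text in the element, as in the fragment shown.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found in an HTML fragment.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.chunks.append(text)


def last_text(fragment):
    """Return the last non-empty text chunk of the fragment, or None."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    return collector.chunks[-1] if collector.chunks else None


# The <li> fragment quoted in the question.
fragment = """<li>
<strong>Type of procedure:</strong>
<br />
Reference for a preliminary ruling
</li>"""

print("Type of procedure:", last_text(fragment))
```

With the script from the question, the string form of the element (str(procedure)) could presumably be passed to last_text in the same way. Working on the last text chunk rather than on a fixed position avoids depending on how many tags precede the value.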