Parsing HTML File using Python: the starting point [duplicate]

Question

I have an HTML file in the following format and want to parse it with Python. However, I don't know how to use the xml module. Your suggestions are very welcome.

Note: Sorry again for my ignorance. The question is not specific. However, since I have been frustrated with this kind of parsing script, I would like a concrete answer, as described by the answerers (thank you all), to use as a starting point. I hope you understand.

<html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Weibo Landscape: Historical Archive of 800 Verified Accounts</title>
    </head>
    <body>
<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
score:-5.76427445942 <br>
uid:1893278624 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX"  target="_blank">source</a> <br>
time:Thu Oct 06 17:10:59 +0800 2011 <br>
content: Zuccotti Park。 <br>
<br></div>
<div><br>
related 2-th-weibo:<br>
mid:3366839418074456<br>
score:-5.80535767804 <br>
uid:1813080181 <br>
link:<a href="http://weibo.com/1813080181/xs2NvxSxa"  target="_blank">source</a> <br>
time:Mon Oct 10 06:48:53 +0800 2011 <br>
content:rt the tweet <br>
rtMid:3366833975690765 <br>
rtUid:1893801487 <br>
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br>
<br></div>

    </body>
    </html>

Possible Duplicate:
Extracting text from HTML file using Python

There are many questions on SO about parsing HTML with Python. Please spend a couple of minutes searching around. In the question linked above, see the example with HTMLParser.
– Eli Bendersky, Commented May 2, 2012 at 7:14
Sure, I have searched, but it's not what I want. I want the result to be more structured, rather than just converted to text.
– Frank Wang, Commented May 2, 2012 at 7:31
This is just one example. There are several Qs and As about HTML parsing: stackoverflow.com/search?q=python%20html%20parse
– Eli Bendersky, Commented May 2, 2012 at 7:35
@FrankWANG: Have you decided what you want to extract? What have you tried? If you are looking for a starting point, there are many other Q&As to set you up. Your question is currently too general and you don't appear to have made any effort yourself.
– MattH, Commented May 2, 2012 at 9:18
@MattH, thank you for the reminder. I have tried to write a parser using the xml module and the lxml module.
– Frank Wang, Commented May 2, 2012 at 10:38

HAL · Accepted Answer · 2012-05-02 07:56:37Z

3

I suggest that you take a look at the Python library BeautifulSoup. It helps you with navigating and searching HTML data.

daedalus · Accepted Answer · 2012-05-02 21:39:33Z

I did this as an exercise. It should get you on the right track, if this is still useful.

# -*- coding: utf-8 -*-

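# Python 2 with BeautifulSoup 3 (the newer bs4 package is imported as "from bs4 import BeautifulSoup")
from BeautifulSoup import BeautifulSoup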


html = '''<html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Weibo Landscape: Historical Archive of 800 Verified Accounts</title>
    </head>
    <body>
<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
score:-5.76427445942 <br>
uid:1893278624 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX"  target="_blank">source</a> <br>
time:Thu Oct 06 17:10:59 +0800 2011 <br>
content: Zuccotti Park。 <br>
<br></div>
<div><br>
related 2-th-weibo:<br>
mid:3366839418074456<br>
score:-5.80535767804 <br>
uid:1813080181 <br>
link:<a href="http://weibo.com/1813080181/xs2NvxSxa"  target="_blank">source</a> <br>
time:Mon Oct 10 06:48:53 +0800 2011 <br>
content:rt the tweet <br>
rtMid:3366833975690765 <br>
rtUid:1893801487 <br>
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br>
<br></div>

    </body>
    </html>'''

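data = []
soup = BeautifulSoup(html)
divs = soup.findAll('div')  # one <div> per weibo record
for div in divs:
    # Serialize the div back to a string and drop the <br /> tags,
    # leaving one "key:value" pair per line.
    div_string = str(div)
    div_string = div_string.replace('<br />', '')
    div_list = div_string.split('\n')
    # Discard the first and last lines, which hold the <div> and </div> tags.
    div_list = div_list[1:-1]
    record = []
    for item in div_list:
        # Split on the first colon only, so URLs and timestamps stay intact.
        record.append(tuple(item.split(':', 1)))
    data.append(record)

for record in data:
    for field in record:
        print field
    print '--------------'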

With your sample data, you will get the output below. From there, it should be easy to massage the records into any structure you want.

('related 1-th-weibo', '')
('mid', '3365546399651413')
('score', '-5.76427445942 ')
('uid', '1893278624 ')
('link', '<a href="http://weibo.com/1893278624/xrv9ZEuLX" target="_blank">source</a> ')
('time', 'Thu Oct 06 17:10:59 +0800 2011 ')
('content', ' Zuccotti Park\xe3\x80\x82 ')
--------------
('related 2-th-weibo', '')
('mid', '3366839418074456')
('score', '-5.80535767804 ')
('uid', '1813080181 ')
('link', '<a href="http://weibo.com/1813080181/xs2NvxSxa" target="_blank">source</a> ')
('time', 'Mon Oct 10 06:48:53 +0800 2011 ')
('content', 'rt the tweet ')
('rtMid', '3366833975690765 ')
('rtUid', '1893801487 ')
('rtContent', '#ows#here is the content and the link http://t.cn/aFLBgr ')
--------------
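
The answer above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same idea for Python 3, the following uses only the standard library's html.parser.HTMLParser (the module mentioned in the comments) and collects each div into a dict instead of a list of tuples. The class name WeiboRecordParser, the abridged SAMPLE string, and the choice to store the href for the link field are assumptions made for illustration; they are not part of the original answer.

```python
from html.parser import HTMLParser


class WeiboRecordParser(HTMLParser):
    """Hypothetical helper: build one dict per <div>, one key per text line."""

    def __init__(self):
        super().__init__()
        self.records = []    # finished records, one per <div>
        self._record = None  # dict being filled while inside a <div>
        self._line = []      # text fragments seen since the last <br>
        self._href = None    # href of an <a> seen on the current line

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._record, self._line, self._href = {}, [], None
        elif self._record is None:
            return  # ignore everything outside a <div>
        elif tag == "br":
            self._flush_line()  # each <br> ends one "key:value" line
        elif tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "div" and self._record is not None:
            self._flush_line()
            self.records.append(self._record)
            self._record = None

    def handle_data(self, data):
        if self._record is not None:
            self._line.append(data)

    def _flush_line(self):
        text = "".join(self._line).strip()
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, sep, value = text.partition(":")
        if sep:
            # Assumption: for a line containing a link, keep the URL
            # rather than the anchor text ("source").
            self._record[key.strip()] = self._href or value.strip()
        self._line, self._href = [], None


# Abridged copy of the sample data from the question.
SAMPLE = """<div><br>
related 1-th-weibo:<br>
mid:3365546399651413<br>
score:-5.76427445942 <br>
link:<a href="http://weibo.com/1893278624/xrv9ZEuLX"  target="_blank">source</a> <br>
time:Thu Oct 06 17:10:59 +0800 2011 <br>
<br></div>
<div><br>
related 2-th-weibo:<br>
mid:3366839418074456<br>
rtContent:#ows#here is the content and the link http://t.cn/aFLBgr <br>
<br></div>"""

parser = WeiboRecordParser()
parser.feed(SAMPLE)
parser.close()
for record in parser.records:
    print(record)
```

Because each record is a plain dict, fields can be looked up by name (for example record["mid"]), and values such as the score can be converted with float() where needed. Unlike the original answer, this sketch strips surrounding whitespace from keys and values.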
