Extract html data using regular expressions

Question

I have an HTML page that looks like this:

<tr>
    <td align=left>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
            <img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
        </a> 
        <a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
    </td>
</tr>
<tr align=right>
    <td align=left>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
            <img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
        </a> 
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html" title="b">aa</a>
    </td>
</tr>

I need to get the text

history/2c0b65635b3ac68a4d53b89521216d26.html marketing/3c0a65635b2bc68b5c43b88421306c37.html

I wrote a Python script that uses regular expressions:

import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))

where s holds the HTML page above. However, when I run this script it prints "None". Where did I go wrong?

Instead of regex, try using BeautifulSoup.

– f.rodrigues, Commented Dec 27, 2014 at 6:11

Community · Accepted Answer · 2017-05-23 10:27:31Z

Don't use regex for parsing HTML content.

Use a specialized tool - an HTML Parser.

Example (using BeautifulSoup):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""Your HTML here"""

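# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('td a[href]'):
    print(link['href'])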

Prints:

history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html

Or, if you want to get the href values that follow a pattern, use:

import re

for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])

where r'\w+/\w{32}\.html' is a regular expression applied to the href attribute of every a tag found. It matches one or more word characters (\w+: letters, digits, or underscore), followed by a slash, followed by exactly 32 word characters (\w{32}), followed by a literal dot (\., which must be escaped), followed by html.
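As for why the original script prints None: re.match only tries to match at the very beginning of the string, and the page starts with <tr>, not with a path. Scanning the whole string needs search or findall instead. A minimal sketch of that fix, assuming a trimmed copy of the question's markup (the img tags are left out, and the names pattern, at_start and found are only for illustration):

```python
import re

# Trimmed copy of the markup from the question (img tags omitted).
s = """<tr>
    <td align=left>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26.html"></a>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
    </td>
</tr>
<tr align=right>
    <td align=left>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html"></a>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html" title="b">aa</a>
    </td>
</tr>"""

# Pattern from the question, with the dot escaped and the stray spaces
# removed from the character classes.
pattern = re.compile(r"[0-9a-z]{1,15}/[0-9a-f]{32}\.html")

# match() only tries position 0, where the string starts with "<tr>".
at_start = pattern.match(s)

# findall() scans the whole string and returns every non-overlapping match.
found = pattern.findall(s)

print(at_start)
for path in found:
    print(path)
```

This prints None followed by the two paths asked for in the question. It works on this small, regular sample, but a parser remains the safer choice once the markup varies.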


nu11p01n73R · Accepted Answer · 2014-12-27 06:14:27Z


You can also write something like

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, "html.parser")  # html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print(a['href'])
... 
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
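If installing BeautifulSoup is not an option, the same parser-first approach can be sketched with the standard library's html.parser module. This is not taken from either answer; the class name HrefCollector, the sample_html string and the filter pattern are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Trimmed copy of the markup from the question (img tags omitted).
sample_html = """<tr>
    <td align=left>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26.html"></a>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
    </td>
</tr>
<tr align=right>
    <td align=left>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html"></a>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html" title="b">aa</a>
    </td>
</tr>"""

collector = HrefCollector()
collector.feed(sample_html)
collector.close()

# Keep only hrefs that are entirely "<name>/<32 hex digits>.html".
document_link = re.compile(r"\w+/[0-9a-f]{32}\.html")
wanted = [href for href in collector.hrefs if document_link.fullmatch(href)]

for href in wanted:
    print(href)
```

Here the parser finds the links and the regular expression only filters attribute values, so the pattern never has to cope with tags, quoting, or attribute order.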
