I have an HTML page that looks like this:

<tr>
    <td align=left>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
            <img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
        </a> 
        <a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
    </td>
</tr>
<tr align=right>
    <td align=left>
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37.html">
            <img src="/images/page.gif" border="0" title="полная информация о документе" width=20 height=20>
        </a> 
        <a href="marketing/3c0a65635b2bc68b5c43b88421306c37_0.html" title="b">aa</a>
    </td>
</tr>

I need to get the text

history/2c0b65635b3ac68a4d53b89521216d26.html marketing/3c0a65635b2bc68b5c43b88421306c37.html

I wrote a Python script that uses regular expressions:

import re
a = re.compile("[0-9 a-z]{0,15}/[0-9 a-f]{32}.html")
print(a.match(s))

where s holds the HTML page above. However, when I run this script it prints None. Where did I go wrong?

  • Instead of regex try using BeautifulSoup. Commented Dec 27, 2014 at 6:11

2 Answers

3

Don't use regex for parsing HTML content.

Use a specialized tool instead: an HTML parser.

Example (using BeautifulSoup):

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

data = u"""Your HTML here"""

soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid the bs4 warning
for link in soup.select('td a[href]'):
    print(link['href'])

Prints:

history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html

Or, if you want to get the href values that follow a pattern, use:

import re

for link in soup.find_all('a', href=re.compile(r'\w+/\w{32}\.html')):
    print(link['href'])

where r'\w+/\w{32}\.html' is a regular expression applied to the href attribute of every a tag found. It matches one or more word characters (\w+: letters, digits, or underscores), followed by a slash, followed by exactly 32 word characters (\w{32}), followed by a literal dot (\., escaped because an unescaped dot matches any character), followed by html.
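
A quick way to see what the pattern accepts, and why the script in the question printed None, is to try it with the standard re module alone. The sketch below is only illustrative: the href values are copied from the question, while the variable names and the one-line snippet are made up for the example. The key point is that match() only tries to match at the very beginning of the string, and the page starts with <tr>, whereas search() and findall() scan the whole string.

```python
import re

# Pattern from the answer above; the dot before "html" is escaped.
pattern = re.compile(r'\w+/\w{32}\.html')

# href values copied from the question's HTML.
hrefs = [
    "history/2c0b65635b3ac68a4d53b89521216d26.html",
    "history/2c0b65635b3ac68a4d53b89521216d26_0.html",
    "marketing/3c0a65635b2bc68b5c43b88421306c37.html",
    "marketing/3c0a65635b2bc68b5c43b88421306c37_0.html",
]

# search() looks for the pattern anywhere in the string.
# The "_0" variants fail because "_" follows the 32 characters instead of ".".
matching = [h for h in hrefs if pattern.search(h)]

# Hypothetical one-line snippet standing in for the full page.
snippet = '<tr><td><a href="history/2c0b65635b3ac68a4d53b89521216d26.html">x</a></td></tr>'

# match() only tries at position 0, so markup starting with "<tr>" never matches.
anchored = pattern.match(snippet)

# findall() returns every non-overlapping match in the string.
found_anywhere = pattern.findall(snippet)

print(matching)
print(anchored)
print(found_anywhere)
```

So the quickest fix to the original script would be swapping match for findall, although the parser-based approach is more robust against changes in the markup.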

2

You can also write something like this:

>>> soup = BeautifulSoup(html, "html.parser")  # html is the string containing the data to be parsed
>>> for a in soup.select('a'):
...     print(a['href'])
... 
history/2c0b65635b3ac68a4d53b89521216d26.html
history/2c0b65635b3ac68a4d53b89521216d26_0.html
marketing/3c0a65635b2bc68b5c43b88421306c37.html
marketing/3c0a65635b2bc68b5c43b88421306c37_0.html
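
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is an illustrative alternative, not what either answer here used; the class name HrefCollector and the shortened sample string are made up for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Shortened sample based on the first row of the question's HTML.
sample = """
<tr>
    <td align=left>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26.html">
            <img src="/images/page.gif" border="0" width=20 height=20>
        </a>
        <a href="history/2c0b65635b3ac68a4d53b89521216d26_0.html" title="C.">Th</a>
    </td>
</tr>
"""

collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

This needs more code than the BeautifulSoup version, but it has no third-party dependencies.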
