38

I’m a newbie in Python. I’m learning regexes, but I need help here.

Here comes the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?

3
  • 2
    Duplicate: stackoverflow.com/questions/430966/regex-for-links-in-html-text Commented Jan 31, 2009 at 22:04
  • 6
    I've been away from SO for a while, it's good to see I've missed nothing, and people are STILL asking how to parse HTML with regex every damn day. Commented Feb 1, 2009 at 2:30
  • 2
    @bobince Multiple times a day, it is so bad I created two questions that I can redirect people to and a form answer that points them there. Commented May 13, 2009 at 14:30

10 Answers 10

86

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

It's pretty easy to do:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
    print(tag['href'])

Once you've installed BeautifulSoup, anyway.

Sign up to request clarification or add additional context in comments.

6 Comments

Part of learning regexes is learning when not to use them, this is a case where you shouldn't use them.
some pages are so badly formatted that even BeautifulSoup can't find the links in there. Then you have to resort to something.
Small improvement to the regexp: re.findall(r'href\s?=\s?[\'"]?([^\'" >]+)', show_notes), which allows a space before and/or after the equals sign.
Are you sure it is "match.group(0)" instead of "match.group(1)"?
Would it not make more sense, and is it not more correct, to write if match: as if match is not None: instead?
|
13

Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

Comments

13

this should work, although there might be more elegant ways.

import re
url='<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)

1 Comment

(?<=href=["']).*?(?=["']) takes care of single quoated href also
12

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.

Comments

3

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.

Comments

3

this regex can help you, you should get the first group by \1 or whatever method you have in your language.

href="([^"]*)

example:

<a href="http://www.amghezi.com">amgheziName</a>

result:

http://www.amghezi.com

Comments

2

There's tonnes of them on regexlib

Comments

1

Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.

Comments

1

This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/

(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)

Oputput:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

1 Comment

When entering this regular expression in a python program (not through the site you mentioned) it will give an error due to the usage of text quotation marks ' or ". To fix this the regex should be: regex='(?:href=[\'"])([:/.A-z?<_&\s=>0-9;-]+)' by adding a slant \ before the ' or the ".
-1

You can use this.

<a[^>]+href=["'](.*?)["']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.