Regular expression to extract URL from an HTML link

Question

I’m a newbie in Python. I’m learning regexes, but I need help here.

Here is the HTML source:

<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>

I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?

Duplicate: stackoverflow.com/questions/430966/regex-for-links-in-html-text
– S.Lott, Commented Jan 31, 2009 at 22:04
I've been away from SO for a while, it's good to see I've missed nothing, and people are STILL asking how to parse HTML with regex every damn day.
– bobince, Commented Feb 1, 2009 at 2:30
@bobince Multiple times a day, it is so bad I created two questions that I can redirect people to and a form answer that points them there.
– Chas. Owens, Commented May 13, 2009 at 14:30

Community · Accepted Answer · 2020-06-20 09:12:55Z

86

If you're only looking for one:

import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))

If you have a long string, and want every instance of the pattern in it:

import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))

Where s is the string that you're looking for matches in.
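
For a concrete run, here is a minimal sketch that applies the same pattern to the link from the question and to an unquoted variant. The unquoted sample string and the names HREF_PATTERN and extract_hrefs are made up for illustration.

```python
import re

# Same pattern as above: an optional opening quote, then capture
# everything up to the next quote, space, or '>'.
HREF_PATTERN = re.compile(r'href=[\'"]?([^\'" >]+)')

def extract_hrefs(html):
    """Return every href value the pattern finds (hypothetical helper)."""
    return HREF_PATTERN.findall(html)

quoted = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
unquoted = '<a href=http://www.ptop.se>link</a>'  # assumed sample, no quotes

print(extract_hrefs(quoted))
print(extract_hrefs(unquoted))
```

Both calls should give a one-element list containing http://www.ptop.se, since the capture stops at the closing quote in the first case and at the > in the second.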

Quick explanation of the regexp bits:

r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)

"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.

Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."

"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.

The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.

BeautifulSoup itself is pretty easy to use:

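# BeautifulSoup 4 is imported from bs4; "from BeautifulSoup import ..." was version 3.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_to_parse, 'html.parser')
# href=True keeps only the <a> tags that actually carry an href attribute.
for tag in soup.find_all('a', href=True):
    print(tag['href'])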

Once you've installed BeautifulSoup, anyway.

edited Jun 20, 2020 at 9:12

answered Jan 31, 2009 at 19:17

David

6 Comments

Chas. Owens Over a year ago

Part of learning regexes is learning when not to use them, this is a case where you shouldn't use them.

Petter H Over a year ago

some pages are so badly formatted that even BeautifulSoup can't find the links in there. Then you have to resort to something.

Leon Overweel Over a year ago

Small improvement to the regexp: re.findall(r'href\s?=\s?[\'"]?([^\'" >]+)', show_notes), which allows a space before and/or after the equals sign.

pah8J Over a year ago

Are you sure it is "match.group(0)" instead of "match.group(1)"?

blizz Over a year ago

Would it not make more sense, and is it not more correct, to write if match: as if match is not None: instead?
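
A small sketch of the two points raised in these comments, assuming a made-up sample link with spaces around the equals sign. It compares the original pattern with the whitespace-tolerant one and uses a plain truthiness check on the match object; match objects are always truthy, so `if match:` behaves the same as `if match is not None:`.

```python
import re

strict = re.compile(r'href=[\'"]?([^\'" >]+)')
relaxed = re.compile(r'href\s?=\s?[\'"]?([^\'" >]+)')

spaced = '<a href = "http://www.ptop.se">link</a>'  # assumed sample input

strict_urls = strict.findall(spaced)    # "href =" is not "href=", so nothing matches
relaxed_urls = relaxed.findall(spaced)  # one optional space on each side is allowed
print(strict_urls)
print(relaxed_urls)

match = relaxed.search(spaced)
if match:  # equivalent to "is not None" for match objects
    print(match.group(1))
```

Note that `\s?` only tolerates a single whitespace character on each side; `\s*` would be the variant to try if the markup may contain more.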

JosefAssad · Accepted Answer · 2009-01-31 19:13:16Z

13

Don't use regexes; use BeautifulSoup. That, or be so crufty as to shell out to, say, w3m/lynx and pull back in what w3m/lynx renders. The first is probably more elegant; the second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

answered Jan 31, 2009 at 19:13

JosefAssad

jannis · Accepted Answer · 2009-01-31 19:16:03Z

13

This should work, although there might be more elegant ways.

import re
url = '<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>'
# The lookbehind and lookahead keep href=" and the closing quote out of the match.
r = re.compile(r'(?<=href=").*?(?=")')
print(r.findall(url))

answered Jan 31, 2009 at 19:16

jannis

1 Comment

Neil Over a year ago

(?<=href=["']).*?(?=["']) takes care of single-quoted href values as well
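
A short sketch of how the lookaround variant behaves, using two made-up sample links. Because either quote character is accepted on both sides, the match can end at a quote that differs from the opening one; the `same_quote` pattern below is an assumed alternative (not from the answer) that uses a backreference to require the same closing quote.

```python
import re

# Lookaround version from the comment above.
lookaround = re.compile(r'''(?<=href=["']).*?(?=["'])''')

# Assumed alternative: remember the opening quote in group 1 and
# require the same character (\1) to close the value.
same_quote = re.compile(r'''href=(["'])(.*?)\1''')

single = "<a href='http://www.ptop.se'>link</a>"
tricky = '<a href="/it\'s-here">link</a>'  # apostrophe inside a double-quoted value

print(lookaround.findall(single))
print(lookaround.findall(tricky))
print([m.group(2) for m in same_quote.finditer(tricky)])
```

On the second sample the lookaround pattern stops at the apostrophe and returns only /it, while the backreference version returns the full /it's-here.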

Paul D. Waite · Accepted Answer · 2009-11-27 23:37:54Z

12

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:

http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.

answered Nov 27, 2009 at 23:37

Paul D. Waite

Community · Accepted Answer · 2017-05-23 12:09:39Z

3

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.

In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
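
As one illustration of the parser route, here is a minimal sketch using the standard library's html.parser module, so nothing needs to be installed. The class name HrefCollector is made up for this example, and the input is the link from the question.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<a href="http://www.ptop.se" target="_blank">http://www.ptop.se</a>')
print(collector.hrefs)
```

The parser handles quoting and attribute order itself, which is the part the regex answers have to approximate.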

edited May 23, 2017 at 12:09

answered May 13, 2009 at 14:38

Chas. Owens

Hamedz · Accepted Answer · 2017-03-08 22:39:44Z

3

This regex can help you. Retrieve the first capture group with \1 or whatever method your language provides.

href="([^"]*)

example:

<a href="http://www.amghezi.com">amgheziName</a>

result:

http://www.amghezi.com
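
In Python, for instance, the first group is read with group(1) on the match object. A minimal sketch using the example link above:

```python
import re

html = '<a href="http://www.amghezi.com">amgheziName</a>'

# The parentheses capture everything after href=" up to the next double quote.
match = re.search(r'href="([^"]*)', html)
url = match.group(1) if match else None
print(url)
```

This variant only handles double-quoted attributes; a single-quoted or unquoted href leaves url as None.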

answered Mar 8, 2017 at 22:39

Hamedz

Chris S · Accepted Answer · 2009-01-31 19:34:19Z

2

There's tonnes of them on regexlib

answered Jan 31, 2009 at 19:34

Chris S

Jarek · Accepted Answer · 2009-05-13 14:22:29Z

1

Yes, there are tons of them on regexlib. That only proves that REs should not be used to do that. Use SGMLParser or BeautifulSoup, or write a parser, but don't use REs. The ones that seem to work are extremely complicated and still don't cover all cases.

answered May 13, 2009 at 14:22

Jarek

Community · Accepted Answer · 2020-06-20 09:12:55Z

1

This works pretty well: a non-capturing group matches href= and the opening quote, and the capturing group returns only the link. Tested on http://pythex.org/

(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)

Output:

Match 1. /wiki/Main_Page

Match 2. /wiki/Portal:Contents

Match 3. /wiki/Portal:Featured_content

Match 4. /wiki/Portal:Current_events

Match 5. /wiki/Special:Random

Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

edited Jun 20, 2020 at 9:12

answered May 20, 2016 at 6:07

Rohit Malgaonkar

1 Comment

Mohammad ElNesr Over a year ago

When you enter this regular expression in a Python program (rather than on the site you mentioned), it raises an error because of the quotation marks ' and ". To fix this, the regex should be: regex='(?:href=[\'"])([:/.A-z?<_&\s=>0-9;-]+)', with a backslash \ added before the quote character that would otherwise end the string.
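
Another way around the quoting problem is a raw triple-quoted string, which needs no backslash before either quote character. A minimal sketch, assuming a made-up input shaped like the Wikipedia sidebar links in the output above:

```python
import re

# Raw triple-quoted string: both ' and " can appear unescaped, and \s
# reaches the regex engine untouched.
pattern = re.compile(r'''(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)''')

html = (  # assumed sample, not the page the answer was tested on
    '<a href="/wiki/Main_Page" title="Main page">Main page</a>'
    '<a href="/wiki/Portal:Contents">Contents</a>'
)
links = pattern.findall(html)
print(links)
```

One thing to be aware of: the range A-z in the character class also spans the ASCII punctuation that sits between Z and a (such as [ and ^), so A-Za-z may be closer to what was intended.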

arjan · Accepted Answer · 2018-04-24 07:50:23Z

-1

You can use this.

<a[^>]+href=["'](.*?)["']
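
A quick sketch of what the leading `<a[^>]+` adds, using a made-up snippet that contains both a link element and an anchor. Only href values inside tags that start with `<a` are captured.

```python
import re

anchor_href = re.compile(r'''<a[^>]+href=["'](.*?)["']''')

html = (  # assumed sample input
    '<link href="style.css" rel="stylesheet">'
    '<a class="nav" href="/about">About</a>'
)
anchor_links = anchor_href.findall(html)
print(anchor_links)
```

The stylesheet href is skipped, but any other tag whose name begins with "a" (area or abbr, for example) would still match, since the pattern does not require whitespace right after `<a`.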

answered Apr 24, 2018 at 7:50

arjan
