Write a Python regex to match multiple URLs in an HTML source page using BeautifulSoup

Question

I am working on web scraping with BeautifulSoup and trying to get the links in an HTML page that match a given list of URLs.

Suppose I want to get the Facebook and Twitter links in a page. I tried:

urls_list = ['www.facebook.com','www.apps.facebook.com', 'www.twitter.com']
reg = re.compile(i for i in urls_list)
print soup('a',{'href':reg})

and

soup = BeautifulSoup(html_source)
reg = re.compile(r"(http|https)://(www.[apps.]facebook|twitter).com/\w+")
print soup('a',{'href':reg})

The code above is not working: it retrieves all URLs in the page. Please bear with my limited knowledge of regex and Python.

Martijn Pieters · Accepted Answer · 2014-01-22 14:42:39Z


You need to produce a valid regular expression:

reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")

Quick demo:

>>> reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")
>>> reg.search('https://www.apps.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe3918>
>>> reg.search('http://www.twitter.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.twitters.com/')
>>> reg.search('http://www.twitter.com/')
>>> reg.search('http://twitter.com/hello')

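The syntax [...] creates a character class, which matches any single character listed inside it; [apps.] is the same as [aps.] in that it matches exactly one a, p, s, or literal . character. Outside of character classes, an unescaped . matches any character (except a newline by default), so a literal dot has to be written as \. and an optional piece of text as a group such as (apps\.)?.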
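
To make the difference concrete, and to connect it back to the list of hosts in the question, here is a minimal sketch. The helper name `build_link_pattern` and the sample strings are illustrative assumptions, not part of the original answer. The helper joins the escaped host names with `|` instead of passing a generator to `re.compile`, which expects a pattern string.

```python
import re

# A character class matches exactly ONE character out of the listed set.
one_char = re.compile(r"^[apps.]$")
# A group followed by ? matches the whole literal text "apps." or nothing.
optional_group = re.compile(r"^(apps\.)?$")

class_matches_p = bool(one_char.search("p"))               # single character from the set
class_matches_apps = bool(one_char.search("apps."))        # five characters, so no match
group_matches_apps = bool(optional_group.search("apps."))  # the literal text "apps."


def build_link_pattern(hosts):
    """Hypothetical helper: compile one pattern from a list of host names."""
    # re.escape turns each "." into a literal dot; "|" means "any one of these".
    alternatives = "|".join(re.escape(host) for host in hosts)
    return re.compile(r"^https?://(?:%s)/[\w-]+" % alternatives)


urls_list = ["www.facebook.com", "www.apps.facebook.com", "www.twitter.com"]
reg = build_link_pattern(urls_list)

for url in [
    "https://www.apps.facebook.com/hello_world",
    "http://www.twitters.com/",
    "http://twitter.com/hello",
]:
    # Prints True for the first URL and False for the other two.
    print(url, bool(reg.search(url)))
```

The compiled `reg` can then be used as the href filter in the same way as in the question, `soup('a', {'href': reg})`.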

edited Jan 22, 2014 at 14:42

answered Jan 22, 2014 at 14:09


9 Comments

georg Over a year ago

(http|https) == https?

Martijn Pieters Over a year ago

@thg435: I missed that one because I was fixing the more glaring errors. :-)

user2695817 Over a year ago

Thanks to you both. Can you help me a little more? I need to accept that last string as well, but only for one domain and not the other, i.e. 'twitter.com/hello' but not 'facebook.com/hello'.

Martijn Pieters Over a year ago

@user2695817: You mean like r'^https?://www\.(apps\.)?(facebook\.com/|twitter\.com/[\w-]+)$'?

Martijn Pieters Over a year ago

@user2695817: That matches http://www.facebook.com/ but not http://www.facebook.com/hello.
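
For reference, a short sketch of how the pattern from the comment above behaves. The pattern is quoted as written in the comment; the sample URLs and the `results` dictionary are assumptions added for illustration.

```python
import re

# Pattern quoted from the comment; the trailing $ anchors the end of the URL.
reg = re.compile(r"^https?://www\.(apps\.)?(facebook\.com/|twitter\.com/[\w-]+)$")

# Sample URLs are illustrative only.
samples = [
    "http://www.facebook.com/",
    "http://www.facebook.com/hello",
    "http://www.twitter.com/hello",
    "http://www.twitter.com/",
    "http://twitter.com/hello",
]

# Map each URL to whether the pattern finds a match.
results = {url: bool(reg.search(url)) for url in samples}

for url, matched in results.items():
    print(url, matched)
```

Only `http://www.facebook.com/` and `http://www.twitter.com/hello` match here. The pattern still requires the `www.` prefix, so `http://twitter.com/hello` is not accepted.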
