I am working on web scraping with BeautifulSoup and trying to get the links in an HTML page that match a given list of URLs.

Suppose I want to get the Facebook and Twitter links in a page. I tried:

urls_list = ['www.facebook.com','www.apps.facebook.com', 'www.twitter.com']
reg = re.compile(i for i in urls_list)
print soup('a',{'href':reg})

and

soup = BeautifulSoup(html_source)
reg = re.compile(r"(http|https)://(www.[apps.]facebook|twitter).com/\w+")
print soup('a',{'href':reg})

The code above is not working; it retrieves all URLs in the page. Please bear with my limited knowledge of regex and Python.

1 Answer

You need to produce a valid regular expression:

reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")

Quick demo:

>>> reg = re.compile(r"^https?://www\.(apps\.)?(facebook|twitter)\.com/[\w-]+")
>>> reg.search('https://www.apps.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.facebook.com/hello_world')
<_sre.SRE_Match object at 0x105fe3918>
>>> reg.search('http://www.twitter.com/hello_world')
<_sre.SRE_Match object at 0x105fe39b0>
>>> reg.search('http://www.twitters.com/')
>>> reg.search('http://www.twitter.com/')
>>> reg.search('http://twitter.com/hello')
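The first attempt in the question passes a generator to re.compile, which expects a pattern string. If you would rather keep the hosts in a list, here is a minimal sketch of one way to build a single pattern from it. The helper name build_host_pattern is hypothetical, and the trailing /[\w-]+ part is an assumption carried over from the pattern above:

```python
import re

# Host names taken from the question.
urls_list = ['www.facebook.com', 'www.apps.facebook.com', 'www.twitter.com']


def build_host_pattern(hosts):
    """Compile one regex that accepts any of the given hosts (hypothetical helper)."""
    # re.escape turns each "." into a literal dot; "|" joins the hosts as alternatives.
    alternatives = "|".join(re.escape(host) for host in hosts)
    # (?:...) groups the alternatives without capturing them.
    return re.compile(r"^https?://(?:%s)/[\w-]+" % alternatives)


reg = build_host_pattern(urls_list)
```

The compiled reg should then be usable as the href filter in the same way as in the question, i.e. soup('a', {'href': reg}).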

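The syntax [...] creates a character class, which matches any single character listed inside it; [apps.] is the same as [aps.] in that it'll match either an a, a p, an s or a literal . (dot). Outside of character classes, an unescaped . matches any character (except a newline by default), so a literal dot has to be written as \. instead.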
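To make the difference concrete, here is a small sketch comparing a character class, an optional group and an unescaped dot. The variable names and test strings are made up for illustration:

```python
import re

# A character class matches exactly ONE character out of the listed set.
char_class = re.compile(r"^[apps.]$")
one_char = char_class.search("p")            # a match: "p" is in the set
whole_word = char_class.search("apps.")      # None: five characters, not one

# An optional group matches the whole literal text "apps." or nothing at all.
optional_group = re.compile(r"^https?://www\.(apps\.)?facebook\.com/")
with_apps = optional_group.search("http://www.apps.facebook.com/")
without_apps = optional_group.search("http://www.facebook.com/")

# An unescaped dot accepts any character, an escaped one only a literal dot.
unescaped = re.compile(r"^www.facebook.com$")
escaped = re.compile(r"^www\.facebook\.com$")
loose = unescaped.search("wwwXfacebookYcom")   # a match, probably unintended
strict = escaped.search("wwwXfacebookYcom")    # None
```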

9 Comments

(http|https) == https?
@thg435: I missed that one because I was fixing the more glaring errors. :-)
Thanks to you both. Can you help me a little more? I need to accept that last string too, but only for one domain and not the other, i.e. 'twitter.com/hello' but not 'facebook.com/hello'.
@user2695817: You mean like r'^https?://www\.(apps\.)?(facebook\.com/|twitter\.com/[\w-]+)$'?
@user2695817: That matches http://www.facebook.com/ but not http://www.facebook.com/hello.