Python Regular Expression for Extracting URL

Question

I'm working on a regular expression and was wondering how to extract a URL from an HTML page. I want to print out the URL from this line:

Website is: http://www.somesite.com

Every time that link is found, I want to extract just the URL that comes after **Website is:**. Any help will be appreciated.

sotapme · Accepted Answer · 2013-02-18 16:39:57Z

2

Will this suffice or do you need to be more specific?

In [229]: import re
In [230]: s = 'Website is: http://www.somesite.com '
In [231]: re.findall(r'Website is:\s+(\S+)', s)
Out[231]: ['http://www.somesite.com']


5 Comments

Ian Stapleton Cordasco Over a year ago

This is the better answer but has pitfalls if there are <a>/</a> tags around the url.
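
To illustrate the pitfall mentioned in this comment, here is a minimal sketch. The sample string and the tighter character class are assumptions made for illustration; they do not come from the answer, and the tighter pattern is not a complete URL grammar.

```python
import re

# Hypothetical sample: the same text, but wrapped in a paragraph tag.
html_line = '<p>Website is: http://www.somesite.com</p>'

# The answer's pattern captures everything up to the next whitespace,
# so the closing tag ends up inside the match.
loose = re.findall(r'Website is:\s+(\S+)', html_line)

# One possible tightening (an assumption): require an http/https prefix
# and stop at whitespace, angle brackets, or double quotes.
tight = re.findall(r'Website is:\s+(https?://[^\s<>"]+)', html_line)

print(loose)
print(tight)
```

If the URL itself sits inside an `<a>` element, neither pattern sees plain text after the colon, which is where stripping the markup first becomes attractive.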

sotapme Over a year ago

I must admit that if it were me, I'd be using one of those URL-extraction regex recipes that you can find on Google. I did the simplest thing that would work.

Helen Neely Over a year ago

Thanks. I tried this and it worked. Thanks to others for their massive input as well :)

Ian Stapleton Cordasco Over a year ago

@sotapme the problem is that HTML really isn't conducive to the use of regular expressions. There are libraries which will parse it for you like BeautifulSoup which would make handling this far less error-prone.

sotapme Over a year ago

Whilst I agree in principle that grokking HTML using regexps is usually a bad idea, the OP was very specific about what the text looked like, so it really was just a blob of text. Granted, if it were running re across HTML as a structured document, it would be a bad idea. If I were the OP, I might have been tempted to take the HTML and just grab text() from the document to eliminate any markup.
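
The idea of grabbing only the text and then applying the regex could be sketched roughly as follows. This uses the standard library's `html.parser` rather than BeautifulSoup, purely so the example has no third-party dependency; the `TextExtractor` class and the sample fragment are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of a document, discarding tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


# Hypothetical page fragment with the URL wrapped in an anchor tag.
html = '<p>Website is: <a href="http://www.somesite.com">http://www.somesite.com</a></p>'

parser = TextExtractor()
parser.feed(html)
parser.close()
text = ''.join(parser.chunks)

# With the markup gone, the simple pattern from the accepted answer applies.
urls = re.findall(r'Website is:\s+(\S+)', text)
print(urls)
```

This only helps when the visible link text is itself the URL; if the address lives solely in the `href` attribute, the attribute would have to be read instead.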

David Robinson · Answer · 2013-02-18 16:40:09Z

0

You could match each line to a regular expression with a capturing group, like so:

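import re

for line in page:  # page: any iterable of lines, e.g. an open file
    m = re.match(r"Website is: (.*)", line)  # match() needs the string to test
    if m:
        print(m.group(1))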

This would both check if each line matched the pattern, and extract the link from it.

A few pitfalls:

This assumes that the "Website is" expression is always at the start of the line. If it's not, you could use re.search.
This assumes there is exactly one space between the colon and the website. If that's not true, you could change the expression to something like Website is:\s+(http.*).

The specifics will depend on the page you are trying to parse.
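
Both pitfalls can be addressed together, as in the sketch below. The `page` list is made-up sample data, and the capture uses `http\S*` instead of `http.*` so that it stops at the first whitespace; that substitution is an assumption about what is wanted, not part of the answer.

```python
import re

# Hypothetical page content; in practice this might be an open file.
page = [
    "<html>",
    "  Website is:   http://www.somesite.com",
    "Some other line",
    "Contact page. Website is: http://www.example.org",
]

# \s+ tolerates any amount of whitespace after the colon.
pattern = re.compile(r"Website is:\s+(http\S*)")

found = []
for line in page:
    # search() scans the whole line; match() only tests the start of it.
    m = pattern.search(line)
    if m:
        found.append(m.group(1))

print(found)
```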


The Internet · Answer · 2013-02-18 16:42:40Z

0

Regex might be overkill for this since it's so simple.

def main():
    urls = []
    lines = prepare_file("<yourfile>.html")
    for line in lines:
        # keep every line that looks like it contains a URL
        if "www" in line or "http://" in line:
            urls.append(line)
    return urls


def prepare_file(filename):
    with open(filename) as f:  # the file is closed automatically
        # strip surrounding whitespace and drop empty lines
        return [line.strip() for line in f if line.strip()]
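
Note that the code above collects whole lines rather than just the URL. Staying with the no-regex approach, one way to pull out only the part after "Website is:" might look like this; `extract_url` and its `marker` parameter are hypothetical names introduced for illustration.

```python
def extract_url(line, marker="Website is:"):
    """Return the first whitespace-delimited token after marker, or None."""
    before, sep, after = line.partition(marker)
    if not sep:
        # The marker does not occur on this line.
        return None
    tokens = after.split()
    return tokens[0] if tokens else None


print(extract_url("Website is: http://www.somesite.com "))
```

Like the `\S+` regex, this treats anything up to the next whitespace as the URL, so it shares the same weakness when markup surrounds the address.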


eyquem · Answer · 2013-02-18 16:50:43Z

0

URLs are awkward to capture with a regex, according to what I've read.

The following regex pattern will probably work for you:

pat = 'Website is: (%s)' % fireball

where fireball is a pattern to catch URLs that you'll find here:

daringfireball.net/2010/07/improved_regex_for_matching_urls
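
A sketch of how the composition works is shown below. The `url_subpattern` here is a deliberately simple stand-in chosen for illustration; it is not the pattern from the linked article, which is considerably more thorough and would be substituted in its place.

```python
import re

# Stand-in URL sub-pattern (an assumption for illustration only);
# replace it with the pattern from the linked article.
url_subpattern = r'https?://[^\s<>"]+'

# The outer parentheses open first, so they form group 1 even if the
# sub-pattern contains capturing groups of its own.
pat = r'Website is: (%s)' % url_subpattern

match = re.search(pat, 'Website is: http://www.somesite.com and more text')
print(match.group(1))
```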
