I'm working on a regular expression and was wondering how to extract a URL from an HTML page. I want to print the URL from this line:

Website is: http://www.somesite.com 

Every time that link is found, I want to extract just the URL that follows **Website is:**. Any help will be appreciated.

4 Answers

Will this suffice or do you need to be more specific?

In [230]: s = 'Website is: http://www.somesite.com '
In [231]: re.findall(r'Website is:\s+(\S+)', s)
Out[231]: ['http://www.somesite.com']

5 Comments

This is the better answer but has pitfalls if there are <a>/</a> tags around the url.
I must admit that if it were me, I'd use one of the URL-extraction RegExp recipes that turn up on Google. I did the simplest thing that would work.
Thanks. I tried this and it worked. Thanks to others for their massive input as well :)
@sotapme the problem is that HTML really isn't conducive to the use of regular expressions. There are libraries which will parse it for you like BeautifulSoup which would make handling this far less error-prone.
Whilst I agree in principle that grokking HTML with regexps is usually a bad idea, the OP was very specific about what the text looked like, so it really was just a blob of text. Granted, running re across HTML as a structured document would be a bad idea. If I were the OP, I might have been tempted to take the HTML and just grab text() from the document to eliminate any markup.
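
The "grab the text first" idea from the comments above could look roughly like this. It is only a sketch using the standard library's html.parser; the `TextExtractor` class and the `markup` sample are made up for illustration, and a real page may need a proper parser such as BeautifulSoup.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


# Made-up sample markup with the URL wrapped in an <a> tag.
markup = '<p>Website is: <a href="http://www.somesite.com">http://www.somesite.com</a></p>'

extractor = TextExtractor()
extractor.feed(markup)
extractor.close()
text = "".join(extractor.parts)

# With the tags gone, the simple pattern from the answer applies.
urls = re.findall(r"Website is:\s+(\S+)", text)
print(urls)
```

Because the tags are removed before matching, the closing `</a>` can no longer end up inside the captured URL.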

You could match each line to a regular expression with a capturing group, like so:

import re

for l in page:
    m = re.match("Website is: (.*)", l)  # match() needs the line to test
    if m:
        print(m.group(1))  # first capturing group: the link

This would both check if each line matched the pattern, and extract the link from it.

A few pitfalls:

  1. This assumes that the "Website is" expression is always at the start of the line. If it's not, you could use re.search.

  2. This assumes there is exactly one space between the colon and the website. If that's not true, you could change the expression to something like Website is:\s+(http.*).

The specifics will depend on the page you are trying to parse.
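
Both pitfalls can be handled in one pattern. A minimal sketch, assuming the page is already available as a list of lines (the `page` list below is made-up sample data) and that the URL ends at the first whitespace character:

```python
import re

# Hypothetical sample lines standing in for the real page content.
page = [
    "<p>Some unrelated text</p>",
    "  Website is:   http://www.somesite.com ",
    "Contact us. Website is: http://www.example.org/about",
]

# \s+ tolerates any amount of whitespace after the colon;
# \S+ stops the capture at the first whitespace character.
pattern = re.compile(r"Website is:\s+(\S+)")

urls = []
for line in page:
    m = pattern.search(line)  # search() matches anywhere in the line
    if m:
        urls.append(m.group(1))

print(urls)
```

Note that `\S+` will also swallow a tag that directly follows the URL without a space, so this only suits lines where the URL is followed by whitespace or the end of the line.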

Regex might be overkill for this since it's so simple.

def main():
    urls = []
    lines = prepare_file("<yourfile>.html")
    for line in lines:
        # keep any line that looks like it contains a link
        if "www" in line or "http://" in line:
            urls.append(line)
    return urls


def prepare_file(filename):
    with open(filename) as f:  # closes the file when done
        lines = f.readlines()  # splits on new lines
    lines = [line.strip() for line in lines]  # remove surrounding white space
    return [line for line in lines if line != '']  # remove empty elements

URLs are awkward to capture with a regex, according to what I've read.

Probably using the following regex pattern will be good for you:

pat = 'Website is: (%s)' % fireball

where fireball is a pattern to catch URLs that you'll find here:

daringfireball.net/2010/07/improved_regex_for_matching_urls
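
To show how the pieces fit together, here is a sketch in which `fireball` is a deliberately simplified stand-in, not the pattern from the linked article; substitute the real one from that page. The `sample` string is made up.

```python
import re

# Simplified stand-in, NOT the pattern from the linked article:
# a scheme or "www." followed by characters that are not whitespace,
# quotes, or angle brackets.
fireball = r"(?:https?://|www\.)[^\s<>\"']+"

pat = 'Website is: (%s)' % fireball

# Made-up sample where a closing tag directly follows the URL.
sample = 'Website is: http://www.somesite.com</a> and more'
m = re.search(pat, sample)
if m:
    print(m.group(1))
```

Excluding `<` and `>` from the URL characters is what keeps the trailing `</a>` out of the capture in this stand-in.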
