Sorry, I am new to HTML, so please bear with me if this question is trivial.

I want to build a simple search engine using Python.

For that, I first need to build a crawler that collects linked URLs.

I want to use a regular expression to extract the linked URLs.

I have studied this, but I don't know the exact pattern for a link in HTML.

from urllib import urlopen
import re

webPage = urlopen('http://web.cs.dartmouth.edu/').read()
linkedPage = re.findall(r'what should be filled in here?', webPage)

1 Answer

There are tools built specifically for parsing HTML, called HTML parsers.

For example, using BeautifulSoup:

from urllib2 import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://web.cs.dartmouth.edu/'))
for article in soup.select('div.view-content article'):
    print article.text

Prints all of the articles on the page:

Prof Sean Smith receives best paper of 2014 award
...
Lorenzo Torresani wins the Google Faculty Research Award
...
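
The same approach extends to the linked URLs the question asks about. Below is a minimal sketch, assuming Python 3 and the `bs4` package; the `extract_links` helper and the inline `sample_html` string are made up for illustration, so that the example runs without a network connection. It collects every `<a>` tag that has an `href` attribute and resolves relative paths against the page's base URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return an absolute URL for every <a> tag that has an href attribute."""
    # 'html.parser' is the parser bundled with Python, so nothing extra is needed.
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips anchors without an href (for example <a name="top">).
    for anchor in soup.find_all('a', href=True):
        # urljoin turns relative paths such as "/people" into full URLs.
        links.append(urljoin(base_url, anchor['href']))
    return links


# Hypothetical page content standing in for a downloaded page.
sample_html = '''
<html><body>
<a href="/people">People</a>
<a href='http://example.com/news'>News</a>
<a name="top">no href here</a>
</body></html>
'''

links = extract_links(sample_html, 'http://web.cs.dartmouth.edu/')
print(links)
```

For a real page, the `html` argument would be the bytes or text returned by reading the URL, with the same URL passed as `base_url`.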

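Also, regex should generally be avoided for parsing HTML: HTML is not a regular language, and real pages vary in attribute order, quoting, and letter case in ways a single pattern rarely covers.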
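
A small illustration of that concern, as a sketch rather than a definitive comparison. It assumes Python 3 and uses only the standard library; the `NAIVE_HREF` pattern, the `LinkCollector` class, and the sample markup are hypothetical and chosen to show typical failure cases. The naive pattern misses a link with an extra attribute and single quotes, misses an upper-case tag, and picks up a link that is commented out, while the parser handles all three.

```python
import re
from html.parser import HTMLParser

# A naive pattern of the kind one might try first (hypothetical, for illustration).
NAIVE_HREF = re.compile(r'<a href="(.*?)"')


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags using the stdlib HTML parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


sample_html = (
    '<a href="/a.html">plain</a>\n'
    "<a class='nav' href='/b.html'>single quotes, extra attribute</a>\n"
    '<A HREF="/c.html">upper case</A>\n'
    '<!-- <a href="/old.html">commented out</a> -->\n'
)

regex_links = NAIVE_HREF.findall(sample_html)

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
parser_links = collector.links

print('regex :', regex_links)
print('parser:', parser_links)
```

The pattern could be extended to cover each of these cases, but every fix makes it longer and still leaves other variations unhandled, which is the usual argument for starting with a parser.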

1 Comment

So if I want to extract the linked URLs from the web page using BeautifulSoup, how do I do that?
