Crawling a web page with Python regular expressions

Question

Sorry, I am new to HTML, so please bear with me if my question is trivial.

I want to build a simple search engine using Python.

For that, I first need to build a crawler that collects linked URLs.

I want to use a regular expression to extract the linked URLs.

I have studied this, but I don't know the exact pattern for a link in HTML.

from urllib import urlopen
import re

webPage = urlopen('http://web.cs.dartmouth.edu/').read()
linkedPage = re.findall(r'what should be filled in here?', webPage)

Community · Accepted Answer · 2017-05-23 12:16:10Z

4

There are tools built specifically for parsing HTML, called HTML parsers.

For example, using BeautifulSoup:

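from urllib.request import urlopen  # Python 3; the original used urllib2 (Python 2)
from bs4 import BeautifulSoup

# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(urlopen('http://web.cs.dartmouth.edu/'), 'html.parser')
# Select each <article> inside <div class="view-content"> and print its text.
for article in soup.select('div.view-content article'):
    print(article.text)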

This prints the text of every article on the page:

Prof Sean Smith receives best paper of 2014 award
...
Lorenzo Torresani wins the Google Faculty Research Award
...
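
For the linked URLs the question asks about, the same parser-based approach might look roughly like the sketch below. It assumes Python 3 with the bs4 package installed; the helper name extract_links and the sample markup are hypothetical and only there for illustration. An inline HTML string stands in for a live download.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href="..."> links in html."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True skips <a> tags that have no href attribute (e.g. named anchors).
    for anchor in soup.find_all('a', href=True):
        # urljoin resolves relative links such as "/people" against base_url.
        links.append(urljoin(base_url, anchor['href']))
    return links


# Hypothetical sample markup, used here instead of fetching a real page.
sample = '''
<a href="/people">People</a>
<a href="http://example.com/news">News</a>
<a name="top">no href here</a>
'''
print(extract_links(sample, 'http://web.cs.dartmouth.edu/'))
```

For the sample above this prints ['http://web.cs.dartmouth.edu/people', 'http://example.com/news']. To crawl a real page, the html argument could be the result of urlopen(url).read(), with the same URL passed as base_url.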

Also see the reasons why using regex for parsing HTML should be avoided:

RegEx match open tags except XHTML self-contained tags
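
As a small illustration of the problem, the sketch below compares a naive pattern with Python's standard-library html.parser on some hypothetical markup. The LinkCollector class, the sample HTML, and the pattern are all assumptions made up for this example, not something taken from the linked question.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.links.append(value)


# Hypothetical markup with cases a simple pattern gets wrong.
html = '''
<a href="/one">one</a>
<A HREF='/two'>two</A>
<!-- <a href="/commented-out">old</a> -->
<a class="x"
   href="/three">three</a>
'''

# A naive pattern: lowercase tag, double quotes, href directly after "<a ".
naive = re.findall(r'<a href="([^"]*)"', html)

collector = LinkCollector()
collector.feed(html)

print(naive)            # ['/one', '/commented-out']
print(collector.links)  # ['/one', '/two', '/three']
```

The pattern misses the uppercase, single-quoted link and the one whose href is not the first attribute, and it picks up a link that is commented out. The parser handles all four cases. A more elaborate pattern could cover some of these, but each fix tends to expose another edge case.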

edited May 23, 2017 at 12:16


answered Aug 29, 2014 at 13:57

alecxe


1 Comment

SangminKim Over a year ago

So if I want to extract the linked URLs in the web page using BeautifulSoup, how can I do that?
