Checking if certain words are on a web page using Python?

Question

I have a list of words. In Python, I need to loop through each word and check whether it appears on a website.

Currently, this is a snippet of what I have (relating to this problem):

words = ['word', 'word1', 'word2']
site = urllib.request.urlopen(link)
for word in words:
    if word in site:
        print(word)
    else:
        print(word, "not found")

I have a list of words, I open the site, and I loop through each word checking for the word in the site. Note that I am using a website with all those words found on it (I set it up myself and I can verify it works) and the link is the url of the website.

The problem is, I always go to "word not found", and it never seems to find the words on the website.

What's wrong with the code? It seems to be a semantic error, because the syntax is fine and no exceptions are thrown (in my final code I do have exception handling, but it would still report any exceptions that were thrown).

@larsmans What do you mean by urllib.request.urlopen being a blatant error? What's wrong with it?
– Bhaxy, Commented Nov 20, 2011 at 20:19
@larsmans: What's wrong with urllib.request.urlopen? Perhaps you're not familiar with Python 3's standard libraries?
– Greg Hewgill, Commented Nov 20, 2011 at 20:19
@GregHewgill, Bhaxy: excuse me, I misinterpreted my error messages. I'm indeed not up to speed with the Python 3 library yet.
– Fred Foo, Commented Nov 20, 2011 at 20:20

Greg Hewgill · Accepted Answer · 2011-11-20 20:19:23Z

7

The urlopen() function returns a "file-like object". In order to read the data, you must call read():

site = urllib.request.urlopen(link).read()

There are other ways to read the data too, but this is a simple way to load the whole page data into memory for quick searching.

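The reason your code ran without an error is that a file-like object is also iterable, which means it can be used with the in operator. But that wasn't doing what you wanted: it compares the word against each whole line of the response rather than searching for it as a substring.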
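To illustrate the difference, here is a small sketch that uses io.BytesIO as a stand-in for the object urlopen() returns. That substitution is an assumption made so the example runs without a network connection, and the page content is made up.

```python
import io

# Made-up page content. io.BytesIO stands in for the response object
# returned by urllib.request.urlopen(), which is also a binary file-like object.
page = b"<p>word</p>\n<p>word1 and word2</p>\n"

# `in` on a file-like object iterates over its lines and compares each
# whole line to the left operand, so a substring is never matched.
found_by_iteration = "word" in io.BytesIO(page)

# Reading first gives a bytes object, and `in` on bytes is a substring test.
data = io.BytesIO(page).read()
found_in_bytes = b"word1" in data

print(found_by_iteration)  # False
print(found_in_bytes)      # True
```

Note that the second test uses a bytes literal (b"word1"), because the data returned by read() is bytes rather than str.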

2 Comments

Bhaxy Over a year ago

Okay, so it seems to work now. I've done print(site) and I can see that it did download the site, but the words that I put on the site aren't in the download. I used my code, and I also copied and pasted the result into Notepad, and couldn't find the words there either. What's wrong?

Greg Hewgill Over a year ago

It's possible that the words you are looking for are not present in the HTML that is downloaded. Maybe they're added to the DOM later using JavaScript when the page is loaded in a browser. Without more information about the page you are loading and the words you're looking for, it's difficult to provide a more specific answer.

Oliver · Accepted Answer · 2014-05-08 17:00:53Z

2

It also helps if you decode the link's contents; otherwise they are read as bytes. I had a similar problem. Try:

temp = urllib.request.urlopen(link)
HTML = temp.read().decode("utf-8")

This decodes the page contents as UTF-8. The page may not be encoded in UTF-8, though; you can find out the encoding by requesting the site's headers.
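A minimal sketch of the decode-then-search approach follows. The helper name find_words and the sample bytes are hypothetical, UTF-8 is assumed as the encoding, and errors="replace" is a choice made here so that undecodable bytes do not raise an exception.

```python
def find_words(raw, words, encoding="utf-8"):
    """Return {word: True/False} for each word, searching the decoded page text."""
    # Decode the downloaded bytes into a str so that str words can be searched.
    text = raw.decode(encoding, errors="replace")
    return {word: word in text for word in words}


# Sample bytes standing in for urllib.request.urlopen(link).read()
raw = "<p>word and word1, plus café</p>".encode("utf-8")
results = find_words(raw, ["word", "word1", "word2", "café"])
for word, found in results.items():
    print(word, "found" if found else "not found")
```

With a real page, raw would come from urllib.request.urlopen(link).read(). Without the decode step, testing a str against bytes with the in operator raises a TypeError.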

edited May 8, 2014 at 17:00

answered Nov 20, 2011 at 20:39

3 Comments

Bhaxy Over a year ago

Thank you, this solves the question I asked in the comment to Greg Hewgill's answer.

Oliver Over a year ago

temp.getheader('Content-Type') should return information about the encoding. If that doesn't work, just try utf-8; it usually works on English-language sites.
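As a sketch of how the charset could be pulled out of a Content-Type value like the one getheader returns: the helper name charset_from_content_type is hypothetical, and it relies on the standard library's email.message.Message to parse the header, falling back to UTF-8 as suggested above.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type or "text/html"
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # utf-8
print(charset_from_content_type(None))                             # utf-8
```

The result can then be passed to decode, for example temp.read().decode(charset_from_content_type(temp.getheader('Content-Type'))). The response's headers object should also offer the same lookup directly as temp.headers.get_content_charset().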

Epcylon Over a year ago

Just to clear up a common mistake in this. You are not decoding the text "with Unicode", you are decoding the text using the UTF-8 character set, into a unicode string. Unicode is not the same as UTF-8 (or any other character set for that matter). Read joelonsoftware.com/articles/Unicode.html for more information about the topic.
