I have a list of words. In Python, I need to loop through each word and check whether it appears on a website.

Currently, this is a snippet of what I have (relating to this problem):

words = ['word', 'word1', 'word2']
site = urllib.request.urlopen(link)
for word in words:
    if word in site:
        print(word)
    else:
        print(word, "not found")

I have a list of words, I open the site, and I loop through each word, checking whether it appears in the site. Note that I am testing against a website that contains all of those words (I set it up myself and can verify it works), and link is the URL of that website.

The problem is that I always end up in the "not found" branch; it never seems to find the words on the website.

What's wrong with the code? It seems to be a semantic error, because the syntax is fine and no exceptions are thrown (my final code does have exception handling, but it still reports any exceptions that are thrown).

  • @larsmans What do you mean by urllib.request.urlopen being a blatant error? What's wrong with it? Commented Nov 20, 2011 at 20:19
  • @larsmans: What's wrong with urllib.request.urlopen? Perhaps you're not familiar with Python 3's standard libraries? Commented Nov 20, 2011 at 20:19
  • @GregHewgill, Bhaxy: excuse me, I misinterpreted my error messages. I'm indeed not up to speed with the Python 3 library yet. Commented Nov 20, 2011 at 20:20

2 Answers

The urlopen() function returns a "file-like object". In order to read the data, you must call read():

site = urllib.request.urlopen(link).read()

There are other ways to read the data too, but this is a simple way to load the whole page data into memory for quick searching.

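The reason your code ran without raising an exception is that a file-like object is also iterable, which means it can be used with the in operator. Iterating yields the page one line at a time as bytes, so in compares each whole line against your word instead of searching inside the text. That wasn't what you wanted.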
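
To see this without touching the network, here is a minimal sketch that uses io.BytesIO as a stand-in for the object returned by urlopen(). The stand-in and the sample page bytes are assumptions for illustration only; the point is how the in operator behaves on a binary file-like object compared with the bytes returned by read().

```python
import io

# io.BytesIO stands in for the response returned by urlopen(); both are
# binary file-like objects that yield one line of bytes per iteration.
page = io.BytesIO(b"<p>word</p>\nword1\n")

# `in` walks the lines and tests each one for equality with the left operand.
# A str never compares equal to a bytes line, so this is always False.
found_as_str = "word" in page

page.seek(0)
# Only an exact match of a complete line (including the newline) succeeds.
found_whole_line = b"word1\n" in page

page.seek(0)
# After read(), `in` does a real substring search over the whole page.
found_substring = b"word" in page.read()

print(found_as_str, found_whole_line, found_substring)
```

The first check mirrors the code in the question, which is why every word came back as not found even though no exception was raised.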

2 Comments

Okay, so it seems to work now: I ran print(site) and can see that it did download the site, but the words that I put on the site are not in the download. I ran my code, and I also copied and pasted the output into Notepad, and couldn't find the words there either. What's wrong?
It's possible that the words you are looking for are not present in the HTML that is downloaded. Maybe they're added to the DOM later using JavaScript when the page is loaded in a browser. Without more information about the page you are loading and the words you're looking for, it's difficult to provide a more specific answer.

It also helps to decode the page's contents; otherwise they are read as bytes. I had a similar problem. Try:

temp = urllib.request.urlopen(link)
HTML = temp.read().decode("utf-8")

This will decode the page using Unicode. The page may not be encoded with Unicode, though; you can find out the encoding by requesting the site's headers.
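
Putting the two steps together, a rough sketch might look like the following. The find_words helper, the sample headers, and the sample body are all hypothetical and stand in for a real response so that the example runs offline. The fallback to UTF-8 when no charset is declared is also an assumption, not something the site guarantees.

```python
from email.message import Message


def find_words(body, headers, words):
    """Decode the body with the declared charset and report which words occur."""
    # get_content_charset() reads the charset parameter of Content-Type;
    # falling back to UTF-8 is an assumption for pages that declare none.
    charset = headers.get_content_charset() or "utf-8"
    text = body.decode(charset, errors="replace")
    return {word: word in text for word in words}


# Stand-in for a real response: a header block plus raw page bytes.
headers = Message()
headers["Content-Type"] = "text/html; charset=iso-8859-1"
body = "<p>café word1</p>".encode("iso-8859-1")

result = find_words(body, headers, ["word1", "café", "word2"])
print(result)
```

With a real page, the same helper should work on the object from urllib.request.urlopen(link) by passing its read() result as the body and its headers attribute as the headers.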

3 Comments

Thank you, this solves the question I asked in the comment to Greg Hewgill's answer.
temp.getheader('Content-Type') should return information about the encoding. If that doesn't work, just try UTF-8; that usually works on English-language sites.
Just to clear up a common mistake here: you are not decoding the text "with Unicode"; you are decoding the text using the UTF-8 character set into a Unicode string. Unicode is not the same as UTF-8 (or any other character set, for that matter). Read joelonsoftware.com/articles/Unicode.html for more information about the topic.
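
A small illustration of that distinction, with a sample string chosen purely for demonstration: the same Unicode string becomes different bytes under different encodings, and each byte sequence only decodes back correctly with the encoding that produced it.

```python
text = "café"  # a Unicode string (str in Python 3)

# The same text, serialized with two different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the matching encoding recovers the original string.
roundtrip_ok = (
    utf8_bytes.decode("utf-8") == text
    and latin1_bytes.decode("latin-1") == text
)
print(roundtrip_ok)
```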
