What am I doing wrong?

from urllib import request

def get_page(page):
    page = request.urlopen(page).read()
    return page

def get_next_target(page):
    start_link = page.find("<a href=")
    if(start_link == -1):
        return None
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote+1)
    url = page[start_quote+1:end_quote]
    print(url)
    return(url,end_quote)

def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break

page = get_page('https://xkcd.com/')
print(page)
get_next_target(page)
#print_all_links(page)

The error is

Traceback (most recent call last):
  File "./xkcdscrape.py", line 29, in <module>
    get_next_target(page)
  File "./xkcdscrape.py", line 8, in get_next_target
    start_link = page.find("<a href=")
TypeError: a bytes-like object is required, not 'str'

1 Answer

read() returns bytes, not str, so page.find("<a href=") raises a TypeError when it is given a str argument. In your get_page function, call decode() to convert the bytes to a string:

def get_page(page):
    page = request.urlopen(page).read()
    return page.decode('utf-8')
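
Once the decoding is fixed, print_all_links would likely hit a second problem: get_next_target returns a bare None when no link is left, but the caller unpacks two values. Below is a minimal sketch of one way to handle both issues. It runs on a hard-coded sample instead of the live page; the sample bytes and the all_links name are made up for illustration.

```python
def get_next_target(page):
    """Return (url, end_quote) for the first link, or (None, 0) if none is left."""
    start_link = page.find("<a href=")
    if start_link == -1:
        # Return a pair so the caller can always unpack two values.
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def all_links(page):
    """Collect every double-quoted href value found by get_next_target."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url is None:
            break
        links.append(url)
        page = page[endpos:]
    return links


# Hypothetical stand-in for what request.urlopen(...).read() returns: bytes.
raw = b'<p><a href="/archive">Archive</a> <a href="https://xkcd.com/about">About</a></p>'

try:
    raw.find("<a href=")  # searching bytes with a str pattern
except TypeError as exc:
    print("bytes.find(str) fails:", exc)

html = raw.decode("utf-8")  # assumes the page is UTF-8 encoded
print(all_links(html))
```

The decode step is the actual fix for the reported error; the (None, 0) return value is just one option for signalling that no more links were found.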

The Python documentation covers using urllib to fetch internet resources in more detail. However, the requests library provides a simpler interface for such tasks.

It's also simpler to do web scraping using a library like Beautiful Soup.
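
To illustrate why a real HTML parser is easier than manual string searching, here is a rough sketch using the standard library's html.parser module. This is a stand-in chosen so the example has no third-party dependency; it is not Beautiful Soup itself, and the LinkCollector and extract_links names and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup with cases the find()-based approach would miss:
# an attribute before href, single quotes, and uppercase tag names.
sample = (
    "<p><a class=\"nav\" href='/archive'>Archive</a> "
    '<A HREF="https://xkcd.com/about">About</A></p>'
)
print(extract_links(sample))
```

A parser copes with attribute order, quoting style, and letter case, which is the main reason to prefer one over searching for the literal text <a href=.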
