I'm trying to learn how to use asyncio to build an asynchronous web crawler. The following is a crude crawler to test out the framework:

import asyncio, aiohttp
from bs4 import BeautifulSoup

@asyncio.coroutine
def fetch(url):
    with (yield from sem):
        print(url)
        response = yield from aiohttp.request('GET',url)
        response = yield from response.read_and_close()
    return response.decode('utf-8')

@asyncio.coroutine
def get_links(url):
    page = yield from fetch(url)
    soup = BeautifulSoup(page)
    links = soup.find_all('a',href=True)
    return [link['href'] for link in links if link['href'].find('www') != -1]

@asyncio.coroutine
def crawler(seed, depth, max_depth=3):
    while True:
        if depth > max_depth:
            break
        links = yield from get_links(seed)
        depth+=1
        coros = [asyncio.Task(crawler(link,depth)) for link in links]
        yield from asyncio.gather(*coros)

sem = asyncio.Semaphore(5)
loop = asyncio.get_event_loop()
loop.run_until_complete(crawler("http://www.bloomberg.com",0))

Whilst asyncio seems to be documented quite well, aiohttp seems to have very little documentation so I'm struggling to work some things out myself.

Firstly, is there a way to detect the encoding of the page response? Secondly, can we request that connections are kept alive within a session? Or is this True by default, as in requests?

1 Answer

You can look at response.headers['Content-Type'], or use the chardet library for malformed HTTP responses. The response body is a bytes string.
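As a rough sketch of that approach (not aiohttp's own API), the helper below reads the charset parameter from a Content-Type header value and falls back to chardet only when the header gives nothing usable. The names detect_encoding and decode_body are hypothetical, the utf-8 default is an assumption, and chardet is treated as an optional third-party dependency:

```python
import codecs
from email.message import Message

try:
    import chardet  # optional third-party detector, used only as a fallback
except ImportError:
    chardet = None


def detect_encoding(content_type, body, default="utf-8"):
    """Guess the text encoding of an HTTP response body (hypothetical helper)."""
    # 1. Prefer the charset parameter of the Content-Type header.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            try:
                codecs.lookup(charset)  # reject names Python cannot decode
                return charset
            except LookupError:
                pass
    # 2. Otherwise let chardet guess from the raw bytes, if it is installed.
    if chardet is not None and body:
        guess = chardet.detect(body)["encoding"]
        if guess:
            return guess
    # 3. Last resort: an assumed default.
    return default


def decode_body(content_type, body):
    """Decode raw response bytes using the detected encoding."""
    return body.decode(detect_encoding(content_type, body), errors="replace")
```

In the question's fetch coroutine, something like decode_body(response.headers.get('Content-Type'), body) could stand in for the hard-coded decode('utf-8'), where body is the bytes returned by read_and_close().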

For keep-alive connections, you should share a connector between requests, like:

connector = aiohttp.TCPConnector(share_cookies=True)

response1 = yield from aiohttp.request('get', url1, connector=connector)
body1 = yield from response1.read_and_close()
response2 = yield from aiohttp.request('get', url2, connector=connector)
body2 = yield from response2.read_and_close()