
I've written a script in Python to scrape some information from a webpage. The code runs flawlessly when taken out of asyncio. However, since my script runs synchronously, I wanted to make it asynchronous so that it accomplishes the task in the shortest possible time, with optimal performance and obviously in a non-blocking manner. As I have never worked with the asyncio library before, I'm confused about how to go about it. I've tried to fit my script into the asyncio flow, but it doesn't seem right. If somebody could lend a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:

import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()

Upon execution, this is what I see in the console:

RuntimeWarning: coroutine 'processing_docs' was never awaited
  processing_docs(base_link + titles.attrib['href'])
  • What is the point of using asyncio in your program? requests performs HTTP queries synchronously anyway. You need to either run the requests code via loop.run_in_executor() or replace requests with aiohttp. (A minimal sketch of the run_in_executor() approach follows these comments.) Commented Sep 6, 2017 at 10:35
  • @Andrew Svetlov, I'm confused by your comment. I really don't have good knowledge of this. Did I waste my time in vain, then? I thought the program would run asynchronously; to be more specific, I thought the requests would be processed simultaneously rather than queuing up for one request to complete at a time. Commented Sep 6, 2017 at 12:31
  • No, requests is a synchronous library. You can tell from the absence of await before the requests.get() call. Commented Sep 6, 2017 at 15:59
  • By the way, github.com/aosabook/500lines/tree/master/crawler is an async crawler example. It's written by Guido van Rossum and A. Jesse Jiryu Davis and uses aiohttp under the hood. To be clear: I'm an aiohttp maintainer and wrote about a quarter of the asyncio source code. I know what I'm saying very well. Commented Sep 6, 2017 at 16:03
  • Thanks, Andrew Svetlov, for your suggestion and the link. I'll go through it for sure. Commented Sep 6, 2017 at 16:16
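
Editor's note: as a minimal sketch of the loop.run_in_executor() approach from the first comment, requests can stay as-is, with each blocking requests.get() call handed off to a thread pool so the event loop can drive several downloads at once. The two tag URLs here are purely illustrative examples, not taken from the question:

import asyncio
import requests

async def fetch(loop, url):
    # requests.get() blocks, so run it in the default thread pool;
    # run_in_executor() returns a future the event loop can await
    return await loop.run_in_executor(None, requests.get, url)

async def main(loop):
    # both downloads now run concurrently in worker threads
    responses = await asyncio.gather(
        fetch(loop, "http://quotes.toscrape.com/tag/love/"),
        fetch(loop, "http://quotes.toscrape.com/tag/inspirational/"),
    )
    for response in responses:
        print(response.status_code, response.url)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()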

1 Answer


You need to call processing_docs() with await.

Replace:

processing_docs(base_link + titles.attrib['href'])

with:

await processing_docs(base_link + titles.attrib['href'])

And replace:

processing_docs(page_link)

with:

await processing_docs(page_link)

Otherwise, calling the coroutine function just creates a coroutine object without ever scheduling it on the event loop, so its body never runs. That is exactly what the RuntimeWarning: coroutine 'processing_docs' was never awaited in your console output is telling you.
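
To see the difference in isolation, here is a minimal self-contained demo (greet is a hypothetical coroutine, purely for illustration):

import asyncio

async def greet():
    print("hello")

async def main():
    greet()        # only creates a coroutine object; nothing runs, and
                   # Python warns "coroutine 'greet' was never awaited"
    await greet()  # actually runs the coroutine and prints "hello"

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()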

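Note that even with the awaits added, requests.get() still blocks the event loop, so the pages are fetched one after another. To actually download concurrently, as suggested in the comments above, you can switch to aiohttp. A rough sketch under that assumption (aiohttp installed; selectors and URL handling copied from the original script):

import asyncio
import aiohttp
from lxml import html

link = "http://quotes.toscrape.com/"

async def fetch(session, url):
    # aiohttp performs the request without blocking the event loop
    async with session.get(url) as response:
        return await response.text()

async def quotes_scraper(base_link):
    async with aiohttp.ClientSession() as session:
        tree = html.fromstring(await fetch(session, base_link))
        tag_links = [base_link + a.attrib['href']
                     for a in tree.cssselect("span.tag-item a.tag")]
        # schedule every tag page at once instead of one after another
        await asyncio.gather(*(processing_docs(session, url) for url in tag_links))

async def processing_docs(session, page_link):
    root = html.fromstring(await fetch(session, page_link))
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")
    if next_page:
        # follow pagination sequentially within a tag
        await processing_docs(session, link + next_page[0].attrib['href'])

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()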