
I've written a script in Python to scrape some information from a webpage. The code runs flawlessly when taken out of asyncio. However, since my script runs synchronously, I wanted to make it asynchronous so that it accomplishes the task in the shortest possible time, with optimal performance and obviously in a non-blocking manner. As I have never worked with the asyncio library before, I'm confused about how to go about it. I've tried to fit my script into the asyncio flow, but it doesn't seem right. If somebody could lend a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:

import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)

    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()

Upon execution, this is what I see in the console:

RuntimeWarning: coroutine 'processing_docs' was never awaited
  processing_docs(base_link + titles.attrib['href'])
  • What is the point of using asyncio in your program? requests performs HTTP queries synchronously anyway. You need to either run the requests code via loop.run_in_executor() or replace requests with aiohttp. (A minimal sketch of the run_in_executor() approach follows these comments.) Commented Sep 6, 2017 at 10:35
  • @Andrew Svetlov, I'm confused by your comment. I really don't have good knowledge of this. Did I waste my time in vain, then? I thought the program would run asynchronously; to be more specific, I thought the requests would be processed simultaneously rather than queuing up for one request to complete at a time. Commented Sep 6, 2017 at 12:31
  • No, requests is a synchronous library. You can tell from the absence of await before the requests.get() call. Commented Sep 6, 2017 at 15:59
  • By the way, github.com/aosabook/500lines/tree/master/crawler is an async crawler example. It's written by Guido van Rossum and A. Jesse Jiryu Davis and uses aiohttp under the hood. To be clear: I'm an aiohttp maintainer and wrote about a quarter of the asyncio source code. I know what I'm saying very well. Commented Sep 6, 2017 at 16:03
  • Thanks, Andrew Svetlov, for your suggestion and the link. I'll go through it for sure. Commented Sep 6, 2017 at 16:16
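
Editor's note: as a minimal sketch of the loop.run_in_executor() approach from the first comment, requests can stay as-is, with each blocking requests.get() call handed off to a thread pool so the event loop can drive several downloads at once. The two tag URLs here are purely illustrative examples, not taken from the question:

import asyncio
import requests

async def fetch(loop, url):
    # requests.get() blocks, so run it in the default thread pool;
    # run_in_executor() returns a future the event loop can await
    return await loop.run_in_executor(None, requests.get, url)

async def main(loop):
    # both downloads now run concurrently in worker threads
    responses = await asyncio.gather(
        fetch(loop, "http://quotes.toscrape.com/tag/love/"),
        fetch(loop, "http://quotes.toscrape.com/tag/inspirational/"),
    )
    for response in responses:
        print(response.status_code, response.url)

loop = asyncio.get_event_loop()
loop.run_until_complete(main(loop))
loop.close()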

1 Answer


You need to call processing_docs() with await.

Replace:

processing_docs(base_link + titles.attrib['href'])

with:

await processing_docs(base_link + titles.attrib['href'])

And replace:

processing_docs(page_link)

with:

await processing_docs(page_link)

Otherwise, calling the coroutine function just creates a coroutine object without ever scheduling it on the event loop, so its body never runs. That is exactly what the RuntimeWarning: coroutine 'processing_docs' was never awaited in your console output is telling you.
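
To see the difference in isolation, here is a minimal self-contained demo (greet is a hypothetical coroutine, purely for illustration):

import asyncio

async def greet():
    print("hello")

async def main():
    greet()        # only creates a coroutine object; nothing runs, and
                   # Python warns "coroutine 'greet' was never awaited"
    await greet()  # actually runs the coroutine and prints "hello"

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
loop.close()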

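Note that even with the awaits added, requests.get() still blocks the event loop, so the pages are fetched one after another. To actually download concurrently, as suggested in the comments above, you can switch to aiohttp. A rough sketch under that assumption (aiohttp installed; selectors and URL handling copied from the original script):

import asyncio
import aiohttp
from lxml import html

link = "http://quotes.toscrape.com/"

async def fetch(session, url):
    # aiohttp performs the request without blocking the event loop
    async with session.get(url) as response:
        return await response.text()

async def quotes_scraper(base_link):
    async with aiohttp.ClientSession() as session:
        tree = html.fromstring(await fetch(session, base_link))
        tag_links = [base_link + a.attrib['href']
                     for a in tree.cssselect("span.tag-item a.tag")]
        # schedule every tag page at once instead of one after another
        await asyncio.gather(*(processing_docs(session, url) for url in tag_links))

async def processing_docs(session, page_link):
    root = html.fromstring(await fetch(session, page_link))
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")
    if next_page:
        # follow pagination sequentially within a tag
        await processing_docs(session, link + next_page[0].attrib['href'])

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()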