
My current code creates a separate Session object for every request, because it calls the .get() method directly:

content_getters.py (the relevant part):

import requests


def get_page_content(link: str) -> bytes:
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; "
                             "Intel Mac OS X 10_11_6) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) "
                             "Chrome/61.0.3163.100 Safari/537.36"}

    response = requests.get(link, headers=headers)

    if response.status_code != requests.codes.ok:
        raise ConnectionError("Page", link, "returned status code",
                              response.status_code)

    return response.content

def parse_single_page(link):
    content = get_page_content(link)
    # rest of very long function

main.py:

from concurrent.futures.thread import ThreadPoolExecutor

from content_getters import get_page_content, extract_links, parse_single_page

if __name__ == "__main__":
    MAX_THREADS = 30

    # get links
    html: str = get_page_content(
        "https://www.d20pfsrd.com/bestiary/bestiary-hub/monsters-by-cr/") \
        .decode("utf-8")

    links = extract_links(html)

    num_threads = min(MAX_THREADS, len(links))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # tasks run concurrently in the pool; executor.map yields the
        # results in the same order as the input links
        results = list(executor.map(parse_single_page, links))

The requests docs state that "if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase". I suppose that each of my separate calls to the .get() method creates its own Session object, so no connection is reused, and that sharing a Session could be faster.

Question: Is the Session object synchronous (sequential) for all requests made with it? Will I still get asynchronous requests if I use the same Session object for all threads in concurrent.futures.thread.ThreadPoolExecutor, instead of 1 Session per thread as I'm doing now?


2 Answers

2

In short, Session is not thread-safe; you can check the issue discussion on GitHub.
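If you do stay with threads, one common workaround is to give each worker thread its own Session, so that no Session is ever shared. The following is only a minimal sketch of that idea, not something taken from the requests docs: get_session, fetch_all and the 10-second timeout are names and values I made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests

# Thread-local storage: every thread sees its own "session" attribute.
_thread_local = threading.local()


def get_session() -> requests.Session:
    """Return the calling thread's Session, creating it on first use."""
    session = getattr(_thread_local, "session", None)
    if session is None:
        session = requests.Session()
        _thread_local.session = session
    return session


def get_page_content(link: str) -> bytes:
    # The thread's own Session is reused for every request made by that
    # thread, so connections to the same host can be kept alive.
    response = get_session().get(link, timeout=10)  # timeout is an assumption
    response.raise_for_status()
    return response.content


def fetch_all(links: List[str], max_threads: int = 30) -> List[bytes]:
    # At least one worker, at most max_threads.
    num_threads = max(1, min(max_threads, len(links)))
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        return list(executor.map(get_page_content, links))
```

With this layout there are at most max_threads Session objects alive, instead of one per request.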

For your case, I would highly recommend looking at asyncio and the aiohttp module, where you are free to pass a session around because everything runs in one thread. It also incurs less overhead than multithreading. As they say:

Use asyncio when you can, use threads when you must

The documentation on aiohttp.
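A rough sketch of what this could look like, assuming aiohttp is installed. The function names and the max_concurrency limit are my own choices for illustration, mirroring MAX_THREADS from the question, and parsing is left out entirely:

```python
import asyncio
from typing import List

import aiohttp


async def fetch(session: aiohttp.ClientSession, link: str) -> bytes:
    # One shared ClientSession is fine here: all coroutines run in one thread.
    async with session.get(link) as response:
        response.raise_for_status()
        return await response.read()


async def fetch_all(links: List[str], max_concurrency: int = 30) -> List[bytes]:
    # The semaphore caps how many requests are in flight at once
    # (a hypothetical limit, comparable to MAX_THREADS in the question).
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(link: str) -> bytes:
        async with semaphore:
            return await fetch(session, link)

    async with aiohttp.ClientSession() as session:
        # gather returns the results in the same order as the input links.
        return await asyncio.gather(*(bounded_fetch(link) for link in links))
```

You would call it with something like asyncio.run(fetch_all(links)) and then run the CPU-bound parsing on the returned bytes.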


2 Comments

Very interesting, I completely forgot about the asyncio, thanks!
@qalis it is awesome! It may have a bit of a learning curve, but it is totally worth it. I would suggest checking this article before the official documentation, which is quite verbose.
2

As per the documentation, requests.Session uses urllib3's connection pooling. And as per urllib3's documentation, that connection pool is now thread-safe.

It probably wasn't when the question was originally posted, but according to a GitHub comment, it has most likely been made thread-safe since then.
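If you want to try a single shared Session on that basis, a sketch could look like the one below. Treat it as an illustration under assumptions: build_shared_session, fetch_content and the timeout are hypothetical names and values, the pool is sized to match the thread count so workers do not discard connections, and the worker threads only call .get() without mutating the Session (headers, cookies), since the thread safety discussed here concerns the connection pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

import requests
from requests.adapters import HTTPAdapter

MAX_THREADS = 30


def build_shared_session(pool_size: int = MAX_THREADS) -> requests.Session:
    """Create one Session whose connection pool can serve pool_size threads."""
    session = requests.Session()
    # pool_maxsize is the number of connections kept per host; the default
    # of 10 would be smaller than the 30 worker threads used here.
    adapter = HTTPAdapter(pool_connections=pool_size, pool_maxsize=pool_size)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch_content(session: requests.Session, link: str) -> bytes:
    # Workers only read from the shared Session; they never modify it.
    response = session.get(link, timeout=10)  # timeout is an assumption
    response.raise_for_status()
    return response.content


def fetch_all(links: List[str]) -> List[bytes]:
    if not links:
        return []
    with build_shared_session() as session:
        with ThreadPoolExecutor(max_workers=min(MAX_THREADS, len(links))) as executor:
            return list(executor.map(lambda link: fetch_content(session, link), links))
```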

1 Comment

Just adding a link to the commit, based on the comment you posted: github.com/urllib3/urllib3/pull/2661
