
I have a function that scrapes href links from a particular page and returns the result. I want to call this function in parallel to save time. I have looked at Running same function for multiple files in parallel in python, but the challenge is that I need to save the returned element in a list. How can I do that? Here is my code snippet.

url = "https://www.programmableweb.com/category/all/apis"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data,'html.parser')

#function to scrape individual pages
def scrap_api_url(i):
    print(i)
    page_url = "https://www.programmableweb.com" + mid_url + '=' + str(i)
    response = requests.get(page_url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    all_api = soup.find_all('tr', class_ = re.compile('^(even|odd)$'))
    return all_api

url_tag = soup.find('a', {'title': 'Go to next page'})
mid_url = url_tag.get('href').split('=')[0]
threads = []

# calling functions
if __name__ == '__main__':
    inputs = [i for i in range(851)]
    for item in inputs:
        print('Thread Started :: ', item)
        t = threading.Thread(target=scrap_api_url, args=(item,))
        threads.append(t)
        t.start()

h = []
for t in threads:
    h.append(t.join())  # join() always returns None, so the scraped rows never make it into h
2 Comments
  • It sounds like you want a multiprocessing map. Commented May 29, 2020 at 11:43
  • Yes, I am using multiprocessing. I have tried with Pool, but it's not working; it just gets stuck. Commented May 29, 2020 at 14:15
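
For reference, here is a minimal sketch of the multiprocessing map suggested in the comment above, under the usual constraints: the worker must be a module-level function so it can be pickled, and the pool must be created under the __main__ guard. As an assumption for illustration, the worker returns plain strings rather than BeautifulSoup tags so the results can be pickled back to the parent process; fetch_page is an illustrative name, not part of the original code.

import re
from functools import partial
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup


# Worker must be defined at module level so multiprocessing can pickle it.
def fetch_page(mid_url, i):
    page_url = "https://www.programmableweb.com" + mid_url + '=' + str(i)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    rows = soup.find_all('tr', class_=re.compile('^(even|odd)$'))
    # Return plain strings so each result can be pickled back to the parent process.
    return [str(row) for row in rows]


if __name__ == '__main__':
    start_url = "https://www.programmableweb.com/category/all/apis"
    soup = BeautifulSoup(requests.get(start_url).text, 'html.parser')
    mid_url = soup.find('a', {'title': 'Go to next page'}).get('href').split('=')[0]

    with Pool() as pool:
        # pool.map returns an ordinary list, one entry per page index.
        results = pool.map(partial(fetch_page, mid_url), range(851))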

1 Answer


You can use the ThreadPoolExecutor map method:

import re
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def main():
    url = "https://www.programmableweb.com/category/all/apis"
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')

    url_tag = soup.find('a', {'title': 'Go to next page'})
    mid_url = url_tag.get('href').split('=')[0]

    # function to scrape individual pages
    def scrap_api_url(i):
        print(i)
        page_url = "https://www.programmableweb.com" + mid_url + '=' + str(i)
        response = requests.get(page_url)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        all_api = soup.find_all('tr', class_=re.compile('^(even|odd)$'))
        return all_api

    inputs = [i for i in range(851)]
    with ThreadPoolExecutor() as executor:
        future_results = executor.map(scrap_api_url, inputs)
        results = [result for result in future_results]

    print(results)

#calling functions
if __name__ == '__main__':
    main()
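
Note that ThreadPoolExecutor() without arguments picks a default worker count (min(32, os.cpu_count() + 4) on recent Python versions), so the 851 requests are spread over a modest number of threads rather than fired all at once. If you want to tune the concurrency explicitly, you can pass max_workers; the value 20 below is just an illustration:

with ThreadPoolExecutor(max_workers=20) as executor:
    future_results = executor.map(scrap_api_url, inputs)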


2 Comments

Thanks @Sylvaus. It worked just the way I wanted. However, I need to learn more about ThreadPoolExecutor. Thanks a lot.
No problem. By the way, multiprocessing is most likely not the solution in your case, as your problem seems I/O-bound rather than CPU-bound.
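
For completeness: if the work were CPU-bound, concurrent.futures also provides ProcessPoolExecutor with the same map interface, but the worker would then have to be a picklable module-level function rather than the nested scrap_api_url used above. A minimal sketch, assuming a module-level worker named scrape_page:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(scrape_page, inputs))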
