
I have a function that scrapes href links from a particular page and returns the result. I want to call this function in parallel to save time. I have looked at Running same function for multiple files in parallel in python, but the challenge is that I need to save the returned element in a list. How can I do that? Here is my code snippet.

url = "https://www.programmableweb.com/category/all/apis"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data,'html.parser')

#function to scrape individual pages
def scrap_api_url(i):
    print(i)
    page_url = "https://www.programmableweb.com" + mid_url + '=' + str(i)
    response = requests.get(page_url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    all_api = soup.find_all('tr', class_ = re.compile('^(even|odd)$'))
    return all_api

url_tag = soup.find('a', {'title': 'Go to next page'})
mid_url = url_tag.get('href').split('=')[0]
threads = []

# calling functions
if __name__ == '__main__':
    inputs = [i for i in range(851)]
    for item in inputs:
        print('Thread Started :: ', item)
        t = threading.Thread(target=scrap_api_url, args=(item,))
        threads.append(t)
        t.start()

h = []
for t in threads:
    h.append(t.join())  # join() always returns None, so the scraped rows never make it into h
2 Comments
  • It sounds like you want a multiprocessing map. Commented May 29, 2020 at 11:43
  • Yes, I am using multiprocessing. I have tried with Pool, but it's not working; it just gets stuck. Commented May 29, 2020 at 14:15
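
For reference, here is a minimal sketch of the multiprocessing map suggested in the comment above, under the usual constraints: the worker must be a module-level function so it can be pickled, and the pool must be created under the __main__ guard. As an assumption for illustration, the worker returns plain strings rather than BeautifulSoup tags so the results can be pickled back to the parent process; fetch_page is an illustrative name, not part of the original code.

import re
from functools import partial
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup


# Worker must be defined at module level so multiprocessing can pickle it.
def fetch_page(mid_url, i):
    page_url = "https://www.programmableweb.com" + mid_url + '=' + str(i)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    rows = soup.find_all('tr', class_=re.compile('^(even|odd)$'))
    # Return plain strings so each result can be pickled back to the parent process.
    return [str(row) for row in rows]


if __name__ == '__main__':
    start_url = "https://www.programmableweb.com/category/all/apis"
    soup = BeautifulSoup(requests.get(start_url).text, 'html.parser')
    mid_url = soup.find('a', {'title': 'Go to next page'}).get('href').split('=')[0]

    with Pool() as pool:
        # pool.map returns an ordinary list, one entry per page index.
        results = pool.map(partial(fetch_page, mid_url), range(851))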

1 Answer


You can use the ThreadPoolExecutor map method:

import re
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


def main():
    url = "https://www.programmableweb.com/category/all/apis"
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')

    url_tag = soup.find('a', {'title': 'Go to next page'})
    mid_url = url_tag.get('href').split('=')[0]

    # function to scrape individual pages
    def scrap_api_url(i):
        print(i)
        page_url = "https://www.programmableweb.com" + mid_url + '=' + str(i)
        response = requests.get(page_url)
        data = response.text
        soup = BeautifulSoup(data, 'html.parser')
        all_api = soup.find_all('tr', class_=re.compile('^(even|odd)$'))
        return all_api

    inputs = [i for i in range(851)]
    with ThreadPoolExecutor() as executor:
        future_results = executor.map(scrap_api_url, inputs)
        results = [result for result in future_results]

    print(results)

#calling functions
if __name__ == '__main__':
    main()
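
Note that ThreadPoolExecutor() without arguments picks a default worker count (min(32, os.cpu_count() + 4) on recent Python versions), so the 851 requests are spread over a modest number of threads rather than fired all at once. If you want to tune the concurrency explicitly, you can pass max_workers; the value 20 below is just an illustration:

with ThreadPoolExecutor(max_workers=20) as executor:
    future_results = executor.map(scrap_api_url, inputs)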


2 Comments

Thanks @Sylvaus. It worked just the way I wanted. However, I need to learn more about ThreadPoolExecutor. Thanks a lot.
No problem. By the way, multiprocessing is most likely not the solution in your case, as your problem seems I/O-bound rather than CPU-bound.
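
For completeness: if the work were CPU-bound, concurrent.futures also provides ProcessPoolExecutor with the same map interface, but the worker would then have to be a picklable module-level function rather than the nested scrap_api_url used above. A minimal sketch, assuming a module-level worker named scrape_page:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as executor:
    results = list(executor.map(scrape_page, inputs))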
