
I want to get data from multiple pages (about 10,000 pages) containing number arrays. But fetching them one by one is taking very long, and I'm new to Python, so I don't know much about multithreading and asynchronism in this language.

The code works fine and fetches all the data expected, but it takes several minutes to do so. I know it could probably be done faster if I made more than one request at a time.

import http.client
import json

def get_all_data():
    connection = http.client.HTTPConnection("localhost:5000")
    page = 1
    data = {}

    try:
        while True:
            api_url = f'/api/numbers?page={page}'
            connection.request('GET', api_url)
            response = connection.getresponse()

            if response.status == 200:
                data[f'{page}'] = json.loads(response.read())['numbers']
                items_returned = len(data[f'{page}'])
                print(f'Please wait, fetching the data... Request: {page} -- Items returned: {items_returned}')
                if items_returned == 0:  # an empty page means there is no more data
                    break
                page += 1
            else:
                break  # stop instead of re-requesting the same page forever
    finally:
        connection.close()

    print('All requests completed!')
    return data

How can I refactor this code to do multiple requests concurrently instead of one by one?

  • Take a look at How to use threading in Python? Commented Jan 11, 2019 at 6:33
  • I understand what was going on in stackoverflow.com/a/2846697/5921486, but in that case he was creating multiple threads for multiple URLs. In my case I'm always using the same URL, just changing the params. How do I deal with that? Because to increment the page param and go to the next page, I have to get a positive HTTP response first... Commented Jan 11, 2019 at 6:54
  • If you absolutely must wait for a positive response before you make another request, then you can't make another request; what you're wanting is impossible under those circumstances. But I don't really see why it would be bad to request page 2 before you get a positive response from page 1 (see the sketch after these comments). I found this library that is meant for doing simultaneous HTTP requests. Commented Jan 12, 2019 at 17:44
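
A minimal sketch of that comment's idea, using only the standard library: start a fixed window of page requests on separate threads instead of waiting for each response before sending the next one. The localhost URL comes from the question; the window of 10 pages is an arbitrary choice for illustration.

import json
import threading
import urllib.request

results = {}

def fetch_page(page):
    # Each thread fetches one page; the threads write to distinct keys.
    url = f'http://localhost:5000/api/numbers?page={page}'
    with urllib.request.urlopen(url) as response:
        results[page] = json.loads(response.read())['numbers']

# Fire off 10 requests at once instead of one by one.
threads = [threading.Thread(target=fetch_page, args=(p,)) for p in range(1, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f'Fetched {len(results)} pages')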

2 Answers


Basically there are three ways of doing this kind of job: multithreading, multiprocessing, and the async way. As mentioned by ACE, the page parameter exists because the server dynamically generates the pages, and the number of pages may change over time as the database is updated. The easiest way of doing this is a batch job: request a fixed batch of pages concurrently, wrap each batch in a try/except block, and handle the last (partial) batch separately. You can make the number of requests per batch a variable and experiment with different values; a sketch of this approach follows.
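
A minimal sketch of the batch approach described above, assuming the localhost:5000/api/numbers endpoint from the question. The batch size of 50 is an arbitrary starting point to tune, and an empty (or failed) page is treated as the end of the data:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 50  # tune this; bigger batches waste more requests past the last page

def fetch_page(page):
    url = f'http://localhost:5000/api/numbers?page={page}'
    try:
        with urllib.request.urlopen(url) as response:
            return page, json.loads(response.read())['numbers']
    except Exception:
        return page, None  # treat a failed request like an empty page

def get_all_data():
    data = {}
    first_page = 1
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
        while True:
            pages = range(first_page, first_page + BATCH_SIZE)
            done = False
            # Request the whole batch concurrently, then collect the results.
            for page, numbers in executor.map(fetch_page, pages):
                if not numbers:  # empty page: we have run past the end
                    done = True
                else:
                    data[f'{page}'] = numbers
            if done:
                break
            first_page += BATCH_SIZE
    return data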


Your page parameter (the producer) is dynamic: it relies on the response to the last request (the consumer). Unless you can separate the producer from the consumer, you can't use coroutines or multithreading. A hypothetical sketch of such a separation follows.
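
For illustration, a hypothetical sketch of separating the producer. If the API also reported how many pages exist (an assumption; the total_pages field below is invented, and the endpoint in the question may expose nothing like it), the page numbers would be known up front and the remaining pages could all be fetched concurrently:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_json(url):
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def get_all_data():
    # One sequential request to learn the page range (hypothetical 'total_pages' field).
    first = fetch_json('http://localhost:5000/api/numbers?page=1')
    total_pages = first['total_pages']

    data = {'1': first['numbers']}
    # The producer (the list of pages) is now known, so the requests can run in parallel.
    with ThreadPoolExecutor(max_workers=20) as executor:  # worker count is a tuning choice
        pages = range(2, total_pages + 1)
        urls = [f'http://localhost:5000/api/numbers?page={p}' for p in pages]
        for page, body in zip(pages, executor.map(fetch_json, urls)):
            data[f'{page}'] = body['numbers']
    return data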

