
I want to get data from multiple pages (about 10,000 pages) containing number arrays. But fetching them one by one is taking very long, and I'm new to Python, so I don't know much about multithreading and asynchronism in this language.

The code works fine and fetches all the data expected, but it takes several minutes to do so. I know it could probably be done faster if I made more than one request at a time.

import http.client
import json

def get_all_data():
    connection = http.client.HTTPConnection("localhost:5000")
    page = 1
    data = {}

    try:
        while True:
            api_url = f'/api/numbers?page={page}'
            connection.request('GET', api_url)
            response = connection.getresponse()

            if response.status == 200:
                data[f'{page}'] = json.loads(response.read())['numbers']
                items_returned = len(data[f'{page}'])
                print(f'Please wait, fetching the data... Request: {page} -- Items returned: {items_returned}')
                if items_returned == 0:  # an empty page means there is no more data
                    break
                page += 1
            else:
                break  # stop instead of re-requesting the same page forever
    finally:
        connection.close()

    print('All requests completed!')
    return data

How can I refactor this code to do multiple requests concurrently instead of one by one?

  • Take a look at How to use threading in Python? Commented Jan 11, 2019 at 6:33
  • I understand what was going on in stackoverflow.com/a/2846697/5921486, but in that case he was creating multiple threads for multiple URLs. In my case I'm always using the same URL, just changing the params. How do I deal with that? Because to increment the page param and go to the next page, I have to get a positive HTTP response first... Commented Jan 11, 2019 at 6:54
  • If you absolutely must wait for a positive response before you make another request, then you can't make another request; what you're wanting is impossible under those circumstances. But I don't really see why it would be bad to request page 2 before you get a positive response from page 1 (see the sketch after these comments). I found this library that is meant for doing simultaneous HTTP requests. Commented Jan 12, 2019 at 17:44
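
A minimal sketch of that comment's idea, using only the standard library: start a fixed window of page requests on separate threads instead of waiting for each response before sending the next one. The localhost URL comes from the question; the window of 10 pages is an arbitrary choice for illustration.

import json
import threading
import urllib.request

results = {}

def fetch_page(page):
    # Each thread fetches one page; the threads write to distinct keys.
    url = f'http://localhost:5000/api/numbers?page={page}'
    with urllib.request.urlopen(url) as response:
        results[page] = json.loads(response.read())['numbers']

# Fire off 10 requests at once instead of one by one.
threads = [threading.Thread(target=fetch_page, args=(p,)) for p in range(1, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f'Fetched {len(results)} pages')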

2 Answers


Basically there are three ways of doing this kind of job: multithreading, multiprocessing, and the async way. As mentioned by ACE, the page parameter exists because the server dynamically generates the pages, and the number of pages may change over time as the database is updated. The easiest way of doing this is a batch job: request a fixed batch of pages concurrently, wrap each batch in a try/except block, and handle the last (partial) batch separately. You can make the number of requests per batch a variable and experiment with different values; a sketch of this approach follows.
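
A minimal sketch of the batch approach described above, assuming the localhost:5000/api/numbers endpoint from the question. The batch size of 50 is an arbitrary starting point to tune, and an empty (or failed) page is treated as the end of the data:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 50  # tune this; bigger batches waste more requests past the last page

def fetch_page(page):
    url = f'http://localhost:5000/api/numbers?page={page}'
    try:
        with urllib.request.urlopen(url) as response:
            return page, json.loads(response.read())['numbers']
    except Exception:
        return page, None  # treat a failed request like an empty page

def get_all_data():
    data = {}
    first_page = 1
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as executor:
        while True:
            pages = range(first_page, first_page + BATCH_SIZE)
            done = False
            # Request the whole batch concurrently, then collect the results.
            for page, numbers in executor.map(fetch_page, pages):
                if not numbers:  # empty page: we have run past the end
                    done = True
                else:
                    data[f'{page}'] = numbers
            if done:
                break
            first_page += BATCH_SIZE
    return data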


Your page parameter (the producer) is dynamic: it relies on the response to the last request (the consumer). Unless you can separate the producer from the consumer, you can't use coroutines or multithreading. A hypothetical sketch of such a separation follows.
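
For illustration, a hypothetical sketch of separating the producer. If the API also reported how many pages exist (an assumption; the total_pages field below is invented, and the endpoint in the question may expose nothing like it), the page numbers would be known up front and the remaining pages could all be fetched concurrently:

import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_json(url):
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

def get_all_data():
    # One sequential request to learn the page range (hypothetical 'total_pages' field).
    first = fetch_json('http://localhost:5000/api/numbers?page=1')
    total_pages = first['total_pages']

    data = {'1': first['numbers']}
    # The producer (the list of pages) is now known, so the requests can run in parallel.
    with ThreadPoolExecutor(max_workers=20) as executor:  # worker count is a tuning choice
        pages = range(2, total_pages + 1)
        urls = [f'http://localhost:5000/api/numbers?page={p}' for p in pages]
        for page, body in zip(pages, executor.map(fetch_json, urls)):
            data[f'{page}'] = body['numbers']
    return data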

