Python unable to get API with requests: Web scraping, Requests, API

Question

I'm trying to scrape a website with Python, but I can't find the right API request to make with requests, so I can't get the product information.

This is the website. Is anyone able to get the API response with product information, such as name and price? Note: the site loads more products as you scroll down.

https://www.atacadao.com.br/bebidas/

If I can't do it with requests, I'll probably fall back to Selenium, which I'd really like to avoid because it is slow for scraping.

Thanks in advance :)

If the page loads data as you scroll down, it uses JavaScript. requests and BeautifulSoup can't run JavaScript, so you may need Selenium to control a real web browser that can.
– furas, Commented Nov 16, 2021 at 15:27
What did you try? Where is your code? We won't write all the code for you. If you know the site uses an API, show all the details you found, and put them in the question (not in comments): it will be more readable and more people will see it.
– furas, Commented Nov 16, 2021 at 15:27

furas · Accepted Answer · 2021-11-16 16:05:56Z

Using DevTools in Firefox/Chrome (tab: Network, filter: XHR), I found that the page's JavaScript reads the data as JSON from this URL:

https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance

So with requests I can run

import requests

url = 'https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance'

r = requests.get(url)

print(r.text[:1000])   # show only beginning of data
print('------------')

data = r.json()

for item in data['results'][:3]:  # I use `[:3]` to show only first three results
    #print(item.keys())

    #for key, val in item.items():
    #    print(f'{key}: {val}')

    print('name:', item['name'])
    print('price:', item['price'])
    print('url:', item['url'])

    print('---')

to get

{"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}, "results": [{"pk": 4854, "full_display": "Refrigerante lata 350ml - Coca Cola", "name": "Refrigerante", "brand": "Coca Cola", "type": "", "category": "Refrigerantes", "unit": "UN", "cart": {"cart": false, "multiplier": "", "count": "", "distributor_id": null, "distributor_name": null}, "photo_url": ["https://media.cotabest.com.br/media/sku/refrigerante-coca-cola-lata-350ml-coca-cola-un.png"], "price": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "highlight": true, "price_statistics": {"quantity_prices": 20, "discount": 31, "cheaper": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "expensive": "2.99"}, "multipliers": [{"unit_price": "2.05", "multiplier": "6.00", "distributor_id": 84022367}, {"unit_price": "2.05", "multiplier": "6.0
------------
name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---
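
As the output shows, `item['price']` is itself a dictionary, and its `'price'` value is a string with a decimal comma (`'2,05'`). A minimal sketch of converting it to a number follows. The helper name `parse_price` is hypothetical, and treating dots that appear alongside a comma as thousands separators is an assumption, since no such value appears in the sample data.

```python
from decimal import Decimal


def parse_price(price_text):
    """Convert a price string such as '2,05' (decimal comma) to Decimal.

    Assumption: when a comma is present, any dots are thousands separators.
    Strings that already use a decimal dot (like '2.99') pass through as-is.
    """
    if ',' in price_text:
        price_text = price_text.replace('.', '').replace(',', '.')
    return Decimal(price_text)


# Sample item trimmed from the output above.
item = {
    'name': 'Refrigerante',
    'price': {
        'price': '2,05',
        'multiplier': 6.0,
        'distributor_name': 'ATACADÃO CD BELÉM',
        'distributor_id': 84022367,
    },
    'url': '/refrigerante-coca-cola-lata-350ml',
}

unit_price = parse_price(item['price']['price'])
print(item['name'], unit_price)
```

`Decimal` is used instead of `float` so that prices keep their exact two decimal places.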

The URL has page=1, so I can change that value to load other pages.

But I will use a dictionary with params to make it simpler:

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}

payload['page'] = 1  # 2, 3, etc.

r = requests.get(url, params=payload)
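
To check how such a dictionary maps back to the query string found in DevTools, the standard-library `urlencode` can be used offline. This is only an illustration: it does not call requests, although the resulting query should be comparable to what requests builds from `params`.

```python
from urllib.parse import urlencode, parse_qs

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance',
}

# Build the query string; non-string values such as the page number
# are converted to text automatically.
query = urlencode(payload)
full_url = f'{url}?{query}'
print(full_url)

# Decoding the query gives back the original keys, including 'category[]'.
decoded = parse_qs(query, keep_blank_values=True)
print(decoded['category[]'])
```

The square brackets in `category[]` come out percent-encoded as `%5B%5D`, which is the encoded form of the literal brackets in the URL shown earlier.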

Full code

import requests

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}

for number in range(1, 6):
    print('\n=== page:', number, '===\n')
    
    payload['page'] = number
    
    r = requests.get(url, params=payload)
    #print(r.text[:1000])

    data = r.json()

    for item in data['results']:  # add `[:3]` to show only the first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')

Result (shortened here to the first three items of pages 1 and 2):

=== page: 1 ===

name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---

=== page: 2 ===

name: Whisky
price: {'price': '39,83', 'multiplier': 1.0, 'distributor_name': 'ATACADÃO CD IGARASSU', 'distributor_id': 95849062}
url: /whisky-escoces-passport-garrafa-1litro
---
name: Refrigerante
price: {'price': '1,95', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD MANAUS', 'distributor_id': 84019700}
url: /refrigerante-laranja-fanta-lata-350ml
---
name: Suco Integral
price: {'price': '10,97', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD VILA VELHA', 'distributor_id': 96142380}
url: /suco-integral-sabor-uva-aurora-vidro-15litros
---
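
Instead of printing, the same fields could be collected into flat rows and saved as CSV. The sketch below works on two sample items copied from the output above. The helper `to_row` and the column names are hypothetical, and resolving the relative `url` against the site root is an assumption.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = 'https://www.atacadao.com.br/'  # assumed base for relative item URLs


def to_row(item):
    """Flatten one result item into a plain dictionary for csv.DictWriter."""
    price = item['price']
    return {
        'name': item['name'],
        'price': price['price'],
        'distributor': price['distributor_name'],
        'url': urljoin(BASE_URL, item['url']),
    }


# Two sample items trimmed from the output above.
items = [
    {'name': 'Refrigerante',
     'price': {'price': '2,05', 'distributor_name': 'ATACADÃO CD BELÉM'},
     'url': '/refrigerante-coca-cola-lata-350ml'},
    {'name': 'Whisky',
     'price': {'price': '89,00', 'distributor_name': 'ATACADÃO CD BETIM'},
     'url': '/whisky-red-label-johnnie-walker-garrafa-1litro'},
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'price', 'distributor', 'url'])
writer.writeheader()
writer.writerows(to_row(item) for item in items)

print(buffer.getvalue())
```

Because the price strings contain a comma, the `csv` module quotes them automatically, so the columns stay aligned when the file is read back.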

BTW:

In the JSON you can see

"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}

and you could use a while True loop with number = data["paginator"]["next"] to load all pages.

I checked that the last page has an empty string in next, so the loop can stop when it sees it.

number = "1"

while True:

    print('\n=== page:', number, '===\n')
    
    payload['page'] = number
    
    r = requests.get(url, params=payload)
    #print(r.text[:1000])   # show only beginning of data

    data = r.json()

    for item in data['results'][:3]:   # show only first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')
        
    number = data['paginator']['next']  # the key in the JSON is "paginator"
    
    if not number:
        break
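
The same loop can be wrapped in a generator so that paging is separate from printing. This is a minimal sketch, not the code from the repository: `iter_pages`, `fetch_page` and `max_pages` are hypothetical names, and the fake data below stands in for the real site so the example runs offline.

```python
def iter_pages(fetch_page, first_page=1, max_pages=None):
    """Yield decoded JSON pages until paginator['next'] is empty.

    fetch_page: any callable that takes a page number and returns the JSON dict.
    max_pages: optional safety limit on how many pages to request.
    """
    number = first_page
    fetched = 0
    while number:
        data = fetch_page(number)
        yield data
        fetched += 1
        if max_pages is not None and fetched >= max_pages:
            break
        # The last page has an empty string here, which ends the loop.
        number = data['paginator']['next']


# Fake three-page site, shaped like the JSON shown above.
fake_site = {
    1: {'paginator': {'next': 2}, 'results': [{'name': 'Refrigerante'}]},
    2: {'paginator': {'next': 3}, 'results': [{'name': 'Whisky'}]},
    3: {'paginator': {'next': ''}, 'results': [{'name': 'Suco Integral'}]},
}


def fake_fetch(number):
    return fake_site[number]


names = [item['name']
         for page in iter_pages(fake_fetch)
         for item in page['results']]
print(names)
```

With the real site, `fetch_page` would be a small function that sets `payload['page']`, calls `requests.get(url, params=payload)` and returns `r.json()`. The `max_pages` limit is useful while testing, since the paginator reports 157 pages.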

I put the code from my answer on GitHub, in the python-examples repository, in the folder __scraping__.

Thank you very much, it worked! And thanks also for your explanation. Some concepts are new to me because I only started learning how to retrieve API data this week, and so far I've scraped just one website through its API with requests. I'll study your code more closely, and I'll probably come back with some questions, if that's OK with you haha. And I'm going to follow you on GitHub! Regards :)
BTW: there is a site with fake data for learning scraping, toscrape.com, created by the authors of the Scrapy module.
