I'm trying to scrape a website with Python, but I can't find the right API endpoint to call with requests, so I can't retrieve the product information.

This is the website. Can anyone get the API response with the product information, such as name and price? Note: the site loads more products as you scroll down.

https://www.atacadao.com.br/bebidas/

If I can't do it with requests, I'll probably fall back to Selenium, which I really wanted to avoid because it is slow for scraping.

Thanks in advance :)

  • If the page loads data as you scroll down, it uses JavaScript. requests and BeautifulSoup can't run JavaScript, so you may need Selenium to control a real web browser that can. Commented Nov 16, 2021 at 15:27
  • What did you try? Where is your code? We will not write all the code for you. If you know that the site uses an API, show all the details you found, and put them in the question (not in comments): it will be more readable and more people will see it. Commented Nov 16, 2021 at 15:27

1 Answer

Using DevTools in Firefox/Chrome (tab: Network, filter: XHR), I found that the page's JavaScript reads the data as JSON from this URL:

https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance

So using requests I can run

import requests

url = 'https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance'

r = requests.get(url)

print(r.text[:1000])   # show only beginning of data
print('------------')

data = r.json()

for item in data['results'][:3]:  # I use `[:3]` to show only first three results
    #print(item.keys())

    #for key, val in item.items():
    #    print(f'{key}: {val}')

    print('name:', item['name'])
    print('price:', item['price'])
    print('url:', item['url'])

    print('---')

to get

{"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}, "results": [{"pk": 4854, "full_display": "Refrigerante lata 350ml - Coca Cola", "name": "Refrigerante", "brand": "Coca Cola", "type": "", "category": "Refrigerantes", "unit": "UN", "cart": {"cart": false, "multiplier": "", "count": "", "distributor_id": null, "distributor_name": null}, "photo_url": ["https://media.cotabest.com.br/media/sku/refrigerante-coca-cola-lata-350ml-coca-cola-un.png"], "price": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "highlight": true, "price_statistics": {"quantity_prices": 20, "discount": 31, "cheaper": {"price": "2,05", "multiplier": 6.0, "distributor_name": "ATACAD\u00c3O CD BEL\u00c9M", "distributor_id": 84022367}, "expensive": "2.99"}, "multipliers": [{"unit_price": "2.05", "multiplier": "6.00", "distributor_id": 84022367}, {"unit_price": "2.05", "multiplier": "6.0
------------
name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---
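
As the output shows, `price` is a nested dictionary, the amount inside it is a string with a decimal comma, and `url` is relative. Below is a minimal sketch of how an item could be flattened. `parse_price`, `simplify` and `BASE_URL` are hypothetical names, and the sketch assumes that a comma is always the decimal separator (with dots grouping thousands) and that product pages live on the same host.

```python
from decimal import Decimal
from urllib.parse import urljoin

BASE_URL = 'https://www.atacadao.com.br'  # assumption: product pages are on the same host


def parse_price(text):
    """Convert a price string such as '2,05' or '1.234,56' to Decimal."""
    if ',' in text:
        # assumption: comma is the decimal separator and dots group thousands
        text = text.replace('.', '').replace(',', '.')
    return Decimal(text)


def simplify(item):
    """Keep only the name, a numeric price and an absolute URL (hypothetical helper)."""
    return {
        'name': item['name'],
        'price': parse_price(item['price']['price']),
        'url': urljoin(BASE_URL, item['url']),
    }


# item shaped like the first result shown above
item = {
    'name': 'Refrigerante',
    'price': {'price': '2,05', 'multiplier': 6.0,
              'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367},
    'url': '/refrigerante-coca-cola-lata-350ml',
}

print(simplify(item))
```

The JSON also contains values written with a dot (for example `"expensive": "2.99"`), so the function leaves strings without a comma unchanged.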

The URL has page=1, so I can change that value to load other pages.

But I will use a dictionary with params to make it simpler:

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}

payload['page'] = 1  # 2, 3, etc.

r = requests.get(url, params=payload)
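
If you don't want to copy the parameters by hand, the standard library can split the URL found in DevTools into the same kind of dictionary. This is only a sketch to show how the query string maps to `payload`; the variable names are my own.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

full_url = 'https://www.atacadao.com.br/catalogo/search/?q=&category_id=null&category[]=bebidas&page=1&order_by=-relevance'

parts = urlsplit(full_url)

# keep_blank_values=True keeps the empty `q=` parameter
params = dict(parse_qsl(parts.query, keep_blank_values=True))
print(parts.path)
print(params)

# going the other way: build the query string for another page
params['page'] = 2
print(urlencode(params))
```

Note that `parse_qsl` returns every value as a string, and that `dict()` keeps only the last value of a repeated key, which is fine here because each key appears once.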

Full code

import requests

url = 'https://www.atacadao.com.br/catalogo/search/'

payload = {
    'q': '',
    'category_id': 'null',
    'category[]': 'bebidas',
    'page': 1,
    'order_by': '-relevance'
}

for number in range(1, 6):
    print('\n=== page:', number, '===\n')
    
    payload['page'] = number
    
    r = requests.get(url, params=payload)
    #print(r.text[:1000])

    data = r.json()

    for item in data['results']:  # add `[:3]` to show only the first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')

Result (excerpt):

=== page: 1 ===

name: Refrigerante
price: {'price': '2,05', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD BELÉM', 'distributor_id': 84022367}
url: /refrigerante-coca-cola-lata-350ml
---
name: Refrigerante
price: {'price': '7,12', 'multiplier': 6.0, 'distributor_name': 'PMG', 'distributor_id': 4921}
url: /refrigerante-coca-cola-pet-2litros
---
name: Whisky
price: {'price': '89,00', 'multiplier': 12.0, 'distributor_name': 'ATACADÃO CD BETIM', 'distributor_id': 74133922}
url: /whisky-red-label-johnnie-walker-garrafa-1litro
---

=== page: 2 ===

name: Whisky
price: {'price': '39,83', 'multiplier': 1.0, 'distributor_name': 'ATACADÃO CD IGARASSU', 'distributor_id': 95849062}
url: /whisky-escoces-passport-garrafa-1litro
---
name: Refrigerante
price: {'price': '1,95', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD MANAUS', 'distributor_id': 84019700}
url: /refrigerante-laranja-fanta-lata-350ml
---
name: Suco Integral
price: {'price': '10,97', 'multiplier': 6.0, 'distributor_name': 'ATACADÃO CD VILA VELHA', 'distributor_id': 96142380}
url: /suco-integral-sabor-uva-aurora-vidro-15litros
---

BTW:

In JSON you can see

"paginator": {"page_range": [1, 2, 3, 4, 5], "page_number": 1, "last": 157, "first": "", "previous": "", "next": 2}

and you could use a while True loop with number = data["paginator"]["next"] to load all pages.

I checked that the last page has an empty string in next.

number = "1"

while True:

    print('\n=== page:', number, '===\n')
    
    payload['page'] = number
    
    r = requests.get(url, params=payload)
    #print(r.text[:1000])   # show only beginning of data

    data = r.json()

    for item in data['results'][:3]:   # show only first three results
        #print(item.keys())
        print('name:', item['name'])
        print('price:', item['price'])
        print('url:', item['url'])
        print('---')
        
    number = data['paginator']['next']  # empty string on the last page
    
    if not number:
        break
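
The same idea can be wrapped in a generator, which keeps the paging logic apart from the printing. This is a sketch, not a tested scraper: `iter_pages`, `fetch_page` and `max_pages` are hypothetical names, and the fake pages below only imitate the `paginator` structure shown above so the loop can be tried without hitting the site.

```python
def iter_pages(fetch_page, first=1, max_pages=1000):
    """Yield decoded JSON pages, following data['paginator']['next'].

    fetch_page is any callable that takes a page number and returns the
    decoded JSON dict; max_pages is a safety cap against endless loops.
    """
    number = first
    for _ in range(max_pages):
        data = fetch_page(number)
        yield data
        number = data['paginator']['next']
        if not number:  # empty string on the last page
            break


# stand-in for the real request, so the loop can be tried offline
FAKE_PAGES = {
    1: {'paginator': {'next': 2}, 'results': [{'name': 'Refrigerante'}]},
    2: {'paginator': {'next': 3}, 'results': [{'name': 'Whisky'}]},
    3: {'paginator': {'next': ''}, 'results': [{'name': 'Suco Integral'}]},
}


def fake_fetch(number):
    return FAKE_PAGES[number]


names = [item['name']
         for data in iter_pages(fake_fetch)
         for item in data['results']]
print(names)
```

With requests, `fetch_page` could be something like `lambda n: requests.get(url, params={**payload, 'page': n}).json()`, using the `url` and `payload` defined earlier.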

I put the code from my answer on GitHub, in python-examples, in the folder __scraping__.

Comments

Thank you very much, it worked! And thanks also for the explanation. Some concepts are new to me, because I started learning how to retrieve API data only this week, and so far I have pulled API data with requests from just one website. I'll study your code more closely, and I'll probably come back with some questions, if that's OK with you, haha. I'm also going to follow you on GitHub! Regards :)
BTW: there is a site with fake data for learning scraping, toscrape.com, created by the authors of the Scrapy module.
