
Hello Stack Overflow contributors!

I want to scrape multiple pages of a news website, but it throws an error at this step:

 response = requests.get(page, headers=user_agent)

The error message is

AttributeError: 'int' object has no attribute 'get'

The relevant lines of code are:

import requests
from time import time

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}

# controlling the crawl-rate
start_time = time()
request = 0

def scrape(url):
    urls = [url + str(x) for x in range(0, 10)]
    for page in urls:
        response = requests.get(page, headers=user_agent)
    print(page)

print(scrape('https://nypost.com/search/China+COVID-19/page/'))

More specifically, this page and pages next to it are what I want to scrape:

https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Any help would be greatly appreciated!

  • You most probably defined requests somewhere else in your code with an integer value. Commented May 14, 2020 at 18:41
  • Adding to @Shreya's comment: rename the variable request (e.g. to attempts) so it can't be confused with the module, and use an f-string instead of format if you're using Python 3.6+: f'Request: {attempts}; Frequency: {attempts/elapsed_time} request/s' Commented May 14, 2020 at 18:48
  • True! I defined requests somewhere else; after removing it, the code works. Commented May 14, 2020 at 18:51
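The shadowing diagnosed in the comments can be reproduced in a few lines with no network access at all; the integer assignment here is a hypothetical stand-in for whatever counter was defined elsewhere in the asker's script:

```python
# Minimal sketch of the bug: rebinding the name 'requests' (e.g. as a
# counter) shadows the imported module, so the later attribute lookup
# fails exactly as in the question.
requests = 0  # hypothetical leftover assignment elsewhere in the script

try:
    requests.get("https://example.com")  # 'requests' is now an int, not the module
except AttributeError as e:
    print(e)  # 'int' object has no attribute 'get'
```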

1 Answer

For me this code runs fine. I did have to move request = 0 inside your function. Make sure you do not mix up the module requests with your variable request.

import requests
from random import randint
from time import sleep, time
from warnings import warn
from bs4 import BeautifulSoup as bs


user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}

# controlling the crawl-rate
start_time = time() 

def scrape(url):
    request = 0
    urls = [f"{url}{x}" for x in range(0,10)]
    params = {
       "orderby": "relevance",
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)   

        #pause the loop
        sleep(randint(8,15))

        #monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
#         clear_output(wait = True)

        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        #Break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break

        #parse the content
        soup_page = bs(response.text, 'lxml') 
        
print(scrape('https://nypost.com/search/China+COVID-19/page/'))

7 Comments

Thanks for the help! I wonder, is there a way to add a query string to the URL? I aim to scrape this page and the following 9 pages: nypost.com/search/China+COVID-19/page/1/?orderby=relevance
I am not sure what you mean. You could still use urls = [f"{url}{x}" for x in range(0,10)] and add params to your requests.get() as requests.get(url=page, headers=user_agent, params=params) where params = {"orderby": "relevance"}.
Sorry for being unclear! I meant I want to scrape the pages sorted by relevance; the URL is in my last comment. I think you understood me correctly. However, the adjusted code you posted doesn't work! I printed the pages after for page in urls: response = requests.get(url=page, headers=user_agent, params=params)
It gives me the pages without params (without "?orderby=relevance"). I would appreciate it if you could help me fix it.
This code works for me. If you print response.url you see that the params are added.
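To see concretely what requests does with the params dict, the equivalent query string can be built with the standard library alone; the base URL below is the first search page from the question, and the manual "/?" join is just a sketch of what requests produces in response.url:

```python
from urllib.parse import urlencode

# Sketch of how a params dict becomes a URL-encoded query string appended
# to the page URL, mirroring requests.get(..., params=params).
base = "https://nypost.com/search/China+COVID-19/page/1"
params = {"orderby": "relevance"}

full_url = f"{base}/?{urlencode(params)}"
print(full_url)  # https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance
```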
