
Hello Stack Overflow contributors!

I want to scrape multiple pages of a news website, but it throws an error at this step:

 response = requests.get(page, headers=user_agent)

The error message is

AttributeError: 'int' object has no attribute 'get'

The relevant lines of code are:

import requests
from time import time

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}

# controlling the crawl-rate
start_time = time()
request = 0

def scrape(url):
    urls = [url + str(x) for x in range(0, 10)]
    for page in urls:
        response = requests.get(page, headers=user_agent)
    print(page)

print(scrape('https://nypost.com/search/China+COVID-19/page/'))

More specifically, this page and pages next to it are what I want to scrape:

https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance

Any help would be greatly appreciated!

  • You most probably defined requests somewhere else in your code with an integer value. Commented May 14, 2020 at 18:41
  • Adding to @Shreya's comment: rename the variable request (e.g. to attempts) so it can't be confused with the module, and use an f-string instead of format if you're using Python 3.6+: f'Request: {attempts}; Frequency: {attempts/elapsed_time} request/s' Commented May 14, 2020 at 18:48
  • True! I defined requests somewhere else; after removing it, the code works. Commented May 14, 2020 at 18:51
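The shadowing diagnosed in the comments can be reproduced in a few lines with no network access at all; the integer assignment here is a hypothetical stand-in for whatever counter was defined elsewhere in the asker's script:

```python
# Minimal sketch of the bug: rebinding the name 'requests' (e.g. as a
# counter) shadows the imported module, so the later attribute lookup
# fails exactly as in the question.
requests = 0  # hypothetical leftover assignment elsewhere in the script

try:
    requests.get("https://example.com")  # 'requests' is now an int, not the module
except AttributeError as e:
    print(e)  # 'int' object has no attribute 'get'
```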

1 Answer

For me this code runs fine. I did have to move request = 0 inside your function. Make sure you do not mix up the module requests with your variable request.

import requests
from random import randint
from time import sleep, time
from warnings import warn
from bs4 import BeautifulSoup as bs


user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; rv:11.0) like Gecko'}

# controlling the crawl-rate
start_time = time() 

def scrape(url):
    request = 0
    urls = [f"{url}{x}" for x in range(0,10)]
    params = {
       "orderby": "relevance",
    }
    for page in urls:
        response = requests.get(url=page,
                                headers=user_agent,
                                params=params)   

        #pause the loop
        sleep(randint(8,15))

        #monitor the requests
        request += 1
        elapsed_time = time() - start_time
        print('Request:{}; Frequency: {} request/s'.format(request, request/elapsed_time))
#         clear_output(wait = True)

        #throw a warning for non-200 status codes
        if response.status_code != 200:
            warn('Request: {}; Status code: {}'.format(request, response.status_code))

        #Break the loop if the number of requests is greater than expected
        if request > 72:
            warn('Number of request was greater than expected.')
            break

        #parse the content
        soup_page = bs(response.text, 'lxml') 
        
print(scrape('https://nypost.com/search/China+COVID-19/page/'))

7 Comments

Thanks for the help! I wonder, is there a way to add a query string to the URL? I aim to scrape this page and the following 9 pages: nypost.com/search/China+COVID-19/page/1/?orderby=relevance
I am not sure what you mean. You could still use urls = [f"{url}{x}" for x in range(0,10)] and add params to your requests.get() as requests.get(url=page, headers=user_agent, params=params) where params = {"orderby": "relevance"}.
Sorry for being unclear! I meant I want to scrape the pages sorted by relevance; the URL is in my last comment. I think you understood me correctly. However, the adjusted code you posted doesn't work! I printed the pages after for page in urls: response = requests.get(url=page, headers=user_agent, params=params)
It gives me the pages without params (without "?orderby=relevance"). I would appreciate it if you could help me fix it.
This code works for me. If you print response.url you see that the params are added.
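To see concretely what requests does with the params dict, the equivalent query string can be built with the standard library alone; the base URL below is the first search page from the question, and the manual "/?" join is just a sketch of what requests produces in response.url:

```python
from urllib.parse import urlencode

# Sketch of how a params dict becomes a URL-encoded query string appended
# to the page URL, mirroring requests.get(..., params=params).
base = "https://nypost.com/search/China+COVID-19/page/1"
params = {"orderby": "relevance"}

full_url = f"{base}/?{urlencode(params)}"
print(full_url)  # https://nypost.com/search/China+COVID-19/page/1/?orderby=relevance
```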
