
I have a list of user ids and I'm interested in crawling their reputation.

I wrote a script using BeautifulSoup that crawls users' reputation. But the problem is, I get a Too Many Requests error when my script has run for less than a minute. After that, I am unable to open Stack Overflow manually in a browser either.

My question is, how do I crawl the reputation without getting the Too Many Requests error?

My code is given below:

from requests import get
from bs4 import BeautifulSoup

for id in df['target']:
    url = 'https://stackoverflow.com/users/' + str(id)
    print(url)
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    site_title = html_soup.find("title").contents[0]
    if "Page Not Found - Stack Overflow" in site_title:
        reputation = "NA"
    else:
        reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', "")
        print(reputation)
    Why are you doing this with a web crawler instead of using the Stack Exchange Data Explorer? Commented Nov 20, 2018 at 20:16
  • Duplicate of stackoverflow.com/questions/22786068/…
  • @Barmar with that I won't get this error?
  • You won't be accessing the webserver at all.
  • Can you please add the imports to the source in the question.
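Following up on the first comment's suggestion to avoid scraping: if you do need live data, the Stack Exchange API's /users endpoint accepts up to 100 semicolon-separated user ids per call, which cuts the request count by two orders of magnitude compared with one page fetch per user. A sketch of building those batched URLs (the helper name and batch logic are mine):

```python
def batched_user_urls(ids, batch_size=100):
    """Build Stack Exchange API /users URLs, up to `batch_size` user ids per call."""
    urls = []
    for i in range(0, len(ids), batch_size):
        batch = ";".join(str(u) for u in ids[i:i + batch_size])
        urls.append("https://api.stackexchange.com/2.3/users/%s?site=stackoverflow" % batch)
    return urls
```

Each response item carries a `reputation` field, so one request can replace a hundred page scrapes.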

2 Answers


I suggest using the Python time module and putting a time.sleep(5) in your for loop. The error comes from making too many requests in too short a period. You may have to experiment with the actual sleep time to get it right, though.
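A minimal sketch of the idea (the 0.01-second delay is just to keep the demo fast; against a real site you would use several seconds):

```python
import time

def throttled(items, delay):
    """Yield items, sleeping `delay` seconds before each one."""
    for item in items:
        time.sleep(delay)
        yield item

start = time.monotonic()
ids = list(throttled([101, 102, 103], 0.01))
print(time.monotonic() - start)  # at least 0.03 seconds for three items
```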


2 Comments

@enjal why do you think that?
Because it's difficult to get the exact sleep time right.

You can check whether response.status_code == 429 and see if the response includes a value telling you how long to wait, then wait that many seconds.
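A small helper along those lines (the function name and the 60-second fallback are mine; as the headers further down show, Stack Overflow can send Retry-After: 0, so a non-positive value falls back to the default):

```python
def retry_delay(headers, default=60):
    """Seconds to wait after a 429, based on a Retry-After header if usable."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        delay = int(value)
    except ValueError:
        return default  # e.g. an HTTP-date form we don't bother parsing
    return delay if delay > 0 else default
```

Usage would be something like `time.sleep(retry_delay(response.headers))` inside the retry loop.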

I reproduced the issue here. I couldn't find any information about how long to wait in either the content or the headers.

I suggest putting in some throttles and adjusting them until you're happy with the results.
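One common way to structure such throttles is exponential backoff: double the wait after each consecutive 429, up to a cap. A sketch (the function name and defaults are mine, with the cap matching the 450-second "whoa" used below):

```python
def backoff_delays(base=2, factor=2, retries=5, cap=450):
    """Delays to sleep before successive retries: base, base*factor, ..., capped."""
    delays = []
    delay = base
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= factor
    return delays
```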

See https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site for an example for getting user reputations from the Stack Exchange Data Explorer.

Example follows.

#!/usr/bin/env python

import time
import requests
from bs4 import BeautifulSoup

df = {}
df['target'] = [ ... ]  # see https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site

throttle = 2  # seconds to sleep between every request
whoa = 450    # seconds to back off after a 429 response

with open('results.txt', 'w') as file_handler:
    file_handler.write('url\treputation\n')
    for id in df['target']:
        time.sleep(throttle)
        url = 'https://stackoverflow.com/users/' + str(id)
        print(url)
        response = requests.get(url)
        while response.status_code == 429:
            print(response.content)
            print(response.headers)
            time.sleep(whoa)
            response = requests.get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        site_title = html_soup.find("title").contents[0]
        if "Page Not Found - Stack Overflow" in site_title:
            reputation = "NA"
        else:
            reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', "")
        print('reputation: %s' % reputation)
        file_handler.write('%s\t%s\n' % (url, reputation))

Example error content.

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Too Many Requests - Stack Exchange</title>
    <style type="text/css">
        body
        {
            color: #333;
            font-family: 'Helvetica Neue', Arial, sans-serif;
            font-size: 14px;
            background: #fff url('img/bg-noise.png') repeat left top;
            line-height: 1.4;
        }
        h1
        {
            font-size: 170%;
            line-height: 34px;
            font-weight: normal;
        }
        a { color: #366fb3; }
        a:visited { color: #12457c; }
        .wrapper {
            width:960px;
            margin: 100px auto;
            text-align:left;
        }
        .msg {
            float: left;
            width: 700px;
            padding-top: 18px;
            margin-left: 18px;
        }
    </style>
</head>
<body>
    <div class="wrapper">
        <div style="float: left;">
            <img src="https://cdn.sstatic.net/stackexchange/img/apple-touch-icon.png" alt="Stack Exchange" />
        </div>
        <div class="msg">
            <h1>Too many requests</h1>
                        <p>This IP address (nnn.nnn.nnn.nnn) has performed an unusual high number of requests and has been temporarily rate limited. If you believe this to be in error, please contact us at <a href="mailto:[email protected]?Subject=Rate%20limiting%20of%20nnn.nnn.nnn.nnn%20(Request%20ID%3A%202158483152-SYD)">[email protected]</a>.</p>
                        <p>When contacting us, please include the following information in the email:</p>
                        <p>Method: rate limit</p>
                        <p>XID: 2158483152-SYD</p>
                        <p>IP: nnn.nnn.nnn.nnn</p>
                        <p>X-Forwarded-For: nnn.nnn.nnn.nnn</p>
                        <p>User-Agent: python-requests/2.20.1</p>
                        <p>Reason: Request rate.</p>
                        <p>Time: Tue, 20 Nov 2018 21:10:55 GMT</p>
                        <p>URL: stackoverflow.com/users/nnnnnnn</p>
                        <p>Browser Location: <span id="jslocation">(not loaded)</span></p>
        </div>
    </div>
    <script>document.getElementById('jslocation').innerHTML = window.location.href;</script>
</body>
</html>

Example error headers.

{ "Content-Length": "2054", "Via": "1.1 varnish", "X-Cache": "MISS", "X-DNS-Prefetch-Control": "off", "Accept-Ranges": "bytes", "X-Timer": "S1542748255.394076,VS0,VE0", "Server": "Varnish", "Retry-After": "0", "Connection": "close", "X-Served-By": "cache-syd18924-SYD", "X-Cache-Hits": "0", "Date": "Tue, 20 Nov 2018 21:10:55 GMT", "Content-Type": "text/html" }

9 Comments

The error headers don't say how long to wait, which is sad. Thanks for the help, but a wait time of 15 minutes is a lot. I was just wondering: if the script keeps checking the response until it is no longer 429 and then resumes normal operation, is a wait time still necessary?
You can adjust the throttle amounts as you wish. If the throttle is set correctly you'll never hit 'whoa'. I'm not sure how long the rate limit lasts. Stack Overflow should send back information saying to wait x number of seconds.
I'm doing a run with throttle = 2, whoa = 450.
Which processed 500 urls with no issues.
It's still running. But I think this will get the job done. Thanks.
