
I have a list of user ids and I'm interested in crawling their reputation.

I wrote a script using BeautifulSoup that crawls users' reputation. But the problem is, I get a Too Many Requests error when my script has run for less than a minute. After that, I am unable to open Stack Overflow manually in a browser either.

My question is, how do I crawl the reputation without getting the Too Many Requests error?

My code is given below:

from requests import get
from bs4 import BeautifulSoup

for id in df['target']:
    url = 'https://stackoverflow.com/users/' + str(id)
    print(url)
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    site_title = html_soup.find("title").contents[0]
    if "Page Not Found - Stack Overflow" in site_title:
        reputation = "NA"
    else:
        reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', "")
        print(reputation)
    Why are you doing this with a web crawler instead of using the Stack Exchange Data Explorer? Commented Nov 20, 2018 at 20:16
  • Duplicate of stackoverflow.com/questions/22786068/…
  • @Barmar with that I won't get this error?
  • You won't be accessing the webserver at all.
  • Can you please add the imports to the source in the question.
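Following up on the first comment's suggestion to avoid scraping: if you do need live data, the Stack Exchange API's /users endpoint accepts up to 100 semicolon-separated user ids per call, which cuts the request count by two orders of magnitude compared with one page fetch per user. A sketch of building those batched URLs (the helper name and batch logic are mine):

```python
def batched_user_urls(ids, batch_size=100):
    """Build Stack Exchange API /users URLs, up to `batch_size` user ids per call."""
    urls = []
    for i in range(0, len(ids), batch_size):
        batch = ";".join(str(u) for u in ids[i:i + batch_size])
        urls.append("https://api.stackexchange.com/2.3/users/%s?site=stackoverflow" % batch)
    return urls
```

Each response item carries a `reputation` field, so one request can replace a hundred page scrapes.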

2 Answers


I suggest using the Python time module and putting a time.sleep(5) in your for loop. The error comes from making too many requests in too short a period. You may have to experiment with the actual sleep time to get it right, though.
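A minimal sketch of the idea (the 0.01-second delay is just to keep the demo fast; against a real site you would use several seconds):

```python
import time

def throttled(items, delay):
    """Yield items, sleeping `delay` seconds before each one."""
    for item in items:
        time.sleep(delay)
        yield item

start = time.monotonic()
ids = list(throttled([101, 102, 103], 0.01))
print(time.monotonic() - start)  # at least 0.03 seconds for three items
```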


2 Comments

@enjal why do you think that?
Because it's difficult to get the exact sleep time right.

You can check whether response.status_code == 429 and see if the response includes a value telling you how long to wait, then wait that many seconds.
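A small helper along those lines (the function name and the 60-second fallback are mine; as the headers further down show, Stack Overflow can send Retry-After: 0, so a non-positive value falls back to the default):

```python
def retry_delay(headers, default=60):
    """Seconds to wait after a 429, based on a Retry-After header if usable."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        delay = int(value)
    except ValueError:
        return default  # e.g. an HTTP-date form we don't bother parsing
    return delay if delay > 0 else default
```

Usage would be something like `time.sleep(retry_delay(response.headers))` inside the retry loop.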

I reproduced the issue here. I couldn't find any information about how long to wait in either the content or the headers.

I suggest putting in some throttles and adjusting them until you're happy with the results.
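One common way to structure such throttles is exponential backoff: double the wait after each consecutive 429, up to a cap. A sketch (the function name and defaults are mine, with the cap matching the 450-second "whoa" used below):

```python
def backoff_delays(base=2, factor=2, retries=5, cap=450):
    """Delays to sleep before successive retries: base, base*factor, ..., capped."""
    delays = []
    delay = base
    for _ in range(retries):
        delays.append(min(delay, cap))
        delay *= factor
    return delays
```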

See https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site for an example for getting user reputations from the Stack Exchange Data Explorer.

Example follows.

#!/usr/bin/env python

import time
import requests
from bs4 import BeautifulSoup

df = {}
df['target'] = [ ... ]  # see https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site

throttle = 2  # seconds to sleep between every request
whoa = 450    # seconds to back off after a 429 response

with open('results.txt', 'w') as file_handler:
    file_handler.write('url\treputation\n')
    for id in df['target']:
        time.sleep(throttle)
        url = 'https://stackoverflow.com/users/' + str(id)
        print(url)
        response = requests.get(url)
        while response.status_code == 429:
            print(response.content)
            print(response.headers)
            time.sleep(whoa)
            response = requests.get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        site_title = html_soup.find("title").contents[0]
        if "Page Not Found - Stack Overflow" in site_title:
            reputation = "NA"
        else:
            reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', "")
        print('reputation: %s' % reputation)
        file_handler.write('%s\t%s\n' % (url, reputation))

Example error content.

<!DOCTYPE html>
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Too Many Requests - Stack Exchange</title>
    <style type="text/css">
        body
        {
            color: #333;
            font-family: 'Helvetica Neue', Arial, sans-serif;
            font-size: 14px;
            background: #fff url('img/bg-noise.png') repeat left top;
            line-height: 1.4;
        }
        h1
        {
            font-size: 170%;
            line-height: 34px;
            font-weight: normal;
        }
        a { color: #366fb3; }
        a:visited { color: #12457c; }
        .wrapper {
            width:960px;
            margin: 100px auto;
            text-align:left;
        }
        .msg {
            float: left;
            width: 700px;
            padding-top: 18px;
            margin-left: 18px;
        }
    </style>
</head>
<body>
    <div class="wrapper">
        <div style="float: left;">
            <img src="https://cdn.sstatic.net/stackexchange/img/apple-touch-icon.png" alt="Stack Exchange" />
        </div>
        <div class="msg">
            <h1>Too many requests</h1>
                        <p>This IP address (nnn.nnn.nnn.nnn) has performed an unusual high number of requests and has been temporarily rate limited. If you believe this to be in error, please contact us at <a href="mailto:[email protected]?Subject=Rate%20limiting%20of%20nnn.nnn.nnn.nnn%20(Request%20ID%3A%202158483152-SYD)">[email protected]</a>.</p>
                        <p>When contacting us, please include the following information in the email:</p>
                        <p>Method: rate limit</p>
                        <p>XID: 2158483152-SYD</p>
                        <p>IP: nnn.nnn.nnn.nnn</p>
                        <p>X-Forwarded-For: nnn.nnn.nnn.nnn</p>
                        <p>User-Agent: python-requests/2.20.1</p>
                        <p>Reason: Request rate.</p>
                        <p>Time: Tue, 20 Nov 2018 21:10:55 GMT</p>
                        <p>URL: stackoverflow.com/users/nnnnnnn</p>
                        <p>Browser Location: <span id="jslocation">(not loaded)</span></p>
        </div>
    </div>
    <script>document.getElementById('jslocation').innerHTML = window.location.href;</script>
</body>
</html>

Example error headers.

{ "Content-Length": "2054", "Via": "1.1 varnish", "X-Cache": "MISS", "X-DNS-Prefetch-Control": "off", "Accept-Ranges": "bytes", "X-Timer": "S1542748255.394076,VS0,VE0", "Server": "Varnish", "Retry-After": "0", "Connection": "close", "X-Served-By": "cache-syd18924-SYD", "X-Cache-Hits": "0", "Date": "Tue, 20 Nov 2018 21:10:55 GMT", "Content-Type": "text/html" }

9 Comments

The error headers don't say how long to wait, which is sad. Thanks for the help, but a wait time of 15 minutes is a lot. I was just wondering: if the script keeps checking the response until it is no longer 429 and then resumes normal operation, is a wait time still necessary?
You can adjust the throttle amounts as you wish. If the throttle is set correctly you'll never hit 'whoa'. I'm not sure how long the rate limit lasts. Stack Overflow should send back information saying to wait x number of seconds.
I'm doing a run with throttle = 2, whoa = 450.
Which processed 500 urls with no issues.
It's still running. But I think this will get the job done. Thanks.
