The following code returns only 20 questions. How can I retrieve all the questions for that date range?

import requests

# tag, start_date and end_date are defined elsewhere
base_url = 'https://api.stackexchange.com/2.3'
endpoint = '/questions'
params = {
    'site': 'stackoverflow',
    'tagged': tag,
    'fromdate': start_date,
    'todate': end_date,
    'filter': 'default'  # Use 'default' or specify your desired filter
}

response = requests.get(base_url + endpoint, params=params)
data = response.json()
  • see page and pagesize parameters here Commented Jun 8, 2023 at 6:36
  • thank you - can you please elaborate? I looked at the page, but I can't figure out how to tackle the problem. Many thanks! Commented Jun 8, 2023 at 6:48
  • What is the problem? All APIs use paging for scalability reasons. Returning a ton of results means a lot of RAM and a lot of IO is used, while the database remains locked for a long time. All APIs use some kind of paging instead Commented Jun 8, 2023 at 7:14
  • "I looked at the page, but I can't figure out how to tackle the problem". To be clear: the problem is that when you use the API, you only get 20 results at a time, right? See how it shows you that there is a pagesize parameter you can use when you make the API query? What happens if you try using different values for that? Do you see how that corresponds to the number of results you get? (What do you suppose a "page" might refer to, in this context? What would it mean if the "size" of the "page" changes?) Commented Jun 9, 2023 at 3:59
  • every query to the API returns a 'has_more' boolean, so that one can increase the page and get the next batch of results. the question has been answered. Commented Jun 9, 2023 at 9:36

1 Answer

According to the paging documentation, pagesize can be any value between 0 and 100 and defaults to 30. If you only get 20 questions with the default values, it is probably because only that many questions match your tag in the given time span (I can't tell, as the values aren't included). Otherwise you would get 30 results and would need to step through the remaining pages of results with the page parameter, like so:

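import requests

# tag, start_date and end_date come from the question's code
base_url = 'https://api.stackexchange.com/2.3'
endpoint = '/questions'
page = 1
pagesize = 30
questions = []   # items collected from every page
has_more = True
while has_more:
  params = {
    'site': 'stackoverflow',
    'page': page,
    'pagesize': pagesize,
    'tagged': tag,
    'fromdate': start_date,
    'todate': end_date,
    'filter': 'default'  # Use 'default' or specify your desired filter
  }
  page_results = requests.get(base_url + endpoint, params=params).json()
  questions.extend(page_results.get('items', []))
  # 'has_more' is True while further pages remain
  has_more = page_results.get('has_more', False)
  page += 1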
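
A slightly fuller sketch of the same loop, under a few assumptions: fetch_all_questions and fetch_page are hypothetical helper names, the page size is raised to the maximum of 100 to reduce the number of requests, max_pages is an arbitrary safety cap, and the loop sleeps whenever a response carries the API's optional backoff field. The fetch argument exists only so the paging logic can be exercised without network access.

```python
import time

BASE_URL = "https://api.stackexchange.com/2.3"


def fetch_page(params):
    """Perform one real request; needs the third-party requests package."""
    import requests  # imported lazily so the paging logic can run offline

    response = requests.get(BASE_URL + "/questions", params=params)
    response.raise_for_status()
    return response.json()


def fetch_all_questions(tag, fromdate, todate, fetch=fetch_page, max_pages=250):
    """Collect the 'items' of every page until 'has_more' is false."""
    questions = []
    page = 1
    while page <= max_pages:  # safety cap (assumed value) against runaway loops
        params = {
            "site": "stackoverflow",
            "tagged": tag,
            "fromdate": fromdate,  # Unix epoch seconds
            "todate": todate,      # Unix epoch seconds
            "page": page,
            "pagesize": 100,
        }
        data = fetch(params)
        questions.extend(data.get("items", []))
        if not data.get("has_more"):
            break
        # Wait if the API asks for it before sending the next request.
        time.sleep(data.get("backoff", 0))
        page += 1
    return questions
```

Note that fromdate and todate are Unix epoch timestamps, and that every page counts against the API's request quota, so a wide date range on a busy tag can take many requests.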

16 Comments

There are no such operators as || and ++ in Python.
@Dimitris you can't get all questions, in any API, not just StackOverflow's. Unless the API's creator knows only a few items will be returned, everyone implements some kind of paging.
No, what you tried to do would be very clunky. It would prevent you from asking this question, because the server would be frozen trying to return the 1M questions per day
All the questions for all time, one day at a time, is still all questions, so you'll end up downloading everything. If you check the actual data dump you'll see the actual English SO Posts file is 18GB zipped. You'll probably download that file faster than trying to retrieve the same contents through the API. Download tools can easily recover from network problems, download in parallel or retry in chunks
@safir The len(page_results) == pagesize isn't the best approach: page_results["has_more"] gives you a True if there are more items (and False otherwise).
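
To illustrate the point about has_more, here is a small offline sketch. fake_api is a made-up stand-in for a paged endpoint, not part of the Stack Exchange API. When the total number of items is an exact multiple of pagesize, stopping on a short page costs one extra request that returns nothing, while has_more stops right away.

```python
def fake_api(total, page, pagesize):
    """Simulate a paged endpoint holding `total` items (illustration only)."""
    start = (page - 1) * pagesize
    items = list(range(start, min(start + pagesize, total)))
    return {"items": items, "has_more": start + pagesize < total}


def count_requests(total, pagesize, stop_on_has_more):
    """Return how many requests are needed to read every item."""
    page = 1
    while True:
        data = fake_api(total, page, pagesize)
        if stop_on_has_more:
            done = not data["has_more"]
        else:
            # Alternative rule: stop once a page comes back short.
            done = len(data["items"]) < pagesize
        if done:
            return page
        page += 1


print(count_requests(60, 30, stop_on_has_more=True))   # 2
print(count_requests(60, 30, stop_on_has_more=False))  # 3
```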