Scraping issue with Python 3.6: only the first page is returned

Question

I am trying to get data from a public site using Python. The site offers different types of searches, one of which is a search by letter. When I search for the letter 'A', it sends a GET request and the response comes back from the URL below.

http://www.museumsusa.org/museums/?k=1271393%2cAlpha%3aA%3bDirectoryID%3a200454

But it displays only the first page, and I get all the data on that first page. When I click on the second page, it sends a request through the JavaScript postback function to the same URL that is used for the GET request, but with different parameters:

data={
'__EVENTTARGET':"ctl08$ctl00$BottomPager$Page2",
'__EVENTARGUMENT':"",
'__VIEWSTATE':VIEWSTATE,
'__EVENTVALIDATION':EVENTVALIDATION,
'ctl04$phrase':"",
'ctl04$directoryList':"/museums/|/museums/search/"
}

In __EVENTTARGET it sends the page name. I have successfully extracted the VIEWSTATE and EVENTVALIDATION values, but whenever I send a POST request I always get the first page. This is my complete code:

import requests
import json
from bs4 import BeautifulSoup
import urllib



url="http://www.museumsusa.org/museums/?k=1271393%2cAlpha%3aA%3bDirectoryID%3a200454";
headers={
    "User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) "
                 "Chrome/60.0.3112.101 Safari/537.36",
    "Content-Type":"application/x-www-form-urlencoded"}

session = requests.Session()
session.headers.update(headers)
r=session.get(url)
soup=BeautifulSoup(r.content)
#?k=1271393%2cAlpha%3aA%3bDirectoryID%3a200454
VIEWSTATE=soup.find(id="__VIEWSTATE")['value']
#VIEWSTATEGENERATOR=soup.find(id="__VIEWSTATEGENERATOR")['value']
EVENTVALIDATION=soup.find(id="__EVENTVALIDATION")['value']


data_in={
'__EVENTTARGET':"ctl08$ctl00$BottomPager$Page2",
'__EVENTARGUMENT':"",
'__VIEWSTATE':VIEWSTATE,
'__EVENTVALIDATION':EVENTVALIDATION,
'ctl04$phrase':"",
'ctl04$directoryList':"/museums/|/museums/search/"
#"k":"1271393,Alpha:A;DirectoryID:200454"
      }


r2 = session.post(url, data=json.dumps(data_in))

print (r2)

How can I get the data from the other pages? This script always returns the data of the first page, no matter which page number I try. I am using Python 3.6 on Mac OS X.

t.m.adam · Accepted Answer · 2017-08-19 19:05:25Z

1

You can go to the next page by changing the value of data_in['__EVENTTARGET'] to "ctl08$ctl00$BottomPager$Next". Then use a for loop to fetch a specific number of pages, e.g. 10:

import requests
from bs4 import BeautifulSoup

url = "http://www.museumsusa.org/museums/?k=1271393%2cAlpha%3aA%3bDirectoryID%3a200454"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko)"
}
session = requests.Session()
session.headers.update(headers)
r = session.get(url)  # plain GET returns the first page
pages = 10

for _ in range(pages):
    # Re-read the hidden form fields from the most recent response.
    soup = BeautifulSoup(r.content, 'html.parser')
    VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
    EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
    data_in = {
        '__EVENTTARGET': 'ctl08$ctl00$BottomPager$Next',
        '__EVENTARGUMENT': "",
        '__VIEWSTATE': VIEWSTATE,
        '__EVENTVALIDATION': EVENTVALIDATION,
        'ctl04$phrase': "",
        'ctl04$directoryList': "/museums/|/museums/search/"
    }
    # Pass the dict itself (not a JSON string) so requests form-encodes it.
    r = session.post(url, data=data_in)
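
One difference from the question's code that is easy to miss: the question posts data=json.dumps(data_in), while the code above posts the dict directly. When requests is given a dict it form-encodes it, but when it is given a string it sends that string as the body unchanged. The sketch below only illustrates that difference using the standard library. The field values are placeholders (assumptions, not real site data), and whether this alone explains the first-page result on this particular site is not confirmed by the thread.

```python
import json
from urllib.parse import urlencode, parse_qs

# Placeholder values (assumptions); the real ones come from the page's hidden inputs.
data_in = {
    '__EVENTTARGET': 'ctl08$ctl00$BottomPager$Next',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': 'dummy-viewstate',
    '__EVENTVALIDATION': 'dummy-validation',
}

# Roughly the body sent for data=data_in: a form-encoded string.
form_body = urlencode(data_in)

# The body sent for data=json.dumps(data_in): the JSON text itself.
json_body = json.dumps(data_in)

# A form parser recovers the individual fields from the first body...
form_fields = parse_qs(form_body, keep_blank_values=True)
# ...but sees the JSON text as one opaque key, so __EVENTTARGET is missing.
json_fields = parse_qs(json_body, keep_blank_values=True)

print(form_fields['__EVENTTARGET'])
print('__EVENTTARGET' in json_fields)
```

A server that expects form fields would therefore likely not see any postback target in the JSON variant.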


4 Comments

Ahsan Mukhtar Over a year ago

Let me try that.

Ahsan Mukhtar Over a year ago

Is there a command for the previous page? Something like ctl08$ctl00$BottomPager$Previous.

Ahsan Mukhtar Over a year ago

What if I have to go two pages ahead?

t.m.adam Over a year ago

I'm afraid the site won't allow you to select an arbitrary page (e.g. "ctl08$ctl00$BottomPager$Page4"). If you want to skip a page, you can just ignore the result; you can track the current page with the value of _ + 1. However, you can go back a page with "ctl08$ctl00$BottomPager$Prev".
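
To make the "skip pages by ignoring the result" idea concrete, here is a minimal sketch that follows "Next" postbacks from page 1 until a target page is reached. The names hidden_fields, walk_to_page, post and extra are hypothetical helpers introduced for illustration. It uses the standard-library HTML parser instead of BeautifulSoup and assumes that every hidden input of the current page should be sent back, which is a guess about this site rather than something the thread confirms.

```python
from html.parser import HTMLParser

NEXT = 'ctl08$ctl00$BottomPager$Next'  # event target taken from the answer


class HiddenFields(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        is_hidden = (a.get('type') or '').lower() == 'hidden'
        if tag == 'input' and is_hidden and a.get('name'):
            self.fields[a['name']] = a.get('value') or ''


def hidden_fields(html):
    """Return the hidden form fields found in an HTML string."""
    parser = HiddenFields()
    parser.feed(html)
    return parser.fields


def walk_to_page(first_html, target_page, post, extra=None):
    """Follow 'Next' postbacks from page 1 until target_page is reached.

    post  -- hypothetical callable: takes the form dict, returns the HTML text
    extra -- optional non-hidden fields to send along (e.g. 'ctl04$phrase')
    """
    html = first_html
    for _ in range(target_page - 1):
        data = hidden_fields(html)       # fresh __VIEWSTATE etc. for this page
        data.update(extra or {})
        data['__EVENTTARGET'] = NEXT     # ask for the following page
        data['__EVENTARGUMENT'] = ''
        html = post(data)                # intermediate pages are simply discarded
    return html
```

With requests, post could be something like lambda data: session.post(url, data=data).text, and first_html would be session.get(url).text. Keeping the HTTP call behind a callable also makes the paging logic easy to try out without touching the site.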
