I'm trying to access search results on the NCBI Images search page (http://www.ncbi.nlm.nih.gov/images) in a script. I want to feed it a search term, report on all of the results, and then move on to the next search term. To do this I need to reach the results pages after the first one, so I'm trying to use Python's mechanize module to do it:

import mechanize
browser=mechanize.Browser()
page1=browser.open('http://www.ncbi.nlm.nih.gov/images?term=drug')
a=browser.links(text_regex='Next')
nextlink=a.next()
page2=browser.follow_link(nextlink)

This just gives me back the first page of search results again (in variable page2). What am I doing wrong, and how can I get to that second page and beyond?

1 Answer

Unfortunately, that page uses JavaScript to POST 2459 bytes of form variables to the server just to navigate to a subsequent page. Here are a few of the variables (I count 38 vars in total):

EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.Term=drug
EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.CurrDb=images
EntrezSystem2.PEntrez.ImagesDb.Images_ResultsPanel.Entrez_Pager.CurrPage=2

You'll need to construct a POST request to the server containing some or all of these variables. Luckily, if you get it working for page 2, you can simply increment CurrPage and send another POST to get each subsequent page of results (no need to extract links).
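
The increment-CurrPage idea can be sketched roughly as follows. This is only an illustration in Python 3 using the standard library, not the exact request the site expects: it uses just the three variables listed above, the helper name build_page_body is made up for this example, and a real request would most likely need more of the 38 variables.

```python
from urllib.parse import urlencode

PAGER_KEY = ('EntrezSystem2.PEntrez.ImagesDb.'
             'Images_ResultsPanel.Entrez_Pager.CurrPage')

# Only the three variables quoted above; the real form is assumed to need more.
BASE_PARAMS = {
    'EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.Term': 'drug',
    'EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.CurrDb': 'images',
    PAGER_KEY: '2',
}

def build_page_body(base_params, page):
    """Return a URL-encoded POST body asking for the given results page."""
    params = dict(base_params)      # copy so the template is not mutated
    params[PAGER_KEY] = str(page)   # the only field that changes per page
    return urlencode(params)

# One body per page; each would be sent as the data of a separate POST.
bodies = [build_page_body(BASE_PARAMS, page) for page in range(2, 5)]
for body in bodies:
    print(body)
```

Each string in bodies differs only in the CurrPage value, which is why no link extraction is needed once one page works.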

Update - That site is a total pain-in-the-ass, but here is a POST-based scrape of pages 2 through N. Set MAX_PAGE to the highest page number + 1. The script will produce files like file_000003.html.

Note: Before you use it, you need to replace POSTDATA with the contents of this paste blob (it expires in 1 month). It's just the body of a POST request as captured by Firebug, which I use to seed the correct params:

import cookielib
import mechanize
import urllib
import urlparse

MAX_PAGE = 6
TERM = 'drug'
DEBUG = False

base_url = 'http://www.ncbi.nlm.nih.gov/images?term=' + TERM
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.set_handle_referer(True)
browser.set_debug_http(DEBUG)
browser.set_debug_responses(DEBUG)
cjar = cookielib.CookieJar()
browser.set_cookiejar(cjar)

# make first GET request. this will populate the cookie
res = browser.open(base_url)

def write(num, data):
    with open('file_%06d.html' % num, 'wb') as out:
        out.write(data)

def encode(kvs):
    res = []
    for key, vals in kvs.iteritems():
        if isinstance(vals, list):
            for v in vals:
                res.append('%s=%s' % (key, urllib.quote(v)))
        else:
            res.append('%s=%s' % (key, urllib.quote(vals)))
    return '&'.join(res)

write(1, res.read())

# set this var equal to the contents of this: http://pastebin.com/UfejW3G0
POSTDATA = '''<post data>'''

# parse the captured URL-encoded POST body into a dict of POST parameters
PREFIX1 = 'EntrezSystem2.PEntrez.ImagesDb.'
PREFIX2 = 'EntrezSystem2.PEntrez.DbConnector.'
params = dict((k, v[0]) for k, v in urlparse.parse_qs(POSTDATA).iteritems())

base_url = 'http://www.ncbi.nlm.nih.gov/images'
for page in range(2, MAX_PAGE):
    params[PREFIX1 + 'Images_ResultsPanel.Entrez_Pager.CurrPage'] = str(page)
    params[PREFIX1 + 'Images_ResultsPanel.Entrez_Pager.cPage'] = [str(page-1)]*2

    data = encode(params)
    req = mechanize.Request(base_url, data)
    cjar.add_cookie_header(req)
    req.add_header('Content-Type', 'application/x-www-form-urlencoded')
    req.add_header('Referer', base_url)
    res = browser.open(req)

    write(page, res.read())
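
For anyone running this on Python 3, where cookielib, urlparse and urllib.quote have moved, the parameter handling can probably be done with urllib.parse alone. The sketch below covers only the parse/modify/encode steps, not the HTTP request itself. SAMPLE_BODY is a made-up, shortened stand-in for the captured POST body (example_blank_field is not a real form variable), and keep_blank_values=True is an assumption that the server wants empty fields sent back.

```python
from urllib.parse import parse_qs, urlencode

PREFIX1 = 'EntrezSystem2.PEntrez.ImagesDb.'
PAGER = PREFIX1 + 'Images_ResultsPanel.Entrez_Pager.'

# Hypothetical, shortened stand-in for the captured POST body.
SAMPLE_BODY = (
    'EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.Term=drug'
    '&EntrezSystem2.PEntrez.ImagesDb.Images_SearchBar.CurrDb=images'
    '&EntrezSystem2.PEntrez.ImagesDb.Images_ResultsPanel.Entrez_Pager.CurrPage=1'
    '&example_blank_field='
)

def params_for_page(body, page):
    """Parse a captured POST body and point its pager fields at `page`."""
    # parse_qs maps each key to a list of values; keep blanks so that
    # empty form fields are not silently dropped.
    params = parse_qs(body, keep_blank_values=True)
    params[PAGER + 'CurrPage'] = [str(page)]
    # As in the script above, cPage is sent twice with the previous page number.
    params[PAGER + 'cPage'] = [str(page - 1)] * 2
    return params

params = params_for_page(SAMPLE_BODY, 3)
# doseq=True emits one key=value pair per list element.
encoded = urlencode(params, doseq=True)
print(encoded)
```

Note that urlencode escapes values with quote_plus rather than quote, so spaces become + instead of %20. If the result is passed to urllib.request, it has to be converted to bytes first, for example with encoded.encode('ascii').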

3 Comments

Thanks for the help. I don't know any Javascript, so can I use some POST capability provided by mechanize or do I need to do something in Javascript?
By the way I have tried previously using wget and curl with POST data for those variables, and got basically the same behavior: always just getting the first page of results, no matter what CurrPage number I passed. It's perplexing.
You can submit a POST from mechanize. The site you're trying to scrape is doing some crazy stuff, so it may be a bit involved to work out exactly which variables need to be submitted. I'll update my answer with an example of doing a POST with mechanize, in case that helps.
