
I need to scrape some data from a page where I fill out a form (I already did this with mechanize). The problem is that the page returns the data spread across many pages, and I'm having trouble getting the data from those pages.

There's no problem getting the data from the first results page, since it is displayed right after the search: I simply submit the form and get the response.

I analyzed the source code of the results page and it seems to use JavaScript and RichFaces (a library for JSF with AJAX, but I could be wrong, since I am not a web expert).

However, I managed to figure out how to get to the remaining result pages. I need to click links of this form (href="javascript:void(0);"; the full code is below):

<td class="pageNumber"><span class="rf-ds " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"><span class="rf-ds-nmb-btn rf-ds-act " id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1">1</span><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2">2</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3">3</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4">4</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5">5</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6">6</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7">7</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8">8</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9">9</a><a class="rf-ds-nmb-btn " href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10">10</a><a class="rf-ds-btn rf-ds-btn-next" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next">»</a><a class="rf-ds-btn rf-ds-btn-last" href="javascript:void(0);" id="SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l">»»»»</a>

<script type="text/javascript">new RichFaces.ui.DataScroller("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",function(event,element,data){RichFaces.ajax("SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233",event,{"parameters":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233:page":data.page} ,"incId":"1"} )},{"digitals":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_9":"9","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_8":"8","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_7":"7","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_6":"6","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_5":"5","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_4":"4","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_3":"3","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_1":"1","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_10":"10","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_2":"2"} ,"buttons":{"right":{"SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_next":"next","SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233_ds_l":"last"} } ,"currentPage":1} )</script></span></td>
<td class="pageExport"><script type="text/javascript" src="/opi/javax.faces.resource/download.js?ln=js/component&amp;b="></script><script type="text/javascript">

So I would like to ask whether there's a way to click all the links and get all the pages using mechanize (note that after the » symbol there are more pages available). I'd appreciate answers aimed at a total dummy in web matters :)
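For reference, from the inline script above it looks like clicking a pager link fires a JSF/RichFaces AJAX POST back to the form. This is a rough, hypothetical sketch of the parameters such a request would carry; the exact field names, and especially the `javax.faces.ViewState` token, would have to be read from the live page:

```python
# Hypothetical sketch of the form fields a RichFaces DataScroller click posts.
# The ViewState token is a per-session value that must be scraped from the
# rendered page; "PLACEHOLDER" below is not a real token.
scroller = "SomeSimpleForm:SomeSimpleTable:j_idt211:j_idt233"

def build_pager_payload(page, view_state):
    """Build the (assumed) form fields for requesting result page `page`."""
    return {
        "javax.faces.partial.ajax": "true",   # mark this as a JSF partial (AJAX) request
        "javax.faces.source": scroller,       # component that triggered the request
        "%s:page" % scroller: str(page),      # the parameter seen in the inline script
        "javax.faces.ViewState": view_state,  # server-side view token from the page
    }

payload = build_pager_payload(2, "PLACEHOLDER")
```

If replaying these POSTs with mechanize turns out to be too fragile, driving a real browser (as in the answers below) sidesteps the problem entirely.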

2 Answers


First of all, I would still stick to Selenium, since this is quite a JavaScript-heavy website. Note that you can use a headless browser (PhantomJS, or a regular browser on a virtual display) if needed.

The idea here is to paginate by 100 rows per page and click the "»" link until it is no longer present on the page, which means we've hit the last page and there are no more results to process. To make the solution reliable we use explicit waits: every time we proceed to the next page, we wait for the loading spinner to become invisible.

Working implementation:

# -*- coding: utf-8 -*-
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.maximize_window()

driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie?execution=e1s1')
wait = WebDriverWait(driver, 30)

# paginate by 100
select = Select(driver.find_element_by_id("drhPageForm:drhPageTable:j_idt211:j_idt214:j_idt220"))
select.select_by_visible_text("100")

while True:
    # wait until there is no loading spinner
    wait.until(EC.invisibility_of_element_located((By.ID, "loadingPopup_content_scroller")))

    # .text returns a string, so format it with %s, not %d
    current_page = driver.find_element_by_class_name("rf-ds-act").text
    print("Current page: %s" % current_page)

    # TODO: collect the results

    # proceed to the next page
    try:
        next_page = driver.find_element_by_link_text(u"»")
        next_page.click()
    except NoSuchElementException:
        break
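To fill in the `# TODO: collect the results` step, one option is to hand the rendered page to BeautifulSoup. This is only a sketch: it assumes the results sit in ordinary `<tr>`/`<td>` markup, and the real table's tags and classes should be checked against the live page first.

```python
from bs4 import BeautifulSoup

def extract_rows(html):
    """Return the cell text of every data row in `html` (a sketch).

    Rows that contain no <td> cells (e.g. header rows) are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows
```

Inside the `while` loop you would then do something like `results.extend(extract_rows(driver.page_source))`.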

2 Comments

It seems your solution is better. I opened a new bounty to thank you for your answer :)
@yak wow, thanks so much for it. Glad the answer helped to solve the problem.

This works for me; it seems all the HTML is available on the page:

import time    
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://polon.nauka.gov.pl/opi/aa/drh/zestawienie')

next_id = 'drhPageForm:drhPageTable:j_idt211:j_idt233_ds_next'

pages = []
it = 0
while it < 1795:  # hard-coded number of result pages for this search
    time.sleep(1)  # be friendly to the server; increase if needed
    it += 1
    bad = True
    while bad:
        try:
            driver.find_element_by_id(next_id).click()
            bad = False
        except Exception:  # the link is not clickable yet - retry
            print('retry')

    page = driver.page_source

    pages.append(page)

Instead of first collecting and storing all html, you could also just query what you want, but you'll need lxml or BeautifulSoup for that.
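For example, the stored pages could be parsed afterwards with lxml. Again just a sketch, assuming the results sit in plain `<td>` cells; the real markup should be checked first:

```python
from lxml import html as lxml_html

def cells_from_page(page_source):
    """Return the stripped text of every <td> cell on one stored page (a sketch)."""
    tree = lxml_html.fromstring(page_source)
    return [td.text_content().strip() for td in tree.xpath("//td")]

# e.g. all_cells = [cells_from_page(p) for p in pages]
```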

EDIT: After running it, I did indeed notice an error; it was simple to just catch the exception and retry.

5 Comments

Thank you so much for the help :) I will try it in a while. Yeah, I agree, but BeautifulSoup is not a problem; I used it before, so I think I can handle it. However, I had problems with the send_keys method, because after I automatically clicked the Search (Wyszukaj) button (from the source code), the page cleared the criteria. Meh, who cares; if your approach works, I will simply use BS4 for parsing.
Oh, I just noticed you're THE GUY from yagmail - I used your tool, and I just wanted to thank you for it, it's awesome!
Good luck! Pretty sure it will work :) Indeed, it's weird what exactly the page does, but simply retrying the element works... Also, if you want to be friendly to the page and be patient, feel free to add more delay.
@yak Hah, so cool to be called "THE GUY"; you're very welcome!
Partly. I'm using your solution, but it seems it somehow 'repeats' pages and downloads some of them twice. However, I don't think it's a huge problem; I can handle it later, when parsing. Cheers :)
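For the duplicated pages mentioned in the last comment, a coarse guard is to drop repeats before parsing. This sketch only catches byte-identical page sources; pages that differ in timestamps or tokens would need a smarter key, such as the page number shown by the pager:

```python
def dedupe_keep_order(pages):
    """Drop exact duplicate page sources while keeping the original order."""
    seen = set()
    unique = []
    for page in pages:
        if page not in seen:
            seen.add(page)
            unique.append(page)
    return unique
```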
