
I'm new to scraping. I'm trying to scrape the price shown next to the Buy Now button on this site.
Here is what I've tried:

import sys

from bs4 import BeautifulSoup
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Client(QWebPage):
    def __init__(self):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        # self.loadFinished.connect(self.on_page_load)
        # self.mainFrame().load(QUrl(url))
        # self.app.exec_()
    def on_page_load(self):
        self.app.quit()
    def mypage(self, url):
        self.loadFinished.connect(self.on_page_load)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()
client_response = Client()
def parse(url):                # OSRS + RS3
    client_response.mypage(url)
    source = client_response.mainFrame().toHtml()
    soup = BeautifulSoup(source, 'html.parser')
    # The "min" attribute of the quantity input is the smallest purchasable quantity.
    osrs_text = soup.findAll('input', attrs={'type': 'number'})
    quantity = osrs_text[0]['min']
    # Displayed price, rounded to three decimals.
    price = round(float(soup.findAll('span', attrs={'id': 'goldprice'})[0].text), 3)
    if quantity == '1':
        # OSRS: the displayed price is already the price of one unit.
        print(price)
    else:
        # RS3: divide the displayed price by the quantity to get the unit price.
        print(price / int(quantity))

This approach is too slow. I also tried Selenium, but that isn't what I need at the moment.
Can you suggest a better way to scrape the value? Here is what I need. Any help would be highly appreciated. Thanks.



P.S.: I tried this library because the content is dynamically generated.

  • For a new contributor this is a good question, +1. Remember to use the snippet tool (via edit) to insert HTML. Optimization questions may also be candidates for the Code Review site, though be sure to read their guidance before posting. Commented Mar 26, 2019 at 7:42

1 Answer


I am not sure how much of a performance difference you will see, but you can try this solution.

import requests
from bs4 import BeautifulSoup

baseUrl = 'https://www.rsmalls.com/osrs-gold'
postUrl = 'https://www.rsmalls.com/index.php?route=common/quickbuy/rsdetail'

with requests.Session() as session:
    res = session.get(baseUrl)
    soup = BeautifulSoup(res.text, 'lxml')
    game_id = soup.select_one("#choose-game > option[selected]")['value']
    response = session.post(postUrl, data={'game_id': game_id}).json()
    print(f"Gold Price: {response['price']}")

In this code, I first fetch the id of "Runescape 2007" from the page, in case the website owner changes it. If you are sure it will not change, you can skip that step and pass the value '345' directly as game_id in the POST request.

As you mentioned, the price is loaded by JavaScript. Using the browser dev tools, I found the actual POST request that fetches the price; it requires the id selected in the dropdown. The POST request to https://www.rsmalls.com/index.php?route=common/quickbuy/rsdetail returns a JSON response like:

{"success":true,"product_id":"30730","price":0.85,"server_id":"1661","server_option":"463","quantity":"1|5|10|20|50|100|200|300|500|1000|1500|2000","name":"M"}

So I parse the response as JSON and read the price from it.
Let me know if you have any questions.
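
To make the parsing step concrete, here is a minimal sketch that works on the sample response above without touching the network. The helper name parse_price_response is hypothetical, and treating the "quantity" field as a pipe-separated list of purchasable quantities is my assumption from its shape, not something the site documents.

```python
import json

# Sample response body copied from the answer above.
raw = (
    '{"success":true,"product_id":"30730","price":0.85,"server_id":"1661",'
    '"server_option":"463","quantity":"1|5|10|20|50|100|200|300|500|1000|1500|2000",'
    '"name":"M"}'
)


def parse_price_response(body):
    """Return (price, quantities) from the JSON body (hypothetical helper)."""
    data = json.loads(body)
    # Fail loudly if the endpoint did not report success.
    if not data.get("success"):
        raise ValueError("endpoint did not report success")
    # Assumption: "quantity" is a pipe-separated list of integer options.
    quantities = [int(q) for q in data["quantity"].split("|")]
    return data["price"], quantities


price, quantities = parse_price_response(raw)
print(price, quantities[:3])  # prints: 0.85 [1, 5, 10]
```

With requests, the body would come from response.text, or you can skip json.loads and call .json() on the response as in the code above.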

EDIT:

A different POST request is made on https://rsmalls.com/runescape3-gold, so the same solution doesn't work there. The POST request can differ for each page, website, or piece of data. You can find such requests yourself using the browser devtools, as shown here. On the right, where you can see that a POST request is made to a URL, the data sent with the request is listed at the bottom. Also note that the response to this request always contains the price of 1 unit, so it may not match the price shown on the website if the default number of units there is more than 1 (like 5 in the screenshot below).

enter image description here
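
To illustrate the unit-price mismatch, here is a rough sketch of converting between the per-unit price from the endpoint and the price shown for a larger default quantity. Both function names are hypothetical, and it assumes the displayed total scales linearly with quantity (no bulk discount), which I have not verified for this site.

```python
def displayed_total(unit_price, quantity):
    """Total for `quantity` units, assuming linear pricing (an assumption)."""
    return round(unit_price * quantity, 2)


def unit_price_from_total(total, quantity):
    """Inverse direction, like the price / int(quantity) step in the question."""
    return round(total / quantity, 3)


# Example with the sample unit price 0.85 and a default quantity of 5.
total = displayed_total(0.85, 5)
unit = unit_price_from_total(total, 5)
print(total, unit)
```

If the number on the page differs from unit price times quantity, the site is probably applying its own pricing rule, and the page value should be treated as the reference.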


3 Comments

Additional side note. You wouldn’t even need to import json, as requests already has that function built in. So you could combine those 2 lines to: 'json_res = sea.post(...).json()'
But it's not working for this site: rsmalls.com/runescape3-gold.
And thanks @chitown88 and SIM for improving my code.
