
I have been trying to extract transaction records from this website: https://www.house730.com/en-us/deal/?type=rent.

Searching Stack Overflow, I came across a solution that uses urllib.request + selenium.webdriver to download and render a webpage.

Something like this, in load_data.py:

from pathlib import Path
from urllib.request import urlopen

from selenium import webdriver

url = "https://www.house730.com/en-us/deal/?type=rent"
file_path = Path("tmp").resolve()

# Download the raw HTML as the server sends it (no JavaScript runs here).
with urlopen(url) as conn:
    data = conn.read()

file_path.write_bytes(data)

# Open the saved copy in Firefox so its scripts run, then read the resulting DOM.
browser = webdriver.Firefox()
try:
    browser.get(file_path.as_uri())
    html = browser.page_source
finally:
    # Close the browser even if loading the page fails.
    browser.quit()

print(html)

However, when I run

python load_data.py > tmp.html

and open tmp.html, the page seems to crash.

This also happens with wget.

wget "https://www.house730.com/en-us/deal/?type=rent" -O index.html

but the two approaches produce different HTML. Why?

Result from load_data.py:
https://gist.github.com/pond-nj/5fd51f81441463996ed20a8003981742#file-load_data_tmp-html

Result from wget:
https://gist.github.com/pond-nj/5fd51f81441463996ed20a8003981742#file-wget_index-html

It seems the wget output is already more processed than the load_data.py output, because the wget result contains a bunch of formatted records (<div class="deal-data"/>). Why is this?

Also, this might be hard to answer, but what could cause the page to crash when the saved HTML is opened?
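
To make that comparison less of an eyeball judgement, a rough sketch like the one below could count the elements carrying the deal-data class in each saved file. It uses only the standard library; the names ClassCounter and count_class are made up for illustration and are not part of any existing script.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html_text, class_name="deal-data"):
    """Return how many elements in html_text carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count
```

Reading tmp.html and index.html as text and passing each to count_class would show how many record elements each file contains.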

1 Answer


uses urllib.request + selenium.webdriver to download and render a webpage.

The latter (rendering) means executing any JavaScript code on the page, which may modify the document. GNU wget does not execute JavaScript, so the HTML it saves shows none of those modifications. You might try disabling JavaScript in Firefox, as shown in "How can I disable javascript in firefox with selenium?", and then compare the results. If they still differ, the server may be returning a different response depending on the User-Agent header.
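
Both checks could be sketched roughly as follows. This is not a tested recipe for this particular site: it assumes Selenium 4 with Firefox and geckodriver installed, and that the javascript.enabled preference is honoured by your Firefox version. The helper names and the User-Agent strings in the comments are placeholders.

```python
from urllib.request import Request, urlopen

URL = "https://www.house730.com/en-us/deal/?type=rent"


def build_request(url, user_agent):
    """Create a request that announces itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent):
    """Download the raw response body; no JavaScript is executed."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read()


def render_without_javascript(url):
    """Load a page in Firefox with JavaScript turned off and return its HTML."""
    # Imported lazily so the urllib helpers work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    # Assumption: this about:config preference disables script execution.
    options.set_preference("javascript.enabled", False)
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()


# Example (needs network access; the User-Agent values are placeholders):
# as_wget = fetch(URL, "Wget/1.21")
# as_browser = fetch(URL, "Mozilla/5.0")
# print(len(as_wget), len(as_browser), as_wget == as_browser)
```

If the two fetch results differ, the server is varying its response by User-Agent. If render_without_javascript returns roughly what wget saved, the remaining difference comes from script execution.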
