
The problem is that a certain part of a website cannot be crawled directly with Scrapy, so I need Selenium to render the page and give me access to that content.

I tried this:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
print(driver.page_source)

I did not find that content in the resulting page source, even though I can get it through driver.find_element_by_css_selector().

Why does this happen, and how can I use Selenium together with Scrapy to crawl such a site? One example is this: http://tieba.baidu.com/p/5513911529

The part I have difficulties with is in the picture below, within the red circle; I need the text content inside it.

Thanks for your help, or at least point me to some documentation to read.
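To illustrate what I mean by parsing the rendered source: once Selenium has rendered the page, driver.page_source is just an HTML string and can be handed to any parser. Here is a minimal stdlib sketch (the sample HTML and the SpanText class are invented purely for illustration, not taken from the real page):

```python
from html.parser import HTMLParser

# Tiny stdlib parser that collects the text inside <span> tags,
# standing in for the comment text I am after. The sample HTML
# below is made up for illustration.
class SpanText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span and data.strip():
            self.texts.append(data.strip())

rendered = '<ul><li><div><span>reply text</span></div></li></ul>'
parser = SpanText()
parser.feed(rendered)
print(parser.texts)  # ['reply text']
```

In practice you would feed it driver.page_source instead of the hard-coded string.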


  • Please read why a screenshot of HTML or code or error is a bad idea. Consider updating the Question with formatted text based HTML and code trials. Commented Jan 21, 2018 at 13:39
  • The part I have difficulties... Can you elaborate about those difficulties? Commented Jan 21, 2018 at 13:42
  • @Andersson I cannot get that content directly using Scrapy CSS selectors; that part of the content does not exist from Scrapy's point of view, maybe because it is a dynamic page? Commented Jan 21, 2018 at 13:48
  • No. This content seems to be static. Can you share your CSS selector? Commented Jan 21, 2018 at 13:57
  • @Andersson Yes, here it is: '#j_p_postlist > div:nth-child(16) > div.d_post_content_main > div.core_reply.j_lzl_wrapper > div.j_lzl_container.core_reply_wrapper > div.j_lzl_c_b_a.core_reply_content > ul > li:nth-child(4) > div > span'. I am able to use the exact same selector with selenium.webdriver.find_element_by_css_selector() method to get the texts, but not with scrapy's response.css() method. Commented Jan 21, 2018 at 14:42

1 Answer


The content is displayed only after the user scrolls down, so you have to use the JS executor to scroll. See my code below.

import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://tieba.baidu.com/p/5513911529')

SCROLL_PAUSE_TIME = 0.5
SCROLL_LENGTH = 200

page_height = int(driver.execute_script("return document.body.scrollHeight"))
scrollPosition = 0
while scrollPosition < page_height:
    # Scroll down step by step so the lazy-loaded content is triggered
    scrollPosition = scrollPosition + SCROLL_LENGTH
    driver.execute_script("window.scrollTo(0, " + str(scrollPosition) + ");")
    time.sleep(SCROLL_PAUSE_TIME)

time.sleep(5)  # give the last batch of content time to render
print(driver.page_source)
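The scrolling loop can be factored into a plain helper to show which offsets get visited; a sketch (the function name is mine):

```python
def scroll_positions(page_height, step):
    """Yield the successive scrollTo offsets the loop above visits.

    Mirrors the while-loop: it overshoots page_height by at most one
    step, which is harmless since the browser clamps the scroll.
    """
    position = 0
    while position < page_height:
        position += step
        yield position

print(list(scroll_positions(1000, 300)))  # [300, 600, 900, 1200]
```

With step=600 the same page takes two iterations instead of four, which is why increasing SCROLL_LENGTH speeds up the crawl.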

4 Comments

Thank you! But I could only find the page source for the content at the bottom; content in the middle section of the page is not there. Does that mean I need to scroll down bit by bit?
Thanks, I think it solved my problem. But I still have two questions: 1. Since it needs to scroll down bit by bit, it appears to be a bit slow; I need to build a crawler, so is it possible to make it faster? 2. Why do I need to scroll down gradually to get the content? Which web technique is this related to? I am a noob in this field, so a little explanation or a pointer to somewhere to read would be a great help. Thanks again!
@YoarkYANG 1. You just need to increase the scroll length (e.g. SCROLL_LENGTH = 600). 2. The web app checks the current scroll position; when it reaches the comment component, it displays the comments. This improves the performance of the page. By the way, could you accept this answer so we can close this question?
Thanks a lot! It helps!
