
The problem is that a certain part of a website cannot be crawled directly with Scrapy, so I need Selenium to render the page and give me access to that content.

I tried this:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
print(driver.page_source)

I did not find that content in the resulting page source, even though I can get it through driver.find_element_by_css_selector().

Why does this happen, and how can I use Selenium together with Scrapy to crawl such a site? One example is this: http://tieba.baidu.com/p/5513911529

The part I have difficulties with is in the picture below, within the red circle; I need the text content inside it.

Thanks for your help, or at least point me to some documentation to read.
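To illustrate what I mean by parsing the rendered source: once Selenium has rendered the page, driver.page_source is just an HTML string and can be handed to any parser. Here is a minimal stdlib sketch (the sample HTML and the SpanText class are invented purely for illustration, not taken from the real page):

```python
from html.parser import HTMLParser

# Tiny stdlib parser that collects the text inside <span> tags,
# standing in for the comment text I am after. The sample HTML
# below is made up for illustration.
class SpanText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_span = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span and data.strip():
            self.texts.append(data.strip())

rendered = '<ul><li><div><span>reply text</span></div></li></ul>'
parser = SpanText()
parser.feed(rendered)
print(parser.texts)  # ['reply text']
```

In practice you would feed it driver.page_source instead of the hard-coded string.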


  • Please read why a screenshot of HTML or code or error is a bad idea. Consider updating the Question with formatted text based HTML and code trials. Commented Jan 21, 2018 at 13:39
  • The part I have difficulties... Can you elaborate about those difficulties? Commented Jan 21, 2018 at 13:42
  • @Andersson I cannot get that content directly using Scrapy CSS selectors; that part of the content does not exist from Scrapy's point of view, maybe because it is a dynamic page? Commented Jan 21, 2018 at 13:48
  • No. This content seems to be static. Can you share your CSS selector? Commented Jan 21, 2018 at 13:57
  • @Andersson Yes, here it is: '#j_p_postlist > div:nth-child(16) > div.d_post_content_main > div.core_reply.j_lzl_wrapper > div.j_lzl_container.core_reply_wrapper > div.j_lzl_c_b_a.core_reply_content > ul > li:nth-child(4) > div > span'. I am able to use the exact same selector with selenium.webdriver.find_element_by_css_selector() method to get the texts, but not with scrapy's response.css() method. Commented Jan 21, 2018 at 14:42

1 Answer


The content is displayed only after the user scrolls down, so you have to use the JS executor to scroll. See my code below.

import time

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://tieba.baidu.com/p/5513911529')

SCROLL_PAUSE_TIME = 0.5
SCROLL_LENGTH = 200

page_height = int(driver.execute_script("return document.body.scrollHeight"))
scrollPosition = 0
while scrollPosition < page_height:
    # Scroll down step by step so the lazy-loaded content is triggered
    scrollPosition = scrollPosition + SCROLL_LENGTH
    driver.execute_script("window.scrollTo(0, " + str(scrollPosition) + ");")
    time.sleep(SCROLL_PAUSE_TIME)

time.sleep(5)  # give the last batch of content time to render
print(driver.page_source)
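The scrolling loop can be factored into a plain helper to show which offsets get visited; a sketch (the function name is mine):

```python
def scroll_positions(page_height, step):
    """Yield the successive scrollTo offsets the loop above visits.

    Mirrors the while-loop: it overshoots page_height by at most one
    step, which is harmless since the browser clamps the scroll.
    """
    position = 0
    while position < page_height:
        position += step
        yield position

print(list(scroll_positions(1000, 300)))  # [300, 600, 900, 1200]
```

With step=600 the same page takes two iterations instead of four, which is why increasing SCROLL_LENGTH speeds up the crawl.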

4 Comments

Thank you! But I could only find the page source for the content at the bottom; content in the middle section of the page is not there. Does that mean I need to scroll down bit by bit?
Thanks, I think it solved my problem. But I still have two questions: 1. Since it needs to scroll down bit by bit, it appears to be a bit slow; I need to build a crawler, so is it possible to make it faster? 2. Why do I need to scroll down gradually to get the content? Which web technique is this related to? I am a noob in this field, so a little explanation or a pointer to somewhere to read would be a great help. Thanks again!
@YoarkYANG 1. You just need to increase the scroll length (e.g. SCROLL_LENGTH = 600). 2. The web app checks the current scroll position; when it reaches the comment component, it displays the comments. This improves the performance of the page. By the way, could you accept this answer so we can close this question?
Thanks a lot! It helps!
