I'm trying to scrape dynamic content from a blog with Selenium, but it always returns unrendered JavaScript.

To test this behavior, I waited until the iframe had loaded completely and printed its content, which prints fine. But when I switch back to the parent frame, it just displays unrendered JavaScript again.

I'm looking for a way to print the completely rendered HTML content.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions

driver = webdriver.Chrome("path to chrome driver")   
driver.get('http://justgivemechocolateandnobodygetshurt.blogspot.com/')

WebDriverWait(driver, 40).until(expected_conditions.frame_to_be_available_and_switch_to_it((By.ID, "navbar-iframe")))

# Rendered iframe HTML is printed.
content = driver.page_source
print content.encode("utf-8")

# When I switch back to parent frame it again prints non rendered JavaScript.
driver.switch_to.parent_frame()
content = driver.page_source
print content.encode("utf-8")
6 Comments
  • because .page_source returns the source, not the DOM Commented Apr 21, 2016 at 20:03
  • @Fabricator How can I get the updated DOM? Commented Apr 21, 2016 at 20:23
  • @UmarIqbal, Have you tried selecting the element using one of the find_element methods? Commented Apr 21, 2016 at 20:29
  • I think any of the find_element* commands should do Commented Apr 21, 2016 at 20:29
  • or you can just execute javascript code Commented Apr 21, 2016 at 20:30

1 Answer

The problem is that .page_source works only in the current context. Note the "current top-level browsing context" wording. This means that if you call it on the default content, you will not get the inner HTML of the child iframe elements. For that, you have to switch into the context of a frame and call .page_source there.

In other words, to get the complete HTML of the page, including the page source of the iframes, you have to switch into the iframe contexts one by one and get the sources separately.
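
A minimal sketch of that frame-by-frame approach, assuming `driver` is an already-started WebDriver with the page loaded. The helper name `collect_frame_sources` is hypothetical, and it only visits top-level iframes (nested ones would need the same treatment recursively):

```python
def collect_frame_sources(driver):
    """Return [top-level source, iframe 1 source, iframe 2 source, ...]."""
    # Start from the top-level document so the results do not depend
    # on whichever frame the driver happened to be in.
    driver.switch_to.default_content()
    sources = [driver.page_source]

    # "tag name" is the string value of Selenium's By.TAG_NAME strategy.
    frames = driver.find_elements("tag name", "iframe")
    for frame in frames:
        # page_source reflects the current context, so switch in first.
        driver.switch_to.frame(frame)
        sources.append(driver.page_source)
        # Return to the top level before moving on to the next iframe.
        driver.switch_to.default_content()

    return sources
```

With the question's setup, this could be called after the wait, for example `for html in collect_frame_sources(driver): print(html)`.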

Old answer:

I would wait for at least one blog entry to be loaded before getting the page_source:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 40)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".entry-content")))

print(driver.page_source)

8 Comments

Doesn't matter, still returns the old DOM.
@UmarIqbal okay, what do you mean by the old DOM? And what is your desired output?
By old DOM I meant unrendered JavaScript. All I want is the completely rendered HTML content.
@UmarIqbal thanks, could you be more specific and point to an element you don't want to see in the page source? Note that even if I go to the website, wait for it to load and inspect the page source, I would still see the script tags with JavaScript there.
Can you try running my code? The first print statement prints the dynamically loaded iframe. After that, the second print statement prints the complete page source. It's supposed to print the complete DOM along with that iframe, but it doesn't.