0

I am looking to extract generated content from a web page.

I am using the library requests in python 3 to return the page as below

 import requests 
 url = "https://app.updateimpact.com/treeof/org.json4s/json4s- 
  native_2.11/3.5.2"

 html_doc = requests.get(url)
 print(html_doc.text)

The retrieve text seems to be just padding though. What tools should I be looking at to drill into the content and extract the info there ?

2 Answers 2

1

Javascript needs to run on the page to provide much of the content. Using a method like selenium will allow this to run. Note that an additional wait condition is needed to ensure certain content is loaded. You can then use selenium syntax to extract info or dump the html from page_source into BeautifulSoup.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

d = webdriver.Chrome()
d.get('https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2')
dependencies = WebDriverWait(d, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR , '.stats-list')))
print(dependencies)
soup = bs(d.page_source, 'lxml')
print(soup.select_one('#tree').text) # example
Sign up to request clarification or add additional context in comments.

7 Comments

Interesting. There must be some sort of cookie that contains a timer because the url works for awhile
I just noticed that there is an error in your URL , there should be no gap between json4s- || native 'app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2'
That was careless of me! Thanks. Lemme have another look.
that looks to be on point. Although I am having issues running it as the path to my Chrome driver is throwing errors. I am pointing to its location in the python site-package but will need to google it more to see why it no works. Will mark solution as answer as soon as I can confirm it. :)
|
0

If the content is html, you could look into:

If it's json, you would use:

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.