Extracting additional Content python requests

Question

I am looking to extract generated content from a web page.

I am using the requests library in Python 3 to fetch the page, as below:

 import requests

 url = "https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2"

 html_doc = requests.get(url)
 print(html_doc.text)

The retrieved text seems to be just padding, though. What tools should I look at to drill into the content and extract the information there?

QHarr · Accepted Answer · 2019-02-18 09:47:54Z

1

JavaScript needs to run on the page to generate much of the content, and requests does not execute JavaScript. A browser-automation tool such as Selenium will let it run. Note that an additional wait condition is needed to ensure the content you want has loaded. You can then use Selenium's own syntax to extract the information, or pass the HTML from page_source to BeautifulSoup.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

d = webdriver.Chrome()
d.get('https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2')
# Wait up to 5 seconds for the JavaScript-rendered stats list to be present
dependencies = WebDriverWait(d, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.stats-list')))
print(dependencies.text)  # text of the located element
# Hand the rendered HTML to BeautifulSoup for further parsing
soup = bs(d.page_source, 'lxml')
print(soup.select_one('#tree').text)  # example
d.quit()  # close the browser when done
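
Once the rendered HTML is in hand, the BeautifulSoup step can be tried offline. The following is a minimal sketch, assuming a made-up HTML snippet stands in for d.page_source. The markup, the dependency names, and the helper names extract_stats and extract_tree_nodes are hypothetical; only the .stats-list and #tree selectors come from the code above, and the real page's structure may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for d.page_source; the real page's markup may differ.
SAMPLE_HTML = """
<html><body>
  <ul class="stats-list">
    <li>Direct dependencies: 2</li>
    <li>Transitive dependencies: 5</li>
  </ul>
  <div id="tree">
    <span class="node">example-lib-a</span>
    <span class="node">example-lib-b</span>
  </div>
</body></html>
"""


def extract_stats(html):
    """Return the text of each <li> inside the first .stats-list element."""
    # "html.parser" ships with Python; "lxml" also works if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    stats = soup.select_one(".stats-list")
    if stats is None:
        # Element missing, e.g. the JavaScript had not rendered it yet.
        return []
    return [li.get_text(strip=True) for li in stats.select("li")]


def extract_tree_nodes(html):
    """Return the text of each .node element inside #tree (assumed markup)."""
    soup = BeautifulSoup(html, "html.parser")
    tree = soup.select_one("#tree")
    if tree is None:
        return []
    return [node.get_text(strip=True) for node in tree.select(".node")]


print(extract_stats(SAMPLE_HTML))
print(extract_tree_nodes(SAMPLE_HTML))
```

Checking for None before reading from the selected element avoids an AttributeError when the selector matches nothing, which is the usual symptom of parsing the page before its content has loaded.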

edited Feb 18, 2019 at 9:47

answered Feb 16, 2019 at 18:04


7 Comments

Steve Over a year ago

Interesting. There must be some sort of cookie that contains a timer, because the URL works for a while.

Steve Over a year ago

I just noticed that there is an error in your URL: there should be no gap between "json4s-" and "native", i.e. 'app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2'

QHarr Over a year ago

That was careless of me! Thanks. Lemme have another look.

Steve Over a year ago

That looks to be on point, although I am having issues running it, as the path to my Chrome driver is throwing errors. I am pointing to its location in the Python site-packages directory but will need to google it more to see why it does not work. I will mark the solution as the answer as soon as I can confirm it. :)

QHarr Over a year ago

selenium-python.readthedocs.io/# and crummy.com/software/BeautifulSoup/bs4/doc/#


Hugo Mota · Accepted Answer · 2019-02-16 16:29:57Z

0

If the content is HTML, you could look into:

If it's JSON, you would use:

https://docs.python.org/3/library/json.html
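
For the JSON case, the standard-library json module turns the response text into Python objects. The following is a minimal sketch, assuming a made-up payload in place of a real response body; the keys, the values, and the helper name dependency_names are hypothetical and not taken from the site in the question.

```python
import json

# Hypothetical response body; with requests this would be response.text.
payload = '{"artifact": "example-lib", "dependencies": [{"name": "dep-a"}, {"name": "dep-b"}]}'


def dependency_names(text):
    """Parse a JSON object and list the 'name' of each dependency entry."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not JSON at all (for example, an HTML page came back instead).
        return None
    return [dep["name"] for dep in data.get("dependencies", [])]


print(dependency_names(payload))
print(dependency_names("<html></html>"))
```

With requests, calling response.json() performs the same parsing in one step.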

answered Feb 16, 2019 at 16:29

