1

I'm creating a script where I'm trying to rip m4a files from a website specifically. I'm using BS4 and selenium for this purpose presently.

I'm having some trouble getting the info. The file link is not located in the HTML source for the page. Instead, I can only find it in the console. The link I'm trying to get is here in this image (https://i.sstatic.net/5rUJH.jpg) labeled "audio_url_m4a:".

Here's some sample code I'm using:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities\

d = DesiredCapabilities.CHROME
d['loggingPrefs'] = {'browser':'ALL ' }
driver = webdriver.Chrome(r'chromedriver path', desired_capabilities = d)

~~lots of code doing other things not relevant to the post~~

for URL in audm_URL: #this is referencing a line of code where I construct a list of URLs
            driver.get(audm)
            time.sleep(3)

            for entry in driver.get_log('browser'):
                print(entry)

Here is the output I get:


{'level': 'SEVERE', 'message': 'https://audm.herokuapp.com/favicon.ico - Failed to load resource: the server responded with a status of 404 (Not Found)', 'source': 'network', 'timestamp': 1611291689357}
{'level': 'SEVERE', 'message': 'https://cdn.segment.com/analytics.js/v1/5DOhLj2nIgYtQeSfn9YF5gpAiPqRtWSc/analytics.min.js - Failed to load resource: net::ERR_NAME_NOT_RESOLVED', 'source': 'network', 'timestamp': 1611291689357}

Most questions relating to grabbing things from the console point me towards grabbing the logs, but nothing that seems to let me know how to grab those other variables. Any ideas?

Here's a link to a random audio page that I want to grab the file from: https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c

Thanks everyone!

1
  • could you accept and upvote the answer Commented Feb 2, 2021 at 14:33

2 Answers 2

1
driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR,"button"))).click()
src=WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".react-player video"))).get_attribute("src")



print(src)

if you just want to get src you can use above code .

you need to import

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

If you want to get it through console log then use : IT SEEMS ITS WORKING ONLY FOR HEADLESS I AM INVESTIGATING:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()

options.headless = True

capabilities = webdriver.DesiredCapabilities().CHROME.copy()

capabilities['loggingPrefs'] = {'browser': 'ALL'}
driver = webdriver.Chrome(options=options,desired_capabilities=capabilities)

driver.maximize_window()


time.sleep(3)

driver.get(
    "https://audm.herokuapp.com/player-embed?pub=newyorker&articleID=5fe0b9b09fabedf20ec1f70c")



for entry in driver.get_log('browser'):
    print(entry)

Update

in headless mode w3c is false and hence it is working ,

For non headless mode you have to use:

options.add_experimental_option('w3c', False)
Sign up to request clarification or add additional context in comments.

Comments

0

This did the trick. I was looking at it the wrong way and wasn't trying to get an src. Thanks for the input!

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.