
I am confused as to how I would scrape all the links that contain the string "mp3" from a given XML page. The following code only returns an empty list:

# Import required modules 
from lxml import html 
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 
  
# Parsing the page 
# (Pass page.content (raw bytes) rather than page.text
# so lxml can handle the feed's own encoding declaration.)
tree = html.fromstring(page.content)   
  
# Get element using XPath 
buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 
print(buyers)

Am I using @url wrong?

The links I am looking for are the .mp3 URLs that appear in the feed's <enclosure> elements.

Any help would be greatly appreciated!

2 Answers


What happens?

The following XPath won't work: the predicate [@url="mp3"] only matches elements whose url attribute is exactly "mp3", and text() selects the element's text content, which <enclosure> does not have.

//enclosure[@url="mp3"]/text()
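A feed item looks roughly like this (hypothetical example, details vary per episode); the link lives in the url attribute and the element has no text content, so text() returns nothing:

<enclosure url="https://traffic.megaphone.fm/EXAMPLE.mp3?updated=0" length="12345678" type="audio/mpeg"/>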

Solution

Select every //enclosure whose url attribute contains "mp3", then return the attribute itself with /@url.

Change this line:

buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 

to

buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url') 

Output

['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
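Putting the corrected XPath back into your original script, a minimal runnable sketch (same feed URL, only the XPath line changed):

from lxml import html
import requests

# Request the feed
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')

# Parse the raw bytes
tree = html.fromstring(page.content)

# url attribute of every <enclosure> whose url contains "mp3"
buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url')
print(buyers)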

1 Comment

Sorry about that! Was doing some testing with another page and forgot to update the link in the original code again. Your code works great, Thanks!

It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hood too).

import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 

# Parse
soup = BeautifulSoup(page.text, 'lxml')

# Find
# Original attempt assumed <a href="..."> links:
# mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - the feed actually uses <enclosure url="..."> elements
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]
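A quick way to sanity-check the result (assuming the request succeeded):

print(len(mp3))   # number of matching enclosures
print(mp3[:3])    # first few mp3 URLs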

4 Comments

Note that my answer has a bug in it because I do not know the structure of the XML. I simply assumed that all the links you want are inside an <a> tag and that when you say mp3 you mean the href attribute.
Fixed the question to include what I was looking for in more detail.
Also, I had tried to get BeautifulSoup to work but I couldn't. I felt like I was doing everything right but I would never get a table of information. Might be that I'm using VSCode? In the answer above I was able to get the code to work, but I am also very interested to learn BeautifulSoup as it looks very powerful! Thank you for the time you put into my question!
Updated it with the correct tag and attribute as per your update. Likely you just need to read through the docs and play around with it interactively.
