
I am confused as to how I would scrape all the links that contain the string "mp3" from a given XML page. The following code only returns an empty list:

# Import required modules 
from lxml import html 
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 
  
# Parsing the page 
# (Pass page.content (raw bytes) rather than page.text
# so lxml can handle the feed's own encoding declaration.)
tree = html.fromstring(page.content)   
  
# Get element using XPath 
buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 
print(buyers)

Am I using @url wrong?

The links I am looking for are the .mp3 URLs that appear in the feed's <enclosure> elements.

Any help would be greatly appreciated!

2 Answers


What happens?

The following XPath won't work: the predicate [@url="mp3"] only matches elements whose url attribute is exactly "mp3", and text() selects the element's text content, which <enclosure> does not have.

//enclosure[@url="mp3"]/text()
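A feed item looks roughly like this (hypothetical example, details vary per episode); the link lives in the url attribute and the element has no text content, so text() returns nothing:

<enclosure url="https://traffic.megaphone.fm/EXAMPLE.mp3?updated=0" length="12345678" type="audio/mpeg"/>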

Solution

Select every //enclosure whose url attribute contains "mp3", then return the attribute itself with /@url.

Change this line:

buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 

to

buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url') 

Output

['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
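Putting the corrected XPath back into your original script, a minimal runnable sketch (same feed URL, only the XPath line changed):

from lxml import html
import requests

# Request the feed
page = requests.get('https://feeds.megaphone.fm/darknetdiaries')

# Parse the raw bytes
tree = html.fromstring(page.content)

# url attribute of every <enclosure> whose url contains "mp3"
buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url')
print(buyers)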

1 Comment

Sorry about that! Was doing some testing with another page and forgot to update the link in the original code again. Your code works great, Thanks!

It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hood too).

import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 

# Parse
soup = BeautifulSoup(page.text, 'lxml')

# Find
# Original attempt assumed <a href="..."> links:
# mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - the feed actually uses <enclosure url="..."> elements
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]
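A quick way to sanity-check the result (assuming the request succeeded):

print(len(mp3))   # number of matching enclosures
print(mp3[:3])    # first few mp3 URLs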

4 Comments

Note that my answer has a bug in it because I do not know the structure of the XML. I simply assumed that all the links you want are inside an <a> tag and that when you say mp3 you mean the href attribute.
Fixed the question to include what I was looking for in more detail.
Also, I had tried to get BeautifulSoup to work but I couldn't. I felt like I was doing everything right but I would never get a table of information. Might be that I'm using VSCode? In the answer above I was able to get the code to work, but I am also very interested to learn BeautifulSoup as it looks very powerful! Thank you for the time you put into my question!
Updated it with the correct tag and attribute as per your update. Likely you just need to read through the docs and play around with it interactively.
