Extracting list of urls from url using BeautifulSoup

Question

I would like to extract information about website similarity from this link:

https://www.alexa.com/siteinfo/amazon.com

I am looking at class='site', trying to extract information from

<a href="/siteinfo/ebay.com" class="truncation">ebay.com</a>

but I can see only one value. Is it possible to extract all four values and their overlap scores?

What I am trying to achieve is a table which includes this information

W                      amazon.com              
eBay.com                   70.1
pinterest.com              54.7
wikipedia.org              51.3
facebook.com               50.4

I have tried

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
print([item.get_text(strip=True) for item in soup.select("span.site")])

but this does not seem to be enough to get the information, probably because of some wrong parameters in the code.

It appears you want span.truncation, a.truncation, or div.site
– OneCricketeer, Commented Jan 21, 2021 at 2:13
Thank you for your comment, OneCricketeer. With the inspect tool in Google Chrome I can see only the span for the overlap score and the site. I cannot see the tags you mentioned
– Math, Commented Jan 21, 2021 at 2:20
This page uses JavaScript to add elements, but BeautifulSoup and requests can't run JavaScript. You may need Selenium to control a real web browser, which can run JavaScript
– furas, Commented Jan 21, 2021 at 2:31
That's not true @furas. While it does use JS for some features, the table the OP refers to is loaded normally and can be detected without the need for a headless browser
– Akilan Manivannan, Commented Jan 21, 2021 at 2:33
a.truncation is the element that you've shown in the question. And the scores look like <span class="truncation">38.0</span>, so span.truncation. As for the site classes, those are only on div elements
– OneCricketeer, Commented Jan 21, 2021 at 4:17

Akilan Manivannan · Accepted Answer · 2021-01-21 02:32:51Z

Your CSS selectors were a good start but were too narrow. The CSS selectors that you should have used are:

Websites: #card_mini_audience .site>a
Scores: #card_mini_audience .overlap>.truncation

These selectors narrow the focus to the div where the table is stored and then make use of the class labels to extract your desired information.
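To see how the two selectors behave without fetching the live page, here is a minimal sketch that runs them against a small inline HTML snippet. The markup is a hypothetical, simplified stand-in for the real page: the element nesting and the stray `div.site` outside the card are assumptions for illustration, based only on the tags quoted in the question and comments.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup (an assumption, not the real page source).
html = """
<div id="card_mini_audience">
  <div class="site"><a href="/siteinfo/ebay.com" class="truncation">ebay.com</a></div>
  <div class="overlap"><span class="truncation">70.1</span></div>
  <div class="site"><a href="/siteinfo/pinterest.com" class="truncation">pinterest.com</a></div>
  <div class="overlap"><span class="truncation">54.7</span></div>
</div>
<div class="site"><a href="/siteinfo/elsewhere.com">elsewhere.com</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# The id prefix limits matches to the card, so the stray div.site is ignored.
sites = [a.get_text(strip=True)
         for a in soup.select("#card_mini_audience .site > a")]
scores = [float(s.get_text(strip=True))
          for s in soup.select("#card_mini_audience .overlap > .truncation")]

# The selector from the question looks for <span class="site">,
# which does not exist in this snippet.
span_sites = soup.select("span.site")

print(sites, scores, span_sites)
```

In this snippet the `site` class sits on `div` elements, which is why `span.site` finds nothing while `.site > a` reaches the links.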

I have attached some example code below that solves your issue. I just printed the results to the screen but it can easily be changed to do whatever you want with the values.

from bs4 import BeautifulSoup
import requests

# Fetch the page and parse it
url = "https://www.alexa.com/siteinfo/amazon.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Use CSS selectors to grab the content
websites = soup.select("#card_mini_audience .site>a")  # the websites in the table
scores = soup.select("#card_mini_audience .overlap>.truncation")  # the corresponding scores

# Go through each list and extract just the text
websites = [website.text.strip() for website in websites]
scores = [float(score.text.strip()) for score in scores]  # convert the scores to floats

# Ordinary print to screen. You can change this to add to a dataframe or whatever else you want for your project
for pair in zip(websites, scores):
    print(pair)

The output looks like this:

('ebay.com', 70.1)
('pinterest.com', 54.7)
('wikipedia.org', 51.3)
('facebook.com', 50.4)
('reddit.com', 49.6)
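If you want the result laid out like the table in the question rather than as tuples, one possible approach is to format the two lists with plain string padding. This is only a sketch: `format_overlap_table` is a hypothetical helper name, the "Website" header label is an assumption, and the sample values are copied from the table in the question.

```python
def format_overlap_table(target, websites, scores):
    """Return a plain-text table pairing each website with its overlap score."""
    if len(websites) != len(scores):
        raise ValueError("websites and scores must have the same length")
    # Pad the first column to the longest name plus a small gap.
    width = max(len(name) for name in [*websites, "Website"]) + 4
    lines = [f"{'Website':<{width}}{target}"]
    for name, score in zip(websites, scores):
        lines.append(f"{name:<{width}}{score:.1f}")
    return "\n".join(lines)


# Sample values taken from the table in the question.
websites = ["ebay.com", "pinterest.com", "wikipedia.org", "facebook.com"]
scores = [70.1, 54.7, 51.3, 50.4]

table = format_overlap_table("amazon.com", websites, scores)
print(table)
```

The length check guards against the two selectors returning a different number of elements, in which case `zip` would otherwise drop the unmatched entries silently.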
