I would like to extract information about website similarity from this link:

https://www.alexa.com/siteinfo/amazon.com

I am looking at class='site', trying to extract information from

<a href="/siteinfo/ebay.com" class="truncation">ebay.com</a>

but I can see only one value. Is it possible to extract all 4 values and their overlap scores?

What I am trying to achieve is a table which includes this information

W                      amazon.com              
eBay.com                   70.1
pinterest.com              54.7
wikipedia.org              51.3
facebook.com               50.4

I have tried

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
print([item.get_text(strip=True) for item in soup.select("span.site")]) 

but this does not seem to be enough to get the information, probably because the selector in my code is wrong.

  • It appears you want span.truncation, a.truncation, or div.site Commented Jan 21, 2021 at 2:13
  • Thank you for your comment, OneCricketeer. Using the inspect tool in Google Chrome, I can see only the span for overlap score and site. I cannot see the tags you mentioned Commented Jan 21, 2021 at 2:20
  • This page uses JavaScript to add elements, but BeautifulSoup and requests can't run JavaScript. You may need Selenium to control a real web browser that can run JavaScript Commented Jan 21, 2021 at 2:31
  • That's not true @furas. While it does use JS for some features, the table the OP refers to is loaded normally and can be detected without the need for a headless browser Commented Jan 21, 2021 at 2:33
  • a.truncation is the element that you've shown in the question. And the scores look like <span class="truncation">38.0</span>, so span.truncation. For site classes, those are only on div elements Commented Jan 21, 2021 at 4:17

1 Answer

Your CSS selector was a good start, but it targeted the wrong element: the site class is on div elements, not on span. The CSS selectors that you should have used are:

  • Websites: #card_mini_audience .site>a
  • Scores: #card_mini_audience .overlap>.truncation

These selectors narrow the focus to the div where the table is stored and then make use of the class labels to extract your desired information.
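To see the difference between the selectors without fetching the page, here is a minimal sketch that runs them against a small inline HTML snippet. The markup is an assumption on my part, reconstructed from the selectors above and the elements quoted in the comments (the `Row` class in particular is made up); the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reconstructed from this thread; not copied from the real page.
html = """
<div id="card_mini_audience">
  <div class="Row">
    <div class="site"><a href="/siteinfo/ebay.com" class="truncation">ebay.com</a></div>
    <div class="overlap"><span class="truncation">70.1</span></div>
  </div>
  <div class="Row">
    <div class="site"><a href="/siteinfo/pinterest.com" class="truncation">pinterest.com</a></div>
    <div class="overlap"><span class="truncation">54.7</span></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "span.site" only matches <span> elements that carry the class "site".
# In this snippet the class is on <div> elements, so nothing is selected.
span_matches = soup.select("span.site")

# The scoped selectors start at the card's id and then follow the class labels.
site_names = [a.get_text(strip=True) for a in soup.select("#card_mini_audience .site>a")]
score_texts = [s.get_text(strip=True) for s in soup.select("#card_mini_audience .overlap>.truncation")]

print(len(span_matches))
print(site_names)
print(score_texts)
```

On this snippet, span.site selects nothing, while the scoped selectors return one entry per row.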

I have attached some example code below that solves your issue. I just printed the results to the screen but it can easily be changed to do whatever you want with the values.

from bs4 import BeautifulSoup
import requests

# Get the page and parse it
url = "https://www.alexa.com/siteinfo/amazon.com"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Use CSS selectors to grab the content
websites = soup.select("#card_mini_audience .site>a")  # the websites in the table
scores = soup.select("#card_mini_audience .overlap>.truncation")  # the corresponding scores

# Go through each list and extract just the text
websites = [website.text.strip() for website in websites]
scores = [float(score.text.strip()) for score in scores]  # convert the scores to floats

# Ordinary print to screen. You can change this to add to a dataframe or whatever else you want for your project
for pair in zip(websites, scores):
    print(pair)

The output looks like this:

('ebay.com', 70.1)
('pinterest.com', 54.7)
('wikipedia.org', 51.3)
('facebook.com', 50.4)
('reddit.com', 49.6)
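If you want something closer to the table in your question, one possible approach is to format the pairs as plain text. This is only a sketch: `format_overlap_table` is a helper name I made up for illustration, the "Website" column header is my own choice, and the lists below are hard-coded from the output above so the example runs without a network request.

```python
def format_overlap_table(site, websites, scores):
    """Return a plain-text table of overlap scores for one site (hypothetical helper)."""
    # Pad the first column to the longest name plus a small gap
    width = max(len(name) for name in [*websites, "Website"]) + 4
    lines = [f"{'Website':<{width}}{site}"]
    for name, score in zip(websites, scores):
        lines.append(f"{name:<{width}}{score:.1f}")
    return "\n".join(lines)


# Values copied from the output shown above
websites = ["ebay.com", "pinterest.com", "wikipedia.org", "facebook.com", "reddit.com"]
scores = [70.1, 54.7, 51.3, 50.4, 49.6]

table = format_overlap_table("amazon.com", websites, scores)
print(table)
```

In the scraping script, you would pass in the websites and scores lists built from the selectors instead of the hard-coded values.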