I'm trying to scrape a Quizlet match set with Python. I want to extract every <span> tag with the class TermText.

Here's the URL: 'https://quizlet.com/291523268'

import requests

URL = 'https://quizlet.com/291523268'
raw = requests.get(URL).text

raw ends up containing none of the tags or cards. When I view the page source in the browser, it shows all the TermText spans I need, so they are not loaded by JavaScript. I don't understand why the HTML I get back is missing everything I'm looking for.

1 Answer

To get the correct response from the server, set a browser-like User-Agent HTTP header:

import requests
from bs4 import BeautifulSoup


url = 'https://quizlet.com/291523268/python-flash-cards/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for span in soup.select('span.TermText'):
    print(span.get_text(strip=True))

Prints:

algorithm
A set of specific steps for solving a category of problems
token
basic elements of a language(letters, numbers, symbols)
high-level language
A programming language like Python that is designed to be easy for humans to read and write.
low-level langauge

...and so on.
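
The printed output appears to alternate between a term and its definition, so the spans can be grouped into pairs. Below is a minimal sketch of that idea, assuming the alternation holds for the whole page. To keep it runnable without extra packages it uses the standard library's html.parser rather than BeautifulSoup, and the names TermTextParser, term_pairs, and the sample markup are hypothetical, not taken from Quizlet's actual page.

```python
from html.parser import HTMLParser


class TermTextParser(HTMLParser):
    """Collect the text of every <span> whose class list includes TermText."""

    def __init__(self):
        super().__init__()
        self.texts = []   # finished span texts, in document order
        self._depth = 0   # > 0 while inside a TermText span
        self._parts = []  # text fragments of the span being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1  # nested span inside a TermText span
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "TermText" in classes:
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def term_pairs(html):
    """Return (term, definition) tuples, assuming the spans alternate."""
    parser = TermTextParser()
    parser.feed(html)
    parser.close()
    texts = parser.texts
    return list(zip(texts[0::2], texts[1::2]))


# Hypothetical markup that imitates the structure implied by the output above.
sample = (
    '<div><span class="TermText">algorithm</span>'
    '<span class="TermText">A set of specific steps</span>'
    '<span class="other">ignored</span>'
    '<span class="TermText">token</span>'
    '<span class="TermText">basic <b>elements</b> of a language</span></div>'
)
pairs = term_pairs(sample)
```

With a real response, term_pairs(requests.get(url, headers=headers).text) would be the equivalent call. If the page ever contains an odd number of TermText spans, zip silently drops the last one, so the pairing is worth checking by eye.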

2 Comments

Why did you need to send the User-Agent, @Andrej Kesely?
@AaravM4 Without a User-Agent you get a Cloudflare captcha page. Setting the User-Agent is the first thing I try when a server returns this kind of page.
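
One way to notice this situation early is to check the response before parsing it. The sketch below is only a heuristic, assuming the real page always contains the string TermText and is served with status 200. The function name has_expected_markup and the sample bodies are hypothetical.

```python
def has_expected_markup(status_code, body, marker="TermText"):
    """Return True if the response looks like the real page, not a block page.

    Heuristic only: it assumes the real page is served with status 200
    and always contains the marker string.
    """
    return status_code == 200 and marker in body


# Hypothetical bodies standing in for a real page and a captcha page.
real_page = '<span class="TermText">algorithm</span>'
captcha_page = "<html><title>Just a moment...</title></html>"

ok_real = has_expected_markup(200, real_page)
ok_captcha = has_expected_markup(200, captcha_page)
```

With requests, this could be called as has_expected_markup(resp.status_code, resp.text) on the response object. A False result suggests trying different headers before debugging the selector.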
