I am trying to scrape all the headlines from the New York Times homepage and count how many of them mention a certain word. In this particular case, I want to see how many headlines mention either the coronavirus or Trump. This is my code, but it doesn't work: 'number' stays at the value I give it before the while loop.

import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
a = soup.findAll("h2", class_="esl82me0")

for story_heading in a:
    print(story_heading.contents[0])

lijst = ["trump", "Trump", "Corona", "COVID", "virus", "Virus", "Coronavirus", "COVID-19"]
number = 0
run = 0

while run < len(a)+1:
    run += 1
     if any(lijst in s for s in a)
        number += 1

print("\nTrump or the Corona virus have been mentioned", number, "times.")

So I basically want the variable 'number' to increase by 1 if a headline (which is an entry in the list a) contains the word Trump or Coronavirus, or both.

Does anyone know how to do this?

  • This doesn't count as an answer, since I'm not giving you a complete solution, but typically you would do the following: 1. Fetch the contents. 2. Convert all text to lowercase so that matching is simpler. 3. Tokenize the text into individual tokens. Good options are SpaCy and NLTK. 4. Count and sort the tokens; collections.Counter would do the trick for you. Commented Apr 12, 2020 at 18:36
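
The steps in this comment could be sketched roughly as follows. This is only an illustration, not the commenter's code: the `headlines` list is made-up sample data standing in for the fetched text, `count_words` is a hypothetical helper, and a simple regular expression stands in for a SpaCy or NLTK tokenizer.

```python
import re
from collections import Counter

# Made-up sample headlines standing in for the scraped text (assumption).
headlines = [
    "Trump Says Virus Response Is Working",
    "Coronavirus Cases Rise Across Europe",
    "Markets Rally as Virus Fears Ease",
]

def count_words(texts):
    """Lowercase each text, split it into word tokens, and tally them."""
    counts = Counter()
    for text in texts:
        # A plain regex tokenizer; NLTK or SpaCy would handle edge cases better.
        tokens = re.findall(r"[a-z0-9'-]+", text.lower())
        counts.update(tokens)
    return counts

word_counts = count_words(headlines)
# Look up individual keywords; missing words simply count as 0.
print(word_counts["virus"], word_counts["trump"])
```

Because everything is lowercased first, a single keyword such as "virus" covers both "virus" and "Virus", so the keyword list no longer needs one entry per capitalization.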

1 Answer

In general, I recommend putting more thought into naming variables. I like how you tried to print the story headings. The line if any(lijst in s for s in a) does not do what you think it does: you instead need to iterate over each word in a single h2. The any function is just shorthand for the following:

def any(iterable):
    for element in iterable:
        if element:
            return True
    return False

In other words, you're checking whether the entire list is in an h2 element, which will never be true. Here is an example fix.

import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
# Select the headline <h2> elements by their CSS class.
h2s = soup.find_all("h2", class_="esl82me0")

for story_heading in h2s:
    print(story_heading.contents[0])

keywords = ["trump", "Trump", "Corona", "COVID", "virus", "Virus", "Coronavirus", "COVID-19"]
number = 0

for h2 in h2s:
    headline = h2.text
    # Split the headline into words and compare each word against the keywords.
    words_in_headline = headline.split(" ")
    for word in words_in_headline:
        if word in keywords:
            number += 1

print("\nTrump or the Corona virus have been mentioned", number, "times.")

Output

Trump or the Corona virus have been mentioned 7 times.
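
Since the question asks for the number of headlines that mention a keyword, rather than the number of matching words, a variant that applies any() to one headline at a time might look like the sketch below. It is not part of the original answer: the sample `headlines` are invented, `count_matching_headlines` is a hypothetical helper, and matching is done by case-insensitive substring so that forms such as "Trump's" or "virus," also count.

```python
# Invented sample headlines; in the real script these would be the h2.text values (assumption).
headlines = [
    "Trump's Briefing Turns to the Virus",
    "How to Bake Bread at Home",
    "COVID-19 Testing Expands in New York",
]

# Lowercase keywords; "virus" also matches "coronavirus" as a substring.
keywords = ["trump", "corona", "covid", "virus"]

def count_matching_headlines(headlines, keywords):
    """Count headlines containing at least one keyword, ignoring case."""
    count = 0
    for headline in headlines:
        lowered = headline.lower()
        # any() stops at the first keyword found, so each headline counts once.
        if any(keyword in lowered for keyword in keywords):
            count += 1
    return count

number = count_matching_headlines(headlines, keywords)
print("Trump or the coronavirus were mentioned in", number, "headlines.")
```

The first sample headline contains two keywords but is counted only once. Substring matching is a trade-off: it tolerates punctuation and possessives, but it would also match unrelated words that happen to contain a keyword, such as "trumpet".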

1 Comment

It's so satisfying to see the logic behind code and now I finally understand what I did wrong. Thank you so much! :-)
