I'd like to count how often each word in a list appears on a specific website. However, the code doesn't return the same counts that a manual Ctrl+F search would. What am I doing wrong?

Here's my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re

url='https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr=[] 
wanted = ['tender','2020','date']    
for word in wanted:
    a=requests.get(url).text.count(word)
    dic={'phrase':word,
          'frequency':a,              
            }          
    fr.append(dic)  
    print('Frequency of',word, 'is:',a)
data=pd.DataFrame(fr)    
  • Read this article for tips about debugging your code. Commented Apr 27, 2021 at 22:49
  • One thing to be aware of: requests might not give you the exact same text as you see in your browser. This can happen, for example, if the web page has JavaScript code that modifies the contents of the page. Your browser executes that code, but requests will not. On the other hand, selenium will give you exactly the same thing as you see in your browser. If you know there is JavaScript code, then you should use selenium instead of requests. Commented Apr 27, 2021 at 22:51
  • Please supply the expected minimal, reproducible example (MRE). We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output. Commented Apr 27, 2021 at 22:53
  • In particular, what are the specific discrepancies? Which is the correct value, and why? What do the interface documents say about their operation? Perhaps they have different definitions of counting a given word, such that a difference is actually the correct response. Commented Apr 27, 2021 at 22:53

2 Answers

Refer to the comments in your question to see why using requests might be a bad idea to count the frequency of a word in the "visible spectrum" of a webpage (what you actually see in the browser).

If you want to go about this with selenium, you could try:

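from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'

# Assumes Selenium 4.6+, which can locate a matching chromedriver on its own
driver = webdriver.Chrome()
driver.get(url)
# find_element_by_tag_name was removed in Selenium 4; use find_element(By.TAG_NAME, ...)
# .lower() accounts for str.count being case-sensitive
body_text = driver.find_element(By.TAG_NAME, 'body').text.lower()
driver.quit()  # close the browser once the text has been read

fr = []
wanted = ['tender', '2020', 'date']
for word in wanted:
    freq = body_text.count(word)
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)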

which gave me the same results that a CTRL + F does.
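
To illustrate why the .lower() call matters, here is a minimal sketch on a made-up sentence (the sample text and the count_phrase helper are assumptions for illustration, not part of the page or of any library). It also shows that a plain substring count, like Ctrl+F, matches "date" inside "updated":

```python
import re

def count_phrase(text, phrase, whole_word=False):
    """Count case-insensitive occurrences of phrase in text."""
    if whole_word:
        # \b restricts matches to whole words, so "updated" no longer counts for "date"
        pattern = r"\b" + re.escape(phrase) + r"\b"
        return len(re.findall(pattern, text, flags=re.IGNORECASE))
    return text.lower().count(phrase.lower())

# Hypothetical sample text, not taken from the gov.uk page
sample = "Tender documents updated. The tender date is to be confirmed."

case_sensitive = sample.count("tender")                        # misses "Tender"
case_insensitive = count_phrase(sample, "tender")              # matches both spellings
date_substring = count_phrase(sample, "date")                  # also matches "updated"
date_whole_word = count_phrase(sample, "date", whole_word=True)

print(case_sensitive, case_insensitive, date_substring, date_whole_word)
```

Which of these counts is the "right" one depends on what you want to compare against; a default Ctrl+F search behaves like the case-insensitive substring count.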

You can test BeautifulSoup too (which you're importing by the way) by modifying your code a little bit:

import requests
from bs4 import BeautifulSoup

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr = [] 
wanted = ['tender','2020','date']    
a = requests.get(url).text
soup = BeautifulSoup(a, 'html.parser')
for word in wanted:
    freq = soup.get_text().lower().count(word)
    dic = {'phrase': word, 'frequency': freq}          
    fr.append(dic)  
    print('Frequency of', word, 'is:', freq)

That gave me the same results, except for the word tender, which according to BeautifulSoup appears 12 times, and not 11. Test them out for yourself and see what suits you.

4 Comments

This is excellent! thank you so much for your insight Camilo !!
Selenium worked perfectly for me! This is however only a snippet from a code which loops through a dataframe containing URLs and counts specific keywords for each URL. I have 20+ URLs in the DF, is there a way to not have that many windows open while looping through the URLs with Selenium? Thank you again for your great answer!
I'm glad it helped you. Regarding the opened browser windows, I am not sure. I haven't tried this, but a quick search led me here (the second answer, not the accepted one): stackoverflow.com/questions/7593611/…. That should get you going.
Thank you again, I appreciate you taking the time to answer :) !! This worked for me as well! Have a great day !

When I tried your code on the word "Tender", a=requests.get(url).text.count(word) returned many more results than Ctrl+F, which was odd because I expected it to return fewer (text.count is case-sensitive, and HTML sometimes breaks elements across multiple lines). But if you print the response text (requests.get(url).text) and read through it, you'll notice that it contains elements that aren't displayed on the page, and that "Tender" also appears many times inside the tags themselves. I'd advise you to use BeautifulSoup or find some other way to avoid counting the invisible text.
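
As a rough sketch of that difference, the snippet below compares a count over raw HTML with a count over extracted text. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra packages; the sample HTML, the VisibleTextParser class and the visible_text helper are all made up for illustration. It only skips script, style and title contents, so elements hidden with CSS would still be counted.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of script, style and title."""
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values never reach this method, only text between tags
        if self._skip_depth == 0:
            self.parts.append(data)

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page, not the real gov.uk markup
sample_html = (
    '<html><head><title>Tender notice</title>'
    '<meta name="description" content="Tender information">'
    '<script>var page = "tender";</script></head>'
    '<body><a href="/tender-2016" class="tender-link">Tender</a>'
    '<p>The tender closes soon.</p></body></html>'
)

raw_count = sample_html.lower().count("tender")                  # includes tags and attributes
text_count = visible_text(sample_html).lower().count("tender")   # text nodes only

print(raw_count, text_count)
```

In this sample the raw count is inflated by the title, the meta description, the script, and the href and class attributes, none of which a Ctrl+F search in the browser would find.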

One small thing, by the way: you can assign requests.get(url).text to a variable outside the loop so you don't send a request on every iteration.
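
A minimal sketch of that restructuring, assuming the page text has already been fetched once and is passed in as a string (count_words and the sample text are hypothetical names and data used only for illustration):

```python
def count_words(text, words):
    """Return one {'phrase', 'frequency'} record per word, reusing the same text."""
    lowered = text.lower()  # lowercase once instead of once per word
    return [{"phrase": w, "frequency": lowered.count(w.lower())} for w in words]

# Stand-in for the downloaded page; in the question's code this would be
# the result of a single requests.get(url).text call made before the loop
page_text = "Tender 2016: the tender date was updated."

records = count_words(page_text, ["tender", "2020", "date"])
for record in records:
    print("Frequency of", record["phrase"], "is:", record["frequency"])
```

The resulting list of dictionaries has the same shape as fr in the question, so it can be passed to pd.DataFrame in the same way.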

1 Comment

Got it!! Thank you so much Reda for your contribution !!
