Code for counting word frequency in a website using Python doesn't output the right frequency

Question

I'd like to count the frequency of a list of words in a specific website. The code however doesn't return the exact number of words that a manual "control F" command would. What am I doing wrong?

Here's my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re

url='https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr=[] 
wanted = ['tender','2020','date']    
for word in wanted:
    a=requests.get(url).text.count(word)
    dic={'phrase':word,
          'frequency':a,              
            }          
    fr.append(dic)  
    print('Frequency of',word, 'is:',a)
data=pd.DataFrame(fr)

One thing to be aware of: requests might not give you the exact same text as you see in your browser. This can happen, for example, if the web page has JavaScript code that modifies the contents of the page. Your browser executes that code, but requests will not. On the other hand, selenium will give you exactly the same thing as you see in your browser. If you know there is JavaScript code, then you should use selenium instead of requests.
– Code-Apprentice, Commented Apr 27, 2021 at 22:51
Please supply the expected minimal, reproducible example (MRE). We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output.
– Prune, Commented Apr 27, 2021 at 22:53
In particular, what are the specific discrepancies? Which is the correct value, and why? What do the interface documents say about their operation? Perhaps they have different definitions of counting a given word, such that a difference is actually the correct response.
– Prune, Commented Apr 27, 2021 at 22:53
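
To illustrate the point about differing definitions of a count: str.count counts substring matches, as a browser's find-in-page typically does by default, so "date" is also found inside "update". A minimal sketch comparing that with a whole-word count; the sample string is made up and the helper names are hypothetical:

```python
import re

def count_substring(text, word):
    """Case-insensitive substring count: 'date' also matches inside 'update'."""
    return text.lower().count(word.lower())

def count_whole_word(text, word):
    """Case-insensitive count of `word` only where it stands alone as a word."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up sample text, not taken from the page in the question.
sample = "Tender update: the tender date is the closing date. Dates may change."

print(count_substring(sample, "date"))   # 4: update, date, date, Dates
print(count_whole_word(sample, "date"))  # 2: only the two standalone "date"
```

Which of the two is "correct" depends on what the count is meant to measure, so it helps to decide that before comparing numbers.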

Camilo Martínez M. · Accepted Answer · 2021-04-27 23:15:27Z

1

See the comments on your question for why requests may be a poor choice for counting how often a word appears in the "visible spectrum" of a webpage (what you actually see in the browser).

If you want to go about this with selenium, you could try:

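from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'

# Selenium 4.6 and later can locate a matching chromedriver automatically;
# older versions need the driver path passed in explicitly.
driver = webdriver.Chrome()
driver.get(url)
# find_element_by_tag_name was removed in Selenium 4.3; use find_element(By.TAG_NAME, ...)
# .lower() to account for count's case-sensitive behaviour
text = driver.find_element(By.TAG_NAME, 'body').text.lower()
driver.quit()

fr = []
wanted = ['tender', '2020', 'date']
for word in wanted:
    freq = text.count(word)
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)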

which gave me the same results that a CTRL + F does.

You can test BeautifulSoup too (which you're importing by the way) by modifying your code a little bit:

import requests
from bs4 import BeautifulSoup

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr = []
wanted = ['tender', '2020', 'date']
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
# Extract and lowercase the page text once, outside the loop
text = soup.get_text().lower()
for word in wanted:
    freq = text.count(word)
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)

That gave me the same results, except for the word tender, which according to BeautifulSoup appears 12 times, and not 11. Test them out for yourself and see what suits you.

answered Apr 27, 2021 at 23:15

Camilo Martínez M.

4 Comments

Fatima El Mansouri Over a year ago

This is excellent! thank you so much for your insight Camilo !!

Fatima El Mansouri Over a year ago

Selenium worked perfectly for me! However, this is only a snippet from code that loops through a dataframe containing URLs and counts specific keywords for each URL. I have 20+ URLs in the DF; is there a way to avoid having that many windows open while looping through the URLs with Selenium? Thank you again for your great answer!

Camilo Martínez M. Over a year ago

I'm glad it helped you. Regarding the opened browser windows, I am not sure. I haven't tried this, but a quick search led me here (the second answer, not the accepted one): stackoverflow.com/questions/7593611/…. That should get you going.

Fatima El Mansouri Over a year ago

Thank you again, I appreciate you taking the time to answer :) !! This worked for me as well! Have a great day !
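
For the follow-up about many open windows, one common approach is to run Chrome in headless mode and reuse a single driver for every URL. The sketch below has not been run against the asker's data. The names count_words_in_pages, make_headless_driver, and count_with_selenium are hypothetical, and it assumes Selenium 4 with a local Chrome install. The counting loop takes the text-fetching function as a parameter, so it can be tried without a browser.

```python
def count_words_in_pages(urls, wanted, get_text):
    """Return one row per (url, phrase), using get_text(url) for the page text."""
    rows = []
    for url in urls:
        text = get_text(url).lower()  # fetch each page once, then count every word
        for word in wanted:
            rows.append({'url': url, 'phrase': word,
                         'frequency': text.count(word.lower())})
    return rows

def make_headless_driver():
    """Create a Chrome driver that opens no visible window (assumes Selenium 4)."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument('--headless')
    return webdriver.Chrome(options=options)

def count_with_selenium(urls, wanted):
    """Reuse one headless browser for all URLs and close it at the end."""
    from selenium.webdriver.common.by import By
    driver = make_headless_driver()

    def body_text(url):
        driver.get(url)
        return driver.find_element(By.TAG_NAME, 'body').text

    try:
        return count_words_in_pages(urls, wanted, body_text)
    finally:
        driver.quit()  # always release the browser, even if a page fails
```

Calling count_with_selenium(list_of_urls, ['tender', '2020', 'date']) returns a list of dicts that can be passed straight to pd.DataFrame.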

Khoa Nguyen · Accepted Answer · 2021-04-28 09:44:35Z

1

When I tried your code on the word "Tender", a=requests.get(url).text.count(word) returned many more matches than Ctrl+F. That surprised me: I expected it to return fewer, since text.count is case-sensitive and HTML sometimes breaks an element across multiple lines. But if you print the response text (requests.get(url).text) and read through it, you'll notice that it contains elements that aren't displayed on the page, and that "Tender" appears many times inside the tags themselves. I'd advise you to use BeautifulSoup, or find some other way to skip the invisible text.
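
One way to leave out text that isn't displayed, without extra dependencies, is sketched below. It swaps BeautifulSoup for the standard library's html.parser and skips the contents of script, style, title, and a few other non-rendered elements. The class name VisibleTextParser and the sample HTML are made up for illustration. It does not handle elements hidden by CSS or content added by JavaScript, so the counts may still differ from the browser's.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring tag attributes and non-rendered elements."""
    SKIPPED = {'script', 'style', 'title', 'noscript', 'template'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def visible_text(html):
    """Return the text outside tags and skipped elements, whitespace-collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.parts).split())

# Made-up HTML: "tender" appears in the title, a script, and two attributes.
sample_html = """
<html><head><title>Tender 2016</title>
<script>var page = "tender";</script></head>
<body><a href="/tender" class="tender-link">Tender
documents</a><p>Submit your tender.</p></body></html>
"""

raw_count = sample_html.lower().count('tender')
visible_count = visible_text(sample_html).lower().count('tender')
print(raw_count, visible_count)  # 6 2
```

The gap between the two numbers comes from the same sources the answer describes: matches inside tags and inside elements the browser never renders.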

One more small thing: you can store requests.get(url).text in a variable outside the loop, so you don't send a new request on every iteration.
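
The suggestion above amounts to downloading the page once and reusing the text for every word. A minimal sketch, where count_frequencies and fetch_and_count are hypothetical helper names and the text is counted as-is (tags included, case-sensitive), as in the question's code:

```python
def count_frequencies(text, wanted):
    """Count each phrase in text that has already been downloaded."""
    return [{'phrase': word, 'frequency': text.count(word)} for word in wanted]

def fetch_and_count(url, wanted):
    """Send a single request, then count every phrase in the response text."""
    import requests  # imported here so count_frequencies works without it
    text = requests.get(url).text  # one request, outside the loop
    return count_frequencies(text, wanted)

# Offline demonstration on a made-up string.
demo = count_frequencies('tender date, Tender 2016', ['tender', '2020', 'date'])
print(demo)
```

The list returned by fetch_and_count(url, wanted) has the same shape as fr in the question, so pd.DataFrame(...) works on it unchanged.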

edited Apr 28, 2021 at 9:44

Khoa Nguyen

answered Apr 27, 2021 at 23:09

Reda loukhnati

1 Comment

Fatima El Mansouri Over a year ago

Got it!! Thank you so much Reda for your contribution !!
