2

I'm trying to scrape synonyms from thesaurus.com using Python. The site presents the synonyms in an unordered list.

from lxml import html
import requests
term = (input("Enter in a term to find the synonyms of: "))
page = requests.get('http://www.thesaurus.com/browse/' + term.lower(),allow_redirects=True)
if page.status_code == 200:
    tree = html.fromstring(page.content)
    synonyms = tree.xpath('//div[@class="relevancy-list"]/text()')
    print(synonyms)
else:
    print("No synonyms found!")

My code outputs only whitespace instead of the synonyms. How do I scrape the actual synonyms instead of the whitespace?

2 Answers

1

/text() selects only the text nodes that are direct children of the current tag. Your current code does not return the synonyms because they sit in other tags nested inside the div, so all you get back is the whitespace between those tags.

You could use //text() instead, which selects the text under the current tag at any depth. But this returns ALL text, including the parts you don't need.
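To make the difference concrete, here is a minimal sketch that runs both expressions against a small inline HTML sample. The markup is an assumption, loosely modelled on the structure described in this answer (the "star" span is invented to stand in for unwanted text); it is not a copy of the real page.

```python
from lxml import html

# Hypothetical markup: an assumption for illustration, not the real page source.
SAMPLE = """
<html><body>
<div class="relevancy-list">
  <ul>
    <li><a href="#"><span class="text">firm</span><span class="star">star</span></a></li>
    <li><a href="#"><span class="text">fixed</span><span class="star">star</span></a></li>
  </ul>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# /text(): only text nodes that are direct children of the div.
# Here that is just the whitespace around the <ul>.
direct = tree.xpath('//div[@class="relevancy-list"]/text()')

# //text(): every text node below the div, wanted or not.
everything = tree.xpath('//div[@class="relevancy-list"]//text()')

# Drop whitespace-only entries so the difference is easy to see.
direct_clean = [t.strip() for t in direct if t.strip()]
everything_clean = [t.strip() for t in everything if t.strip()]

print(direct_clean)
print(everything_clean)
```

With this sample, the first list ends up empty while the second contains both the synonyms and the unwanted "star" text.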

For your use case, since the synonyms are inside a <span class="text"> tag, you can use this XPath:

//div[@class="relevancy-list"]//span[@class="text"]/text()

which selects the text of every span with class "text" nested anywhere inside a div with class "relevancy-list".

For the input term "set", the output using that XPath is:

['firm', 'bent', 'stated', 'specified', 'rooted', 'established', 'confirmed', 'pat', 'immovable', 'obstinate', 'ironclad', 'predetermined', 'intent', 'entrenched', 'appointed', 'regular', 'prescribed', 'determined', 'scheduled', 'fixed', 'settled', 'certain', 'customary', 'decisive', 'definite', 'inveterate', 'pigheaded', 'resolute', 'rigid', 'steadfast', 'stubborn', 'unflappable', 'usual', 'concluded', 'agreed', 'resolved', 'stipulated', 'arranged', 'prearranged', 'dead set on', 'hanging tough', 'locked in', 'set in stone', 'solid as a rock', 'stiff-necked', 'well-set', 'immovable', 'entrenched', 'located', 'solid', 'situate', 'stiff', 'placed', 'stable', 'fixed', 'settled', 'situated', 'rigid', 'strict', 'stubborn', 'unyielding', 'hidebound', 'positioned', 'sited', 'jelled', 'hard and fast', 'deportment', 'comportment', 'fit', 'presence', 'mien', 'hang', 'carriage', 'air', 'turn', 'attitude', 'address', 'demeanor', 'position', 'inclination', 'port', 'posture', 'setting', 'scene', 'scenery', 'flats', 'stage set', u'mise en sc\xe8ne', 'series', 'array', 'lot', 'collection', 'batch', 'crowd', 'cluster', 'gang', 'bunch', 'crew', 'circle', 'body', 'coterie', 'faction', 'company', 'bundle', 'outfit', 'band', 'clique', 'mob', 'kit', 'class', 'clan', 'compendium', 'clutch', 'camp', 'sect', 'push', 'organization', 'clump', 'assemblage', 'pack', 'gaggle', 'rat pack', 'locate', 'head', 'prepare', 'fix', 'introduce', 'turn', 'settle', 'lay', 'install', 'put', 'apply', 'post', 'establish', 'wedge', 'point', 'lock', 'affix', 'direct', 'rest', 'seat', 'station', 'plop', 'spread', 'lodge', 'situate', 'plant', 'park', 'bestow', 'train', 'stick', 'plank', 'arrange', 'insert', 'level', 'plunk', 'mount', 'aim', 'cast', 'deposit', 'ensconce', 'fasten', 'embed', 'anchor', 'make fast', 'make ready', 'zero in', 'appoint', 'name', 'schedule', 'make', 'impose', 'stipulate', 'settle', 'determine', 'establish', 'fix', 'specify', 'designate', 'decree', 'resolve', 'rate', 'conclude', 'price', 
'prescribe', 'direct', 'value', 'ordain', 'allocate', 'instruct', 'allot', 'dictate', 'estimate', 'regulate', 'assign', 'arrange', 'lay down', 'agree upon', 'fix price', 'fix', 'stiffen', 'thicken', 'condense', 'jelly', 'clot', 'congeal', 'solidify', 'cake', 'coagulate', 'jell', 'gelatinize', 'crystallize', 'jellify', 'gel', 'become firm', 'gelate', 'drop', 'subside', 'sink', 'vanish', 'dip', 'disappear', 'descend', 'go down', 'initiate', 'begin', 'raise', 'abet', 'provoke', 'instigate', 'commence', 'foment', 'whip up', 'put in motion', 'set on', 'stir up']

Note that you will get the synonyms for all senses of the word.

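To get the synonyms per sense, loop over the result of //div[@class="relevancy-list"] yourself and, for each div found, extract .//span[@class="text"]/text(). Note the leading dot: when you call xpath() on an element, an expression starting with // still searches the whole document, while .// searches only inside that element.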
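A rough sketch of that per-sense loop, again on a made-up sample rather than the live page. The helper name synonyms_per_sense and the sample markup are hypothetical. The inner XPath starts with a dot because it is evaluated on each div element; without the dot, lxml would search the whole document again.

```python
from lxml import html

# Hypothetical markup with two "senses": an assumption for illustration only.
SAMPLE = """
<html><body>
  <div class="relevancy-list"><ul>
    <li><span class="text">firm</span></li>
    <li><span class="text">fixed</span></li>
  </ul></div>
  <div class="relevancy-list"><ul>
    <li><span class="text">array</span></li>
    <li><span class="text">batch</span></li>
  </ul></div>
</body></html>
"""


def synonyms_per_sense(tree):
    """Return one list of synonyms per div with class "relevancy-list"."""
    groups = []
    for block in tree.xpath('//div[@class="relevancy-list"]'):
        # The leading dot restricts the search to this block's descendants.
        groups.append(block.xpath('.//span[@class="text"]/text()'))
    return groups


tree = html.fromstring(SAMPLE)
per_sense = synonyms_per_sense(tree)

for number, group in enumerate(per_sense, start=1):
    print(number, group)
```

Each inner list corresponds to one div, so the grouping by sense is preserved instead of being flattened into one long list.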


0
import requests
from bs4 import BeautifulSoup

term = input("Enter in a term to find the synonyms of: ")
page = requests.get('http://www.thesaurus.com/browse/' + term.lower(), allow_redirects=True)

if page.status_code == 200:
    soup = BeautifulSoup(page.content, 'html.parser')
    get_syn_tag = soup.find('div', {'class': 'relevancy-list'})  # first synonym block only
    list_items = get_syn_tag.find_all('li')  # find_all is the current name for findAll
    synonyms = []  # collects every synonym so the list can be reused later
    for i in list_items:
        synonym = i.find('span', {'class': 'text'}).text
        print(synonym)  # prints one synonym per iteration
        synonyms.append(synonym)  # appends the synonym to the list
else:
    print("No synonyms found!")

Finding all the li tags first is more precise. In this case, however, the line below also works, as long as the div contains no other span with class "text":

synonym_list = [i.text for i in get_syn_tag.find_all('span', {'class': 'text'})]  # list of all synonyms in the div
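To illustrate when the two approaches differ, here is a small sketch on an inline sample. The markup is hypothetical: the extra heading span outside the list is invented purely to show the case the caveat is about, and the names via_li and via_span are mine.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the heading span is an invented stray match.
SAMPLE = """
<div class="relevancy-list">
  <span class="text">Synonyms for set</span>
  <ul>
    <li><span class="text">firm</span></li>
    <li><span class="text">fixed</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
container = soup.find("div", {"class": "relevancy-list"})

# Going through the li tags only picks up spans that are list entries.
via_li = [li.find("span", {"class": "text"}).text for li in container.find_all("li")]

# Matching spans directly also picks up the stray heading span.
via_span = [span.text for span in container.find_all("span", {"class": "text"})]

print(via_li)
print(via_span)
```

If the div holds no such stray span, both lists come out identical, which is why the shorter line is enough in that case.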
