Cleaning scraped text in Python

Question

I am new to Python and have just started learning web scraping with Beautiful Soup (in a Jupyter notebook). I scraped a book from Project Gutenberg and want to translate it. However, I had trouble cleaning the text and then translating it.

I want to get rid of the stuff at the beginning of the scraped text (e.g. BODY { color: Black; background: White;....) and then translate the entire text using the Google API.

I would be grateful for help or advice on both. My code so far is below. The translation code did not work and returned the following error: "WriteError: [Errno 32] Broken pipe"

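# Imports (googletrans is assumed here, based on the Translator() call below)
import requests
from bs4 import BeautifulSoup
from googletrans import Translator

# Store the URL and fetch the page
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
r = requests.get(url)
html = r.text
print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)
# Scrape the entire text using get_text() and print it
text = soup.get_text()
print(text)
# Translate the text using the Google API translator
# Init the Google API translator
translator = Translator()
translation = translator.translate(text, dest="ar")
print(translation)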

DisappointedByUnaccountableMod · Accepted Answer · 2021-05-12 06:03:26Z

Since you only want the book's text, inspect the page's elements: the text is written inside p tags, so you can collect those tags with the find_all method from the bs4 module and read the text from each one.

from bs4 import BeautifulSoup
import requests

url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
response = requests.get(url)
html = response.text
# print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")
# Collect every <p> element and print its text
paragraphs = soup.find_all("p")
for para in paragraphs:
    print(para.text)

Output:
"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.
...
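
If you would rather keep using get_text() on the whole page, here is a minimal sketch of another way to drop the CSS at the top, plus a helper that may help with the translation step. It removes the style and script elements before extracting text, then groups the remaining lines into smaller chunks. The function names visible_lines and chunk_lines and the max_chars value of 4000 are made up for illustration; the limit is not a documented figure for any translator. It is only a guess that the "Broken pipe" error comes from sending the whole book in a single request, so treat the chunking as something to try rather than a confirmed fix.

```python
from bs4 import BeautifulSoup


def visible_lines(html):
    """Return the non-empty text lines of a page, without CSS or JavaScript."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove <style> and <script> elements so their contents
    # (e.g. "BODY { color: Black; ... }") cannot end up in the text.
    for tag in soup(["style", "script"]):
        tag.decompose()
    # Strip surrounding whitespace from each line and drop empty lines.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return [line for line in lines if line]


def chunk_lines(lines, max_chars=4000):
    """Join lines with newlines into chunks of at most max_chars characters.

    max_chars is an assumed limit chosen for illustration. A single line
    longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for line in lines:
        candidate = line if not current else current + "\n" + line
        if current and len(candidate) > max_chars:
            # Adding this line would exceed the limit: close the current chunk.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the html from the code above, chunk_lines(visible_lines(html)) gives a list of strings. Each one can then be passed to the translator.translate(chunk, dest="ar") call from the question instead of passing the entire book at once.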

edited May 12, 2021 at 6:03

DisappointedByUnaccountableMod

answered May 12, 2021 at 4:32

Bhavya Parikh
