I am new to Python and have just started learning web scraping with Beautiful Soup (in a Jupyter notebook). I scraped a book from Project Gutenberg and want to translate it. However, I had trouble cleaning the text and then translating it.

I want to get rid of the stuff at the beginning of the scraped text (e.g. BODY { color: Black; background: White;....) and then translate the entire text using the Google API.

I would be grateful for help or advice on both. My code so far is below. The translation code did not work and returned the following error: "WriteError: [Errno 32] Broken pipe"

from bs4 import BeautifulSoup
import requests
from googletrans import Translator  # assumption: Translator comes from the googletrans package
# Store the URL and fetch the page
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
r = requests.get(url)
html = r.text
print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")
type(soup)
# Scrape the entire text using get_text() and print it
text = soup.get_text()
print(text)
# Translate the text using the Google API translator
# Init the Google API translator
translator = Translator()
translation = translator.translate(text, dest="ar")
print(translation)

1 Answer

Since you only want the book's text, note that on this page the text is written inside p tags. You can collect those elements with the find_all method from the bs4 module and read the text of each one, which leaves out the style rules at the top of the page.

from bs4 import BeautifulSoup
import requests
# Fetch the page
url = 'https://www.gutenberg.org/files/514/514-h/514-h.htm'
response = requests.get(url)
html = response.text
# print(html)
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html.parser")
# The book text is in <p> tags, so collect all of them
paragraphs = soup.find_all("p")
for para in paragraphs:
    # Print the text content of each paragraph
    print(para.text)

Output:
"Christmas won't be Christmas without any presents," grumbled Jo, lying
on the rug.
...
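
For the translation half of the question, one thing worth trying is to send the book in smaller pieces rather than as a single string. This is only a sketch under an assumption: that the "Broken pipe" error is related to the size of the request, which the question does not confirm. The helper names `chunk_paragraphs` and `translate_chunks` are hypothetical, and the 4000-character default is an arbitrary illustrative value, not a documented limit.

```python
from typing import Callable, Iterable, List


def chunk_paragraphs(paragraphs: Iterable[str], max_chars: int = 4000) -> List[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        # +1 accounts for the newline that joins paragraphs inside a chunk
        extra = len(para) + (1 if current else 0)
        if current and size + extra > max_chars:
            # The current chunk is full: store it and start a new one
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(para)
        current.append(para)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


def translate_chunks(chunks: Iterable[str], translate_one: Callable[[str], str]) -> str:
    """Translate each chunk with the supplied callable and join the results."""
    return "\n".join(translate_one(chunk) for chunk in chunks)


# Hypothetical usage with the soup object from the code above, assuming a
# synchronous Translator.translate that returns an object with a .text attribute:
# translator = Translator()
# chunks = chunk_paragraphs(p.text for p in soup.find_all("p"))
# arabic = translate_chunks(chunks, lambda c: translator.translate(c, dest="ar").text)
```

Passing the translation step in as a callable keeps the chunking logic independent of whichever translation library you end up using, so you can test it with a stand-in function before making any network calls.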