I am trying to use BeautifulSoup and Selenium to scrape YouTube playlists. I would like to save the HTML for a webpage to a text file so that, while I get the BeautifulSoup part working, I do not need to keep rerunning the rest of the script to open the browser and fetch the HTML.
This is a shortened version of my code, which raises the error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>".
I know I could save the text file with UTF-8 encoding, but I am not sure how I would convert it back to ASCII in order to parse it with BeautifulSoup.
My code:
from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup


def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Firefox()
    browser.get(playlist_url)
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser')  # creates a beautiful soup object 'soup'.
    html_save_path = Path(__file__).parent / ".//html_save_test.txt"
    with open(html_save_path, 'wt') as html_file:
        for line in soup.prettify():
            html_file.write(line)


test_html_save()
My question is simply: how can I save the entire HTML of a webpage to a .txt file?
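For reference, here is a minimal sketch of the UTF-8 round trip mentioned above, using only the standard library. The helper names save_html and load_html are hypothetical, and sample_html is a made-up stand-in for browser.page_source so the snippet runs without opening a browser. As far as I understand it, the error comes from open() falling back to the platform's default encoding, so passing an explicit encoding should avoid it, and no conversion back to ASCII should be needed, because reading the file with the same encoding gives back an ordinary Python str.

```python
from pathlib import Path
import tempfile


def save_html(html_text: str, path: Path) -> None:
    """Write the page source to disk as UTF-8 text (hypothetical helper)."""
    # An explicit encoding avoids the platform default codec
    # (for example cp1252 on Windows), which cannot represent '\u200b'.
    path.write_text(html_text, encoding="utf-8")


def load_html(path: Path) -> str:
    """Read the saved page source back into a Python str (hypothetical helper)."""
    return path.read_text(encoding="utf-8")


# Stand-in for browser.page_source; it contains the zero-width space
# ('\u200b') named in the error message.
sample_html = "<html><body><p>\u200bplaylist item</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "html_save_test.txt"
    save_html(sample_html, save_path)
    restored_html = load_html(save_path)

# The round trip preserves the text exactly.
print(restored_html == sample_html)
```

The restored string can then be passed to BeautifulSoup(restored_html, 'html.parser') in the same way as the original page source.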