How to save html to a text file with Python, Selenium and BeautifulSoup

Question

I am trying to use BeautifulSoup and Selenium to scrape YouTube playlists. I would like to save the HTML of a webpage to a text file so that, while I get BeautifulSoup working, I do not need to keep rerunning the part of the script that opens the browser and fetches the HTML.

This is a shortened version of my code, which raises the error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>".
I know I could save the text file as UTF-8, but I am not sure how I would convert it back to ASCII to parse it with BeautifulSoup.

My code:

from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup
def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Firefox()
    browser.get(playlist_url)
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser') # creates a beautiful soup object 'soup'.

    html_save_path = Path(__file__).parent / ".//html_save_test.txt"

    with open(html_save_path, 'wt') as html_file:
        for line in soup.prettify():
            html_file.write(line)

test_html_save()

My question is simply: how can I save the entire HTML of a webpage to a .txt file?

Maybe this helps? stackoverflow.com/questions/20906416/…

– Michael Ma, Commented Oct 27, 2020 at 16:13

Sushil · Accepted Answer · 2020-10-27 17:11:01Z

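Set the encoding parameter to utf-8 when opening the file. Without it, open() falls back to the platform's default encoding (on Windows typically a legacy code page, which is the 'charmap' codec in your error), and that encoding cannot represent characters such as '\u200b'. You do not need to convert anything back to ASCII afterwards: read the file with the same encoding and pass the resulting string to BeautifulSoup.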

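with open(html_save_path, 'wt', encoding='utf-8') as html_file:
    # prettify() returns a single string, so write it in one call
    # (iterating over it would write one character at a time)
    html_file.write(soup.prettify())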
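As a minimal sketch of why the encoding matters, the snippet below uses only the standard library. The `sample_html` string is a made-up stand-in for `browser.page_source`, and cp1252 is assumed as an example of a Windows default code page. It shows that the legacy encoding rejects '\u200b', while an explicit UTF-8 write and read returns the original text unchanged.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for browser.page_source: a small HTML string that
# contains the zero-width space (U+200B) named in the error message.
sample_html = "<html><body><p>\u200bWildest Dreams</p></body></html>"

# cp1252 (assumed here as a typical Windows default) has no byte for U+200B,
# so encoding with it raises UnicodeEncodeError.
try:
    sample_html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing and reading with an explicit UTF-8 encoding round-trips the text.
save_path = Path(tempfile.mkdtemp()) / "html_save_test.txt"
save_path.write_text(sample_html, encoding="utf-8")
restored_html = save_path.read_text(encoding="utf-8")

print(cp1252_ok)                     # False
print(restored_html == sample_html)  # True
```

The string read back this way is an ordinary Python str, so it can be passed to BeautifulSoup(restored_html, 'html.parser') in the same way as the page source.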

Since you want to scrape the video title and the channel name from the video, here is the full code to do it:

from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup
import time

def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Chrome()
    browser.get(playlist_url)
    time.sleep(4)  # Pause for 4 seconds to give the page time to load
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser') # creates a beautiful soup object 'soup'.

    html_save_path = "D:\\bs4_html.txt"

    with open(html_save_path, 'wt', encoding='utf-8') as html_file:
        # prettify() returns a single string, so write it in one call
        html_file.write(soup.prettify())

    title = soup.find('yt-formatted-string', class_='style-scope ytd-video-primary-info-renderer').text
    channel_name = soup.find('a', class_='yt-simple-endpoint style-scope yt-formatted-string').text
    print(f"Video Title: {title}")
    print(f"Channel Name: {channel_name}")

test_html_save()

Output:

Video Title: Taylor Swift - Wildest Dreams
Channel Name: Taylor Swift
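To avoid reopening the browser on every run while the parsing code is still being worked on, one option is to read the saved file when it exists and fetch the page only when it does not. The following is a rough sketch under stated assumptions: `load_or_fetch_html` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the Selenium steps (open the browser, wait, read `page_source`) so that the example runs without a browser.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch_html(cache_path: Path, fetch: Callable[[], str]) -> str:
    """Return the cached HTML if the file exists; otherwise fetch and cache it."""
    if cache_path.exists():
        return cache_path.read_text(encoding="utf-8")
    html = fetch()
    cache_path.write_text(html, encoding="utf-8")
    return html


# Hypothetical stand-in for the Selenium code; it records each call so we
# can check how often the "browser" was actually used.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return "<html><body><p>\u200bcached page</p></body></html>"


cache_file = Path(tempfile.mkdtemp()) / "bs4_html.txt"
first = load_or_fetch_html(cache_file, fake_fetch)   # fetches and saves
second = load_or_fetch_html(cache_file, fake_fetch)  # reads the saved file

print(len(calls))       # 1
print(first == second)  # True
```

In real use, the fetch argument would be a function that wraps the browser code shown above and returns `browser.page_source`; deleting the cache file forces a fresh download.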

edited Oct 27, 2020 at 17:11

answered Oct 27, 2020 at 15:57

5 Comments

Tom Reid Over a year ago

How would I parse this with beautifulsoup after though? Because it turns the file into just numbers as far as I can see. Thank you for your help.

Sushil Over a year ago

What do you want to extract from the file?

Tom Reid Over a year ago

It is a YouTube playlist, so I want to extract the video title and the name of the YouTube channel that posted the video.

Sushil Over a year ago

Video title of all the songs in the playlist? Or only the first song?

Tom Reid Over a year ago

Thank you very much for your help, your solution works!
