I am trying to use BeautifulSoup and Selenium to scrape YouTube playlists. I would like to save the HTML of a webpage to a text file so that, while I get the BeautifulSoup part working, I do not need to keep rerunning the rest of the script to open the browser and fetch the HTML.

This is a shortened version of my code, which gives the error: "UnicodeEncodeError: 'charmap' codec can't encode character '\u200b' in position 0: character maps to <undefined>".
I know I could save the text file as UTF-8, but I am not sure how I would then convert it back to ASCII to parse it with BeautifulSoup.

My code:

from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup
def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Firefox()
    browser.get(playlist_url)
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser') # creates a beautiful soup object 'soup'.

    html_save_path = Path(__file__).parent / ".//html_save_test.txt"

    with open(html_save_path, 'wt') as html_file:
        for line in soup.prettify():
            html_file.write(line)

test_html_save()

My question is simply: how can I save the entire HTML of a webpage to a .txt file?

1 Answer

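Set the encoding parameter of open() to utf-8. Without it, open() falls back to the platform's default encoding, which on Windows is typically a legacy code page such as cp1252 (the 'charmap' codec named in the error) that cannot represent '\u200b'. Note also that soup.prettify() returns a single string, so looping over it writes one character at a time; writing it in one call does the same thing more directly:

with open(html_save_path, 'wt', encoding='utf-8') as html_file:
    html_file.write(soup.prettify())  # prettify() returns one str, so a single write is enough

There is no need to convert anything back to ASCII afterwards. Open the file with encoding='utf-8' when reading it, and pass the resulting string straight to BeautifulSoup.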
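To illustrate why the encoding argument matters, here is a minimal standard-library sketch. The sample HTML string and the temporary file are made up for the demonstration; cp1252 is assumed to be the default encoding behind the 'charmap' error, which is typical on Windows but depends on the machine.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for browser.page_source; it starts with the
# zero-width space (U+200B) that triggered the original error.
html_text = "\u200b<html><body><p>Wildest Dreams</p></body></html>"

# cp1252 has no mapping for U+200B, so encoding with it fails.
try:
    html_text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing and reading with an explicit UTF-8 encoding round-trips the text.
with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = Path(tmp_dir) / "html_save_test.txt"
    save_path.write_text(html_text, encoding="utf-8")
    restored = save_path.read_text(encoding="utf-8")

round_trip_ok = restored == html_text
print(cp1252_ok, round_trip_ok)
```

The string read back is an ordinary Python str, so it can be handed to BeautifulSoup(restored, 'html.parser') exactly like the live page source.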

Since you want to scrape the video title and the channel name from the video, here is the full code to do it:

from pathlib import Path
from selenium import webdriver
from bs4 import BeautifulSoup
import time

def test_html_save():
    playlist_url = 'https://www.youtube.com/watch?v=IdneKLhsWOQ&list=PLMEZyDHJojxNYSVgRCPt589DI5H7WT1ZK'
    browser = webdriver.Chrome()
    browser.get(playlist_url)
    time.sleep(4)  # Fixed 4-second pause to give the page time to render (it does not check that loading has finished)
    html_content = browser.page_source  # Getting the html from the webpage
    browser.close()
    soup = BeautifulSoup(html_content, 'html.parser') # creates a beautiful soup object 'soup'.

    html_save_path = "D:\\bs4_html.txt"

    # Save the prettified HTML as UTF-8 so characters such as '\u200b' can be written
    with open(html_save_path, 'wt', encoding='utf-8') as html_file:
        html_file.write(soup.prettify())  # prettify() returns one str

    title = soup.find('yt-formatted-string', class_ = 'style-scope ytd-video-primary-info-renderer').text
    channel_name = soup.find('a', class_ = 'yt-simple-endpoint style-scope yt-formatted-string').text
    print(f"Video Title: {title}")
    print(f"Channel Name: {channel_name}")

test_html_save()

Output:

Video Title: Taylor Swift - Wildest Dreams
Channel Name: Taylor Swift
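Once the HTML is saved, the parsing step can be developed without opening the browser at all. The sketch below is only an illustration: the saved_html string stands in for the contents of the saved file, the helper text_or_none is a hypothetical name, and the class names are the ones used above, which YouTube may change at any time. Because find() returns None when nothing matches, the helper checks for that instead of calling .text directly.

```python
from bs4 import BeautifulSoup

# Stand-in for the text read back from the saved file (made up for this example).
saved_html = """<html><body>
<yt-formatted-string class="style-scope ytd-video-primary-info-renderer">Example Title</yt-formatted-string>
</body></html>"""


def text_or_none(soup, tag, css_class):
    """Return the stripped text of the first matching element, or None if absent."""
    node = soup.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(saved_html, "html.parser")
title = text_or_none(soup, "yt-formatted-string",
                     "style-scope ytd-video-primary-info-renderer")
missing = text_or_none(soup, "a", "no-such-class")

print(f"Video Title: {title}")
print(f"Missing element: {missing}")
```

In a real run, saved_html would come from open(html_save_path, encoding='utf-8').read() rather than a literal string.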

5 Comments

How would I parse this with beautifulsoup after though? Because it turns the file into just numbers as far as I can see. Thank you for your help.
What do you want to extract from the file?
It is a youtube playlist, so I want to extract the video title and the name of the youtube channel which posted the video
Video title of all the songs in the playlist? Or only the first song?
thank you very much for your help, your solution works!
