How to read the source HTML code from a locally saved HTML file using Python?

Question

I'm new to HTML and Beautiful Soup. I am trying to read a locally saved HTML file in Python, and I tested the following code:

with open(file_path) as fp:
    soup = BeautifulSoup(fp)

print(soup)

The output looks weird and here is a part of it:

<html><body><p>ÿþh t m l &gt; 
 
 
 
 h e a d &gt; 
 
 m e t a   h t t p - e q u i v = C o n t e n t - T y p e   c o n t e n t = " t e x t / h t m l ;   c h a r s e t = u n i c o d e " &gt; 
 
 m e t a   n a m e = G e n e r a t o r   c o n t e n t = " M i c r o s o f t   W o r d   1 5   ( f i l t e r e d ) " &gt; 
 
 s t y l e &gt; 
 
 ! - - 
 
   / *   F o n t   D e f i n i t i o n s   * /

The original HTML code is something like

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;

Can anyone help me or share some thoughts?

Thank you!

Jay Patel · Accepted Answer · 2021-04-30 14:30:53Z

First of all, let's discuss why you are not getting the desired output. When the file is parsed with BeautifulSoup, the result can contain stray spaces, symbols, etc. that are not in your original code. The solutions for this scenario are stated below:-

Needed Solution:- Use soup.prettify()
Appropriate Solution:- Use HTML Parser and soup.prettify() together

To learn more about HTML Parser and soup.prettify():- Click Here

Approach 1 (By using `soup.prettify()` in your Current `Code`):-

# Import 'BeautifulSoup'
from bs4 import BeautifulSoup

# File Path of 'HTML' File
file_path = 'demo.html'

# Fetch 'HTML' Code Using 'BeautifulSoup'
with open(file_path) as fp:
    soup = BeautifulSoup(fp)

# Print 'HTML' Code using 'prettify' Format
print(soup.prettify())

# Output of above cell:-
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Approach 2 (By using `HTML Parser` and `soup.prettify()`):-

# Import all-important Libraries
from bs4 import BeautifulSoup
import html5lib  # not called directly; confirms the 'html5lib' parser is installed

# Open Our 'HTML' File and Parse it to 'HTML' Format
# ('with' closes the file once parsing is done)
with open('demo.html', 'r') as html_page:
    soup = BeautifulSoup(html_page, "html5lib")

# Print Scraped 'HTML' Code
print(soup.prettify())

# Output of above cell:-
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>
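
As a side note, the `ÿþ` at the start of the output in the question, followed by a gap after every character, suggests that the file is saved as UTF-16 (which Word tends to label `charset=unicode`) and is being decoded with a single-byte codec. If that is the case, `prettify()` alone will not change the text. Below is a minimal sketch, assuming the file begins with a byte-order mark; `read_html_text` is a hypothetical helper name, not part of Beautiful Soup, and it uses only the standard library.

```python
import codecs
import tempfile
from pathlib import Path


def read_html_text(path, fallback="utf-8"):
    """Return the file's text, picking the codec from a leading byte-order mark.

    Hypothetical helper for illustration; UTF-32 is not handled in this sketch.
    """
    raw = Path(path).read_bytes()
    # A UTF-16 little-endian file starts with the bytes FF FE, which show up
    # as "ÿþ" when decoded as Latin-1 / cp1252.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")      # the utf-16 codec consumes the BOM
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")   # strips the UTF-8 BOM
    return raw.decode(fallback, errors="replace")


# Demo: write a small UTF-16 file (with BOM) and read it back.
sample = '<html><head><meta name=Generator content="Microsoft Word 15 (filtered)"></head></html>'
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.html"
    demo_path.write_text(sample, encoding="utf-16")
    text = read_html_text(demo_path)

print(text)
```

The returned string can then be passed on as `BeautifulSoup(text, "html.parser")`. When the encoding is known in advance, `open(file_path, encoding="utf-16")` is the shorter route.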

Hope this Solution helps you.

Thank you for providing two solutions! They're super helpful.

Victor Moraes · Accepted Answer · 2021-04-30 14:05:55Z

Try print(soup.prettify()). The prettify method is helpful and displays the formatted HTML content.

According to the documentation:

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

Source: Beautiful Soup Documentation
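
A small sketch of what that looks like in practice, assuming Beautiful Soup is installed; the `markup` string is a made-up stand-in for the contents of a saved file, and `html.parser` is the parser bundled with Python:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup standing in for the contents of a saved file.
markup = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup)        # the parsed markup, still on a single line
pretty = soup.prettify()   # indented, spread over several lines

print(compact)
print(pretty)
```

Note that `prettify()` only changes the layout of the output; it does not alter how the file's bytes were decoded.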
