1

I'm new to HTML and beautiful soup. I am trying to read a locally saved HTML file in Python and I tested the following code:

with open(file_path) as fp:
    soup = BeautifulSoup(fp)

print(soup)

The output looks weird and here is a part of it:

<html><body><p>ÿþh t m l &gt; 
 
 
 
 h e a d &gt; 
 
 m e t a   h t t p - e q u i v = C o n t e n t - T y p e   c o n t e n t = " t e x t / h t m l ;   c h a r s e t = u n i c o d e " &gt; 
 
 m e t a   n a m e = G e n e r a t o r   c o n t e n t = " M i c r o s o f t   W o r d   1 5   ( f i l t e r e d ) " &gt; 
 
 s t y l e &gt; 
 
 ! - - 
 
   / *   F o n t   D e f i n i t i o n s   * /

The original HTML code is something like

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=unicode">
<meta name=Generator content="Microsoft Word 15 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;

Can anyone help me or share some thoughts?

Thank you!

2 Answers 2

1

First of all, let's discuss why you are not able to fetch desired Output. It is because when you parsing data in BeautifulSoup. There might be some Spaces, Symbols, etc. presented in your Code. So, the appropriate Solution for this scenario was stated below:-

  • Needed Solution:- Use soup.prettify()
  • Appropriate Solution:- Use HTML Parser and soup.prettify() together

To Learn more about HTML Parser and soup.prettify:- Click Here


Approach 1 (By using soup.prettify() in your Current Code):-

# File Path of 'HTML' File
file_path = 'demo.html'

# Fetch 'HTML' Code Using 'BeautifulSoup'
with open(file_path) as fp:
    soup = BeautifulSoup(fp)

# Print 'HTML' Code using 'prettify' Format
print(soup.prettify())
# Output of above cell:-
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Approach 2 (By using HTML Parser and soup.prettify()):-

# Import all-important Libraries
from bs4 import BeautifulSoup
import html5lib

# Open Our 'HTML' File
html_page = open('demo.html', 'r')

# Parse it to 'HTML' Format
soup = BeautifulSoup(html_page, "html5lib")

# Print Scraped 'HTML' Code
print(soup.prettify())
# Output of above cell:-
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Microsoft Word 15 (filtered)" name="Generator"/>
  <style>
   <!--
 /* Font Definitions */
 @font-face
    {font-family:"Cambria Math";
    panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
    {font-family:Calibri;
  </style>
 </head>
</html>

Hope this Solution helps you.

Sign up to request clarification or add additional context in comments.

1 Comment

Thank you for providing two solutions! They're super helpful.
1

Try print(soup.prettify()). The prettify method is helpful and displays the formatted HTML content.

According to the documentation:

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each tag and each string:

Source: Beautiful Soup Documentation

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.