Is there a way to parse HTML in Python but preserve the source formatting?

Question

I want to iterate through an input file and produce an output file that is a byte-for-byte match of the input file, except for some new elements inserted in certain places.

I looked at the HTMLParser docs but I don't see any options to preserve formatting.

You want the resulting string to match the source string except for your insertion of new elements. I don't think that is possible. We use lxml a lot, including its tostring method, but if there is any bad HTML in the source, tostring attempts to fix it. Have you tried using tostring? Maybe your HTML is clean enough?
– PyNEwbie, Commented Dec 30, 2015 at 20:13
I want this to make it easy for my team to see diffs between input and output code. This is definitely possible; I just don't know which library (if any) can do this for HTML code. srcML does this for C code, and it's just a matter of the parser keeping the raw input content for each node in the parse tree.
– Jason S, Commented Dec 30, 2015 at 21:05

Jason S · Accepted Answer · 2016-01-04 18:00:49Z

Looks like I can use HTMLParser, except for a few fringe issues (bogus comments and nonstandard end tags), by subclassing the following class and overriding its onStartTag() and onEndTag() methods.

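try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class VerbatimParser(HTMLParser):
    def __init__(self, out):
        # Python 2's HTMLParser is an old-style class, so super() can't be used
        HTMLParser.__init__(self)
        # Python 3 only: report entity and character references as written
        # instead of converting them to text (harmless on Python 2)
        self.convert_charrefs = False
        self.out = out
        self.tagstack = []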
    def emit(self, text):
        self.out.write(text)
    def handle_starttag(self, tag, attrs):
        self.tagstack.append(tag)
        self.emit(self.get_starttag_text())
        self.onStartTag(tag, attrs)
    def onStartTag(self, tag, attrs):
        pass
    def onEndTag(self, tag):
        pass
    def handle_endtag(self, tag):
        self.onEndTag(tag)
        # pop last occurrence of tag, along with any more recent tags
        try:
            k = self.tagstack[::-1].index(tag)
            del self.tagstack[-k-1:]
        except ValueError:
            pass
        self.emit('</')
        self.emit(tag)
        self.emit('>')
    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())
    def handle_data(self, data):
        self.emit(data)
    def handle_entityref(self, name):
        self.emit('&')
        self.emit(name)
        self.emit(';')
    def handle_charref(self, name):
        self.emit('&#')
        self.emit(name)
        self.emit(';')
    def handle_comment(self, data):
        self.emit('<!--')
        self.emit(data)
        self.emit('-->')
    def handle_decl(self, decl):
        self.emit('<!')
        self.emit(decl)
        self.emit('>')
    def handle_pi(self, data):
        self.emit('<?')
        self.emit(data)
        self.emit('>')
    def unknown_decl(self, data):
        self.emit('<![')
        self.emit(data)
        self.emit(']>')

def doit(infile, outfile):
    with open(outfile,'w') as fout:
        parser = VerbatimParser(fout)
        with open(infile) as f:
            parser.feed(f.read())
            parser.close()
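
The nonstandard end tag issue comes from handle_endtag() rebuilding the tag text instead of copying it. One possible way around that, sketched here for Python 3's html.parser and not tested against large real-world pages, is to use the parser only to find insertion offsets and then splice the new text into the untouched source string. The names OffsetFinder, locate and insert_after_start_tags are made up for this illustration; getpos() and get_starttag_text() are standard HTMLParser methods.

```python
from html.parser import HTMLParser


class OffsetFinder(HTMLParser):
    """Collect the offset just past each start tag named `target`."""

    def __init__(self, target):
        # Only tag positions are recorded, so no text conversion is needed
        super().__init__(convert_charrefs=False)
        self.target = target
        self.offsets = []
        self.line_starts = [0]

    def locate(self, source):
        # Absolute offset of the first character of every line
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.feed(source)
        self.close()
        return self.offsets

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            line, col = self.getpos()  # 1-based line, 0-based column of '<'
            start = self.line_starts[line - 1] + col
            self.offsets.append(start + len(self.get_starttag_text()))


def insert_after_start_tags(source, target, snippet):
    """Return `source` with `snippet` spliced in after every `target` start tag."""
    pieces = []
    last = 0
    for offset in OffsetFinder(target).locate(source):
        pieces.append(source[last:offset])
        pieces.append(snippet)
        last = offset
    pieces.append(source[last:])
    return "".join(pieces)


source = "<html>\n<BODY  class='x' >\n<p>Hi &amp; bye</P >\n</body>\n</html>\n"
snippet = "<!-- inserted -->"
result = insert_after_start_tags(source, "body", snippet)
```

Because the output is built from slices of the original string, everything outside the inserted snippet (odd spacing, tag case, the nonstandard end tag) stays exactly as it was in the input.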

alecxe · Answer · 2015-12-30 20:15:35Z

1

If you use BeautifulSoup and specify formatter=None, it should leave the source formatting as it was initially. Sample:

from bs4 import BeautifulSoup

my_document = """
<html>
<body>

    <h1>Some Heading</h1>

    <div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
    </div>

    <div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
    </div>

    <div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
    </div>

    <p id="loner">A paragraph.</p>

</body>
</html>
"""

soup = BeautifulSoup(my_document, "html.parser")

# removing a node
soup.find("div", id="second").extract()

modified_source = soup.encode(formatter=None)

I still think that it would attempt to fix the HTML during parsing, but see if this solution is good enough for your use case. Hope it helps.


3 Comments

Jason S Over a year ago

I'll give it a try -- thanks! (have patience if I don't accept this answer right away)

Jason S Over a year ago

Hmm. It doesn't work, sorry. formatter=None removes the spaces present in the original HTML. (either that or they're removed in the original parsing.)

alecxe Over a year ago

@JasonS yeah, I thought so, thanks for checking. What if you applied the parser to both documents you are comparing in a diff?
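
The idea in that last comment could be sketched roughly like this: pass both documents through the same normalizer and diff the normalized text, so that any formatting changes the parser introduces show up on both sides and cancel out. The function name normalized_diff and the stand-in normalizer strip_trailing are hypothetical; in practice the normalizer would be a function that parses the document and serializes it again.

```python
import difflib


def normalized_diff(before, after, normalize):
    """Diff two documents after passing both through the same normalizer."""
    a = normalize(before).splitlines(keepends=True)
    b = normalize(after).splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="before", tofile="after"))


def strip_trailing(text):
    # Stand-in normalizer for illustration: drop trailing whitespace per line
    return "".join(line.rstrip() + "\n" for line in text.splitlines())


before = "<p>A</p>   \n<p>B</p>\n"
after = "<p>A</p>\n<p>B</p>\n<p>C</p>\n"
diff_text = normalized_diff(before, after, strip_trailing)
print(diff_text)
```

Here the trailing spaces after the first paragraph do not appear in the diff, and only the added line is reported.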

Unknown Soldier · Answer · 2016-03-20 09:59:24Z

0

I would use EHP for that. It builds a DOM-like object from the HTML document, which you can then use to insert, delete, or search for specific entities. Once you have changed the DOM, you just serialize it.

https://github.com/iogf/ehp

Check out this example.

from ehp import *

data  = ''' <body><em> foo  </em></body>'''
dom  = Html().feed(data)

for ind in dom.find('em'):
    x = Tag('font', {'color':'red'})
    ind.append(x)

print(dom)

Output:

<body ><em > foo  <font color="red" ></font></em></body>

