1

Is there a way to parse HTML in Python but preserve the source formatting?

I want to iterate through an input file and produce an output file that is a byte-for-byte match of the input file, except for some new elements inserted in certain places.

I looked at the HTMLParser docs but I don't see any options to preserve formatting.

2
  • You want the resulting string to match the source string except for your insertion of new elements. I don't think that is possible. We use lxml a lot, including its tostring method, but if there is any bad HTML in the source, tostring attempts to fix it. Have you tried using tostring? Maybe your HTML is clean enough? Commented Dec 30, 2015 at 20:13
  • I want this so that my team can easily see diffs between input and output code. This is definitely possible; I just don't know which library (if any) can do this for HTML. srcML does this for C code, and it's just a matter of the parser keeping the raw input content for each node in the parse tree. Commented Dec 30, 2015 at 21:05
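
The idea in the last comment, keeping track of where each node sits in the raw input, can be sketched with the standard library alone. This is a minimal sketch assuming Python 3's html.parser; the names OffsetRecorder and insert_after_start_tags are hypothetical, not part of any library. It records the source offset just past each matching start tag and splices the new text into the original string, so everything else is left exactly as it was.

```python
from html.parser import HTMLParser  # Python 3 stdlib (assumption: not the Python 2 module)


class OffsetRecorder(HTMLParser):
    """Hypothetical helper: notes where each wanted start tag ends in the source."""

    def __init__(self, text, wanted):
        super().__init__(convert_charrefs=False)
        self.wanted = wanted.lower()  # the parser reports tag names lowercased
        # Offset of the first character of every line; getpos() counts '\n' only.
        self.line_starts = [0]
        for i, ch in enumerate(text):
            if ch == '\n':
                self.line_starts.append(i + 1)
        self.offsets = []
        self.feed(text)
        self.close()

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            lineno, col = self.getpos()  # position of the tag's '<'
            start = self.line_starts[lineno - 1] + col
            self.offsets.append(start + len(self.get_starttag_text()))


def insert_after_start_tags(text, tag, snippet):
    """Return text with snippet spliced in after every <tag ...> start tag."""
    offsets = OffsetRecorder(text, tag).offsets
    # Splice from the end so earlier offsets stay valid.
    for pos in reversed(offsets):
        text = text[:pos] + snippet + text[pos:]
    return text
```

Because the untouched parts are copied from the original string rather than re-serialized, odd spacing and nonstandard end tags survive as written. The sketch works on decoded text; a byte-for-byte result additionally assumes the file is decoded and re-encoded with the same encoding and newline handling.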

3 Answers

2

Looks like I can use HTMLParser (the Python 2 module), apart from a few fringe issues (bogus comments and nonstandard end tags), by subclassing the class below and overriding its onStartTag() and onEndTag() methods.

from HTMLParser import HTMLParser

class VerbatimParser(HTMLParser):
    def __init__(self, out):
        # HTMLParser is an old-style class in Python 2, so super() can't be used here
        HTMLParser.__init__(self)
        self.out = out
        self.tagstack = []
    def emit(self, text):
        self.out.write(text)
    def handle_starttag(self, tag, attrs):
        self.tagstack.append(tag)
        self.emit(self.get_starttag_text())
        self.onStartTag(tag, attrs)
    def onStartTag(self, tag, attrs):
        pass
    def onEndTag(self, tag):
        pass
    def handle_endtag(self, tag):
        self.onEndTag(tag)
        # pop last occurrence of tag, along with any more recent tags
        try:
            k = self.tagstack[::-1].index(tag)
            del self.tagstack[-k-1:]
        except ValueError:
            pass
        self.emit('</')
        self.emit(tag)
        self.emit('>')
    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())
    def handle_data(self, data):
        self.emit(data)
    def handle_entityref(self, name):
        self.emit('&')
        self.emit(name)
        self.emit(';')
    def handle_charref(self, name):
        self.emit('&#')
        self.emit(name)
        self.emit(';')
    def handle_comment(self, data):
        self.emit('<!--')
        self.emit(data)
        self.emit('-->')
    def handle_decl(self, decl):
        self.emit('<!')
        self.emit(decl)
        self.emit('>')
    def handle_pi(self, data):
        self.emit('<?')
        self.emit(data)
        self.emit('>')
    def unknown_decl(self, data):
        self.emit('<![')
        self.emit(data)
        self.emit(']>')

def doit(infile, outfile):
    with open(outfile,'w') as fout:
        parser = VerbatimParser(fout)
        with open(infile) as f:
            parser.feed(f.read())
            parser.close()
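
For readers on Python 3, the same approach might look roughly like the sketch below. It is an adaptation under stated assumptions, not the code above: the module is html.parser, convert_charrefs=False is passed so that entity and character references still arrive as separate events, the tag stack is left out, and the names VerbatimParser3, AnchorInserter, and rewrite are hypothetical. The same fringe issues apply (for example, end tags are re-emitted in normalized form).

```python
import io
from html.parser import HTMLParser


class VerbatimParser3(HTMLParser):
    """Hypothetical Python 3 port: re-emit every token as it is parsed."""

    def __init__(self, out):
        # Without convert_charrefs=False, "&amp;" would arrive already decoded.
        super().__init__(convert_charrefs=False)
        self.out = out

    def emit(self, text):
        self.out.write(text)

    # Hooks for subclasses; they do nothing by default.
    def on_start_tag(self, tag, attrs):
        pass

    def on_end_tag(self, tag):
        pass

    def handle_starttag(self, tag, attrs):
        self.emit(self.get_starttag_text())  # raw source text of the tag
        self.on_start_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.on_end_tag(tag)
        self.emit('</' + tag + '>')

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit('&' + name + ';')

    def handle_charref(self, name):
        self.emit('&#' + name + ';')

    def handle_comment(self, data):
        self.emit('<!--' + data + '-->')

    def handle_decl(self, decl):
        self.emit('<!' + decl + '>')

    def handle_pi(self, data):
        self.emit('<?' + data + '>')

    def unknown_decl(self, data):
        self.emit('<![' + data + ']>')


class AnchorInserter(VerbatimParser3):
    """Example subclass: adds an anchor right after every <h1> start tag."""

    def on_start_tag(self, tag, attrs):
        if tag == 'h1':
            self.emit('<a name="top"></a>')


def rewrite(text, parser_cls=VerbatimParser3):
    """Run text through parser_cls and return what it emitted."""
    out = io.StringIO()
    parser = parser_cls(out)
    parser.feed(text)
    parser.close()
    return out.getvalue()
```

Calling rewrite(text) with the base class is a quick way to check whether a given document round-trips unchanged before relying on a subclass to insert anything.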

1

If you use BeautifulSoup and specify formatter=None, it should leave the source formatting as it was. Sample:

from bs4 import BeautifulSoup

my_document = """
<html>
<body>

    <h1>Some Heading</h1>

    <div id="first">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <p>A paragraph.</p>
    </div>

    <div id="second">
    <p>A paragraph.</p>
    <p>A paragraph.</p>
    </div>

    <div id="third">
    <p>A paragraph.</p>
    <a href="another_doc.html">A link</a>
    <a href="yet_another_doc.html">A link</a>
    </div>

    <p id="loner">A paragraph.</p>

</body>
</html>
"""

soup = BeautifulSoup(my_document, "html.parser")

# removing a node
soup.find("div", id="second").extract()

modified_source = soup.encode(formatter=None)

I still think it will attempt to fix the HTML during parsing, but see whether this is good enough for your use case. Hope it helps.

3 Comments

I'll give it a try -- thanks! (have patience if I don't accept this answer right away)
Hmm. It doesn't work, sorry. formatter=None removes the spaces present in the original HTML (either that, or they're removed during the original parsing).
@JasonS yeah, I thought so, thanks for checking. What if you applied the parser to both of the documents you are comparing in a diff?
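
The workaround in the last comment, normalizing both documents the same way before diffing, could be sketched as follows. This is a minimal sketch using only the standard library's difflib; normalized_diff and strip_trailing are hypothetical names, and strip_trailing is only a stand-in for whatever parse-and-serialize step is actually used (for example, the BeautifulSoup round trip above).

```python
import difflib


def normalized_diff(original, modified, normalize):
    """Diff two documents after passing both through the same normalizer."""
    a = normalize(original).splitlines(keepends=True)
    b = normalize(modified).splitlines(keepends=True)
    return list(difflib.unified_diff(a, b, fromfile='input', tofile='output'))


def strip_trailing(text):
    """Stand-in normalizer (assumption): drops trailing whitespace on each line."""
    return '\n'.join(line.rstrip() for line in text.splitlines()) + '\n'
```

With both sides normalized identically, formatting changes introduced by the serializer cancel out and only the real insertions and deletions show up in the diff. The trade-off is that the diff no longer reflects the files as they are stored on disk.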
0

I would use EHP for that. It builds a DOM-like object from the HTML document, which you can then use to insert, delete, or search for specific elements. Once you have changed the DOM, you just serialize it.

https://github.com/iogf/ehp

Check out this example.

from ehp import *

data  = ''' <body><em> foo  </em></body>'''
dom  = Html().feed(data)

for ind in dom.find('em'):
    x = Tag('font', {'color':'red'})
    ind.append(x)

print dom

Output:

<body ><em > foo  <font color="red" ></font></em></body>
