I need to filter a rather long (but very regular) set of .html files to modify a few constructs only if they appear in text elements.
One good example is changing <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>.
I can easily parse my files with html.parser, but it's unclear how to generate the result file, which should be as similar to the input as possible (no reformatting).
I had a look at beautiful-soup, but it really seems too big for this (supposedly?) simple task.
Note: I do not need/want to serve the .html files to a browser of any kind; I just need them updated (possibly in-place) with (slightly) changed content.
UPDATE:
Following @soundstripe's advice I wrote the following code:
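
    import bs4
    from re import sub

    def handle_html(html):
        sp = bs4.BeautifulSoup(html, features='html.parser')
        # Copy the generator into a list first: the tree is modified while iterating.
        for e in list(sp.strings):
            # Wrap each "quoted" run in typographic quote entities.
            s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
            if s != e:
                e.replace_with(s)
        # Serialize the tree back to bytes (UTF-8 by default).
        return str(sp).encode()

    raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
    new = handle_html(raw)
    print(raw)
    print(new)
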
Unfortunately BeautifulSoup tries to be too smart for its (and my) own good:

    b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
    b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

i.e. it transforms the plain & into &amp;, thus breaking the &ldquo; entity, and it also appends closing tags for the unclosed <div> elements (note that I'm working with bytes, not strings; is that relevant?).
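The doubled escaping can be reproduced without BeautifulSoup at all. The snippet below is only an illustration, under the assumption that the serializer escapes text nodes roughly the way the standard library's html.escape does; it is not a statement about bs4 internals.

```python
from html import escape, unescape

# The text node after my substitution: it already contains entity markup.
replaced = 'hard to &ldquo;find&rdquo; his'

# A serializer that escapes text treats the "&" as literal data
# and escapes it once more.
serialized = escape(replaced, quote=False)
print(serialized)            # hard to &amp;ldquo;find&amp;rdquo; his

# A browser would therefore display the entity name, not a curly quote.
print(unescape(serialized))  # hard to &ldquo;find&rdquo; his
```

If that assumption holds, the replacement text would have to reach the output without passing through the escaping step, or contain the literal “ and ” characters instead of entities.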
How can I fix this?
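For comparison, here is a minimal sketch of the plain html.parser route, which sidesteps serialization by echoing every piece of the input back as it is parsed. The names QuoteFilter and filter_html are hypothetical, and the sketch makes several assumptions: input is already decoded to str, end tags may be rewritten in lowercase without inner whitespace, entities written without a trailing semicolon gain one, and a quoted run that is interrupted by a tag or an entity is left alone.

```python
import re
from html.parser import HTMLParser


class QuoteFilter(HTMLParser):
    """Echo the input, rewriting "..." only inside text (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False reports entities as separate events,
        # so they can be written back the way they appeared.
        super().__init__(convert_charrefs=False)
        self.out = []
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the source.
        self.out.append(self.get_starttag_text())
        if tag in ("script", "style"):
            self._skip = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            data = re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def filter_html(text):
    parser = QuoteFilter()
    parser.feed(text)
    parser.close()  # flush anything still buffered
    return "".join(parser.out)


raw = '<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
print(filter_html(raw))
```

Because nothing is re-serialized from a tree, the unclosed <div> stays unclosed and the quotes inside class="speech" are not touched. For bytes input, something like filter_html(raw.decode()).encode() would be needed, assuming the files are UTF-8.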