Simple .html filter in python - modify text elements only

Question

I need to filter a rather long (but very regular) set of .html files to modify a few constructs only if they appear in text elements.

One good example is to change <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>.

I can easily parse my files with html.parser, but it's unclear how to generate the result file, which should be as similar to the input as possible (no reformatting).

I had a look at beautiful-soup, but it really seems too big for this (supposedly?) simple task.

Note: I do not need/want to serve .html files to a browser of any kind; I just need them updated (possibly in-place) with (slightly) changed content.

UPDATE:

Following @soundstripe's advice, I wrote the following code:

import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)

Unfortunately BeautifulSoup tries to be too smart for its (and my) own good:

b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

i.e. it transforms the plain & into &amp;, thus breaking the &ldquo; entity (notice I'm working with bytearrays, not strings. Is it relevant?).

How can I fix this?
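
A note on the mechanism, as a minimal sketch: the output above is consistent with the replacement string being treated as literal text, so its & is escaped when the tree is written out. The standard-library html module shows the same effect without BeautifulSoup. This only illustrates the escaping, it is not a description of bs4 internals.

```python
import html

replacement = '&ldquo;good&rdquo;'

# Treated as literal text, the ampersand itself is escaped on output.
as_text = html.escape(replacement, quote=False)
print(as_text)   # &amp;ldquo;good&amp;rdquo;

# Decoding the entity first yields the actual quote characters.
as_chars = html.unescape(replacement)
print(as_chars)  # “good”
```

This suggests that inserting the real characters (U+201C and U+201D) rather than entity text avoids the double escaping.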

@Code_Ninja: at first glance it looks even more use-cannon-to-swat-a-fly than beautiful-soup. Did I miss something?
– ZioByte, Commented May 7, 2019 at 13:37
Haha, don't be scared of the API. Selenium WebDriver gives you more features than beautiful-soup, as its main aim is to track and automate changes on a website at the front-end level.
– Code_Ninja, Commented May 7, 2019 at 13:43

soundstripe · Accepted Answer · 2019-05-07 16:50:52Z

I don't know why you wouldn't use BeautifulSoup. Here's an example that replaces your quotes like you're asking.

import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

def replace_quotes(s):
    # wrap each double-quoted run in &ldquo;...&rdquo;
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)


for e in list(soup.strings):
    # wrapping the new string in BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e))
    e.replace_with(new_string)

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')


print(raw)
print(new)
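
For comparison, the html.parser route mentioned in the question can be sketched as a pass-through filter: every event is echoed back as markup and only text nodes go through a rewrite function. TextOnlyFilter, filter_text and curl_quotes are hypothetical names introduced for this illustration. It is a sketch under assumptions, not a complete serializer: end tags are regenerated in lower case, CDATA sections and other unusual declarations are not handled, and text is split at every & so a quote pair that spans an entity is not matched.

```python
import re
from html.parser import HTMLParser


class TextOnlyFilter(HTMLParser):
    """Echo markup as it arrives; pass only text nodes through `rewrite`."""

    RAW_TEXT_TAGS = ('script', 'style')  # content here must not be rewritten

    def __init__(self, rewrite):
        # convert_charrefs=False reports entities as separate events,
        # so they can be written back unchanged
        super().__init__(convert_charrefs=False)
        self.rewrite = rewrite
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # tag text as it appeared
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f'</{tag}>')
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw = False

    def handle_data(self, data):
        self.out.append(data if self.in_raw else self.rewrite(data))

    def handle_entityref(self, name):
        self.out.append(f'&{name};')

    def handle_charref(self, name):
        self.out.append(f'&#{name};')

    def handle_comment(self, data):
        self.out.append(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.out.append(f'<!{decl}>')

    def handle_pi(self, data):
        self.out.append(f'<?{data}>')


def filter_text(markup, rewrite):
    parser = TextOnlyFilter(rewrite)
    parser.feed(markup)
    parser.close()
    return ''.join(parser.out)


def curl_quotes(text):
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', text)


raw = '''<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p>'''
print(filter_text(raw, curl_quotes))
```

On the sample line this leaves the class="speech" attribute and the unbalanced <div> untouched and changes only "good", because no tree is built and nothing is repaired or closed.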

edited May 7, 2019 at 16:50

answered May 7, 2019 at 14:19

soundstripe

5 Comments

ZioByte Over a year ago

Please see updated Question; we are almost there... but not quite.

soundstripe Over a year ago

Normally you'd need to open another post (and do a little research first) to ask a different question but I'm feeling nice today :P

soundstripe Over a year ago

I'll point out this re.sub() pattern will only work on matched pairs of quotes in a single HTML string. You probably want something more like what Word does for smart quotes-- if the quote is followed by a letter, it should be a left quote. If it is followed by a space or punctuation, it should be a right quote.
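
A minimal sketch of that heuristic, assuming "followed by a letter" can be widened to "followed by a word character" (letters, digits, underscore) and that literal Unicode quotes are acceptable in the output. smart_quotes is a hypothetical helper name.

```python
import re


def smart_quotes(text):
    # A straight quote directly followed by a word character opens a quotation.
    opened = re.sub(r'"(?=\w)', '\u201c', text)
    # Every remaining straight quote (before a space, punctuation,
    # or the end of the string) is treated as closing.
    return opened.replace('"', '\u201d')


print(smart_quotes('it\'s hard to find his "good" side!'))
print(smart_quotes('Non ho fatto a tempo nemmeno a dirti "attento!"'))
```

Unlike the paired pattern, this handles a quotation whose two halves fall in different text nodes, but it guesses wrong when an opening quote is immediately followed by punctuation such as an ellipsis.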

ZioByte Over a year ago

The cure is worse than the illness. Using bs4.BeautifulSoup(replace_quotes(e)) wraps the string in <html><body><p>...</p></body></html>; the outer <html><body>...</body></html> are removed by replace_with, but <p>...</p> remains and wreaks havoc with the formatting. I will accept your answer because I made it work by simply replacing &ldquo; with “, as I don't really care about HTML entities in the final product. Thanks.

ZioByte Over a year ago

I am aware of the limitations of re.sub(...), but I still think this is the best option; one of my actual snippets reads:

<br/>« <span class="speech">Ho visto io quello che è successo. La ruota è passata su un pietrone che la ha spostata di un palmo. Un palmo più in là c’era il vuoto. Non ho fatto a tempo nemmeno a dirti "attento!"</span> »<br/>« <span class="speech">Andiamo a vedere?</span> »<br/>

It's not very obvious what to do with the "closing" quote, as it's surrounded by punctuation (and this is only the very first hit). Any insight welcome.
