
I need to filter a rather long (but very regular) set of .html files to modify a few constructs only if they appear in text elements.

One good example is to change <p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>.

I can easily parse my files with html.parser, but it's unclear how to generate the result file, which should be as similar to the input as possible (no reformatting).

I had a look at beautiful-soup, but it really seems too big for this (supposedly?) simple task.

Note: I do not need/want to serve .html files to a browser of any kind; I just need them updated (possibly in-place) with (slightly) changed content.

UPDATE:

Following @soundstripe's advice I wrote the following code:

import bs4
from re import sub

def handle_html(html):
    sp = bs4.BeautifulSoup(html, features='html.parser')
    for e in list(sp.strings):
        s = sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', e)
        if s != e:
            e.replace_with(s)
    return str(sp).encode()

raw = b"""<p><div class="speech">it's hard to "find" his "good" side! He has <i>none</i>!<div></p>"""
new = handle_html(raw)
print(raw)
print(new)

Unfortunately BeautifulSoup tries to be too smart for its (and my) own good:

b'<p><div class="speech">it\'s hard to "find" his "good" side! He has <i>none</i>!<div></p>'
b'<p><div class="speech">it\'s hard to &amp;ldquo;find&amp;rdquo; his &amp;ldquo;good&amp;rdquo; side! He has <i>none</i>!<div></div></div></p>'

i.e. it transforms the plain & into &amp;, thus breaking the &ldquo; entity. (Notice I'm working with bytes, not strings. Is that relevant?)

How can I fix this?

  • you can use selenium webdriver for that Commented May 7, 2019 at 13:29
  • @Code_Ninja: at first glance it looks even more use-cannon-to-swat-a-fly than beautiful-soup. Did I miss something? Commented May 7, 2019 at 13:37
  • haha, don't be scared of the API, selenium webdriver gives you more features than beautiful-soup, as its main purpose is to track and automate changes on a website at the front-end level. Commented May 7, 2019 at 13:43

1 Answer


I don't know why you wouldn't use BeautifulSoup. Here's an example that replaces your quotes like you're asking.

import re
import bs4

raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p> to <p><div class="speech">it's hard to find his &ldquo;good&rdquo; side! He has <i>none</i>!<div></p>"""
soup = bs4.BeautifulSoup(raw, features='html.parser')

def replace_quotes(s):
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', s)


for e in list(soup.strings):
    # wrapping the new string in BeautifulSoup() call to correctly parse entities
    new_string = bs4.BeautifulSoup(replace_quotes(e))
    e.replace_with(new_string)

# use the soup.encode() formatter keyword to specify you want html entities in your output
new = soup.encode(formatter='html')


print(raw)
print(new)
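
A possible variation, offered as a sketch rather than a definitive fix: the `&amp;ldquo;` output appears when the entity is inserted as plain text, because the serializer escapes the `&` on the way out. Inserting the actual Unicode quote characters and leaving entity generation to `formatter='html'` should avoid both the double escaping and the second parse. The function name `curl_quotes` is made up for this example, and BeautifulSoup will still repair the unclosed `<div>` in its output.

```python
import re
import bs4

# Same pairing rule as above: text between two straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def curl_quotes(html: bytes) -> bytes:
    """Replace paired straight quotes in text nodes with curly quotes."""
    soup = bs4.BeautifulSoup(html, features='html.parser')
    for node in list(soup.strings):
        text = str(node)
        # Insert real Unicode characters (U+201C / U+201D), not entity text,
        # so there is no '&' for the serializer to escape.
        new_text = QUOTED.sub(lambda m: '\u201c' + m.group(1) + '\u201d', text)
        if new_text != text:
            node.replace_with(new_text)
    # formatter='html' turns non-ASCII characters into named entities
    # where bs4 knows one.
    return soup.encode(formatter='html')


raw = b"""<p><div class="speech">it's hard to find his "good" side! He has <i>none</i>!<div></p>"""
print(curl_quotes(raw))
```

Attribute values such as `class="speech"` are not text nodes, so their quotes are left alone.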

5 Comments

Please see updated Question; we are almost there... but not quite.
Normally you'd need to open another post (and do a little research first) to ask a different question but I'm feeling nice today :P
I'll point out that this re.sub() pattern will only work on matched pairs of quotes within a single HTML string. You probably want something more like what Word does for smart quotes: if the quote is followed by a letter, it should be a left quote; if it is followed by a space or punctuation, it should be a right quote.
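
The heuristic described in this comment could be sketched as follows. This is only an illustration: the function name `smart_double_quotes` is hypothetical, and treating digits like letters and end-of-text like punctuation are assumptions added here, not part of the comment.

```python
import re


def smart_double_quotes(text: str) -> str:
    """Pick a left or right curly quote from the character after each '"'."""

    def pick(match):
        following = text[match.end():match.end() + 1]
        # Assumption: a quote directly before a letter or digit opens;
        # before a space, punctuation, or the end of the text it closes.
        return '\u201c' if following.isalnum() else '\u201d'

    return re.sub('"', pick, text)


print(smart_double_quotes('it\'s hard to find his "good" side!'))
```

Because each quote is decided on its own, this does not need the opening and closing quote to sit in the same text node, but it will guess wrong for an opening quote that is followed by punctuation.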
The cure is worse than the illness. Using bs4.BeautifulSoup(replace_quotes(e)) wraps the string in <html><body><p>...</p></body></html>; the outer <html><body>...</body></html> are removed by replace_with, but <p>...</p> remains and wreaks havoc with the formatting. I will accept your answer because I made it work by simply replacing &ldquo; with the literal quote character, as I don't really care about HTML entities in the final product. Thanks.
I am aware of the limitations of re.sub(...), but I still think this is the best option. One of my actual snippets reads: <br/>« <span class="speech">Ho visto io quello che è successo. La ruota è passata su un pietrone che la ha spostata di un palmo. Un palmo più in là c’era il vuoto. Non ho fatto a tempo nemmeno a dirti "attento!"</span> »<br/>« <span class="speech">Andiamo a vedere?</span> »<br/>. It's not obvious what to do with the closing quote, as it is surrounded by punctuation (and this is only the very first hit). Any insight is welcome.
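
For the original requirement (output as close to the input as possible, no reformatting), a standard-library-only approach may be worth considering. The sketch below subclasses `html.parser.HTMLParser`, re-emits markup as it was read, and passes only text through a transform. The names `TextRewriter`, `rewrite_text` and `curl_quotes` are hypothetical. Known limitations of this sketch: end tags are re-emitted in lowercase, an entity written without its semicolon gets one added, and CDATA sections are dropped.

```python
import re
from html.parser import HTMLParser

RAW_TEXT_TAGS = {'script', 'style'}  # their contents are code, not prose


class TextRewriter(HTMLParser):
    """Re-emit the input, applying `transform` only to text between tags."""

    def __init__(self, transform):
        # convert_charrefs=False reports entities separately instead of
        # decoding them, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.transform = transform
        self.out = []
        self.pending = []  # text and entities seen since the last tag
        self.in_raw_text = False

    def flush(self):
        if self.pending:
            text = ''.join(self.pending)
            self.out.append(text if self.in_raw_text else self.transform(text))
            self.pending = []

    def emit_markup(self, markup):
        self.flush()
        self.out.append(markup)

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.emit_markup(self.get_starttag_text())
        self.in_raw_text = tag in RAW_TEXT_TAGS

    def handle_startendtag(self, tag, attrs):
        self.emit_markup(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.emit_markup(f'</{tag}>')
        self.in_raw_text = False

    def handle_data(self, data):
        self.pending.append(data)

    def handle_entityref(self, name):
        self.pending.append(f'&{name};')

    def handle_charref(self, name):
        self.pending.append(f'&#{name};')

    def handle_comment(self, data):
        self.emit_markup(f'<!--{data}-->')

    def handle_decl(self, decl):
        self.emit_markup(f'<!{decl}>')

    def handle_pi(self, data):
        self.emit_markup(f'<?{data}>')


def rewrite_text(html, transform):
    parser = TextRewriter(transform)
    parser.feed(html)
    parser.close()
    parser.flush()  # text after the last tag, if any
    return ''.join(parser.out)


def curl_quotes(text):
    return re.sub(r'"([^"]+)"', r'&ldquo;\1&rdquo;', text)


src = '<p><div class="speech">it\'s hard to find his "good" side! He has <i>none</i>!<div></p>'
print(rewrite_text(src, curl_quotes))
```

Since nothing is built into a tree and serialized again, the unclosed `<div>` in the sample stays as it is, and the inserted `&ldquo;` is never escaped.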
