Remove HTML tag contents from page using Python

Question

I have an HTML file like the one below:

<!DOCTYPE HTML>
<html>

<head>

<title>Sezione microbiologia</title>
<link rel="stylesheet" src="./style.css">

</head>

<body>

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Seconda diluizione</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Terza diluizione</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

    <section id="second">
        <!-- SOME CONTENT... -->
    </section>

    <section id="third">
        <!-- SOME CONTENT... -->
    </section>

    <section id="footer">
        <!-- SOME CONTENT... -->
    </section>
</div>
</body>

</html>

Problem description:

I am trying to modify the <h1> headings that contain the word diluizione, replacing this word and its prefix with "Diluizione seriale". I tried Python's replace(), but it also changes lines in the <p> paragraphs, whereas I only want the text in the h1 tags to be modified. On top of that, I still have not found a way to automatically remove the prefix, i.e. "Prima", "Seconda", "Terza", etc.

The code I tried

So far I have come up with this:

with open('./home.html') as file:
    text = file.read()


if "diluizione" in text:
    text = text.replace("diluizione", "diluizione seriale")

But this outputs:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Prima diluizione seriale</h1>
        <p>Some content including "prima diluizione seriale"...</p>
        <h1>Seconda diluizione seriale</h1>
        <p>Some content including "seconda diluizione seriale"...</p>
        <h1>Terza diluizione seriale</h1>
        <p>Some content including "terza diluizione seriale"...</p>
    </section>

So as you can see, even the text in the <p> tags is affected, and the prefix is still there in the headings.

My desired output would be:

<div id="content">
    <section id="main">
        <!-- SOME CONTENT... -->
        <h1>Diluizione seriale</h1>
        <p>Some content including "prima diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "seconda diluizione"...</p>
        <h1>Diluizione seriale</h1>
        <p>Some content including "terza diluizione"...</p>
    </section>

Any help or suggestion is very appreciated, thanks very much in advance.

BiOS · Accepted Answer · 2021-03-24 19:39:36Z

2

You could use a regular expression through Python's re module to achieve this. To change only the text within the h1 tags, you can use a positive lookbehind and a positive lookahead.

Code:

import re

with open("path/to/home.html") as file:
    text = file.read()

# Raw string so that \w reaches the regex engine without an invalid-escape warning
text = re.sub(r"(?<=<h1>)\w+ \w+(?=</h1>)", "Diluizione seriale", text)

print(text)

Explanation:

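The regular expression (?<=<h1>)\w+ \w+(?=</h1>) matches two words separated by a single space, but only when they are immediately preceded by <h1> and immediately followed by </h1>. The lookbehind and lookahead check for the tags without including them in the match, so only the heading text is replaced. Note that it matches any two-word heading, whether or not it contains diluizione.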

Output:

<!-- SOME CONTENT... -->
<h1>Diluizione seriale</h1>
<p>Some content including "prima diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "seconda diluizione"...</p>
<h1>Diluizione seriale</h1>
<p>Some content including "terza diluizione"...</p>
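If the headings can have attributes, a prefix of more than one word, or different capitalisation, a slightly broader pattern may help. The following is only a sketch under those assumptions, not part of the original answer; the names H1_DILUIZIONE and replace_headings are made up for illustration. It captures the opening and closing h1 tags and replaces whatever text sits between them, but only when that text contains diluizione.

```python
import re

# Hypothetical pattern: an <h1> (with optional attributes) whose text
# contains "diluizione" in any case, followed by its closing tag.
# [^<]* keeps the match inside a single heading with no nested tags.
H1_DILUIZIONE = re.compile(
    r"(<h1[^>]*>)[^<]*diluizione[^<]*(</h1>)",
    re.IGNORECASE,
)


def replace_headings(html, replacement="Diluizione seriale"):
    """Replace the text of every matching <h1>, keeping the tags as written."""
    # A function is used as the replacement so that backslashes in
    # `replacement` are never interpreted as group references.
    return H1_DILUIZIONE.sub(
        lambda m: m.group(1) + replacement + m.group(2), html
    )


sample = (
    '<h1>Prima diluizione</h1>'
    '<p>Some content including "prima diluizione"...</p>'
    '<h1 class="t">La seconda Diluizione</h1>'
    '<h1>Altro titolo</h1>'
)
result = replace_headings(sample)
print(result)
```

Headings without the keyword, such as the last one in the sample, are left alone. Headings that contain nested tags (for example a span inside the h1) would not match this pattern; an HTML parser is the safer choice for that case.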


Keegan Cowle · Accepted Answer · 2021-03-24 19:37:10Z

1

Have a look at html.parser. Instead of trying to do string manipulation, parse the HTML into a structure and then traverse it from there.
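As a rough illustration of that idea, here is a minimal sketch using only the standard library's html.parser. The class name HeadingRewriter and the helper rewrite_headings are made up for this example. It re-emits the document as it is parsed and swaps the text only while inside an h1 that contains the keyword. It assumes the heading text is plain (no nested tags or entities) and it does not handle processing instructions.

```python
from html.parser import HTMLParser


class HeadingRewriter(HTMLParser):
    """Re-emit HTML unchanged, except for <h1> text containing a keyword."""

    def __init__(self, keyword, replacement):
        # convert_charrefs=False keeps entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.keyword = keyword.lower()
        self.replacement = replacement
        self.in_h1 = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
        # get_starttag_text() returns the tag exactly as it appeared
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
        self.out.append(f"</{tag}>")  # note: tag name arrives lowercased

    def handle_data(self, data):
        if self.in_h1 and self.keyword in data.lower():
            data = self.replacement
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def rewrite_headings(html, keyword="diluizione",
                     replacement="Diluizione seriale"):
    parser = HeadingRewriter(keyword, replacement)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<h1>Prima diluizione</h1>\n'
          '<p>Some content including "prima diluizione"...</p>')
result = rewrite_headings(sample)
print(result)
```

Because the parser tracks whether it is inside an h1, the paragraph text is never touched, and the prefix does not need to be known in advance. Third-party parsers can do the same with less code, but this version needs no extra installation.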


1 Comment

occhietto Over a year ago

Thanks for your answer! I have opted for the regex strategy, but will definitely take a look at HTML parsers. Thanks for your time.
