The right way to strip all tags except some in Python

Question

For example, I have HTML that contains markup like this:

<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>

I want to remove all tag attributes and keep only some tags (for example, remove the table, tr, th tags), so that I get something like this:

<a href="some">anchor</a>
<table>
    <tr>
        <td>

        </td>
    </tr>
</table>
<p>content</p>

I currently do this with a for loop, but my code retrieves each tag and cleans it one at a time. I think this approach is slow.

What would you suggest? Thanks.

Update #1

In my solution I use this code for removing tags (borrowed from Django):

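import re

def remove_tags(html, tags):
    """Return the given HTML with the given tags removed (their content is kept)."""
    # `tags` is a space-separated string of tag names, e.g. "table tr td".
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    # Match opening tags (with or without attributes), then closing tags.
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html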

And this code to clean HTML attributes:

# Note: this code does not remove empty tags (tags without content), such as `<div><img></div>`
import lxml.html.clean

html = 'Some html code'

safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)

You say "i want remove all tags attributes" and then your example output starts with <a href="some"> where href of course is a tag attribute. This makes your request contradictory and thus impossible to satisfy. Please edit the question to remove the contradiction.
– Alex Martelli, Commented Dec 21, 2014 at 1:45

Padraic Cunningham · Accepted Answer · 2014-12-20 18:04:44Z

4

Use BeautifulSoup.

html = """
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>
"""

from bs4 import BeautifulSoup

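# Name the parser explicitly; lxml produces the <html>/<body> wrapper shown below
soup = BeautifulSoup(html, "lxml")

# Empty the attribute dicts in place
soup.table.tr.td.attrs.clear()
soup.table.attrs.clear()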
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table>
   <tr>
    <td>
    </td>
   </tr>
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>

To remove everything inside a tag:

soup = BeautifulSoup(html, "lxml")

soup.table.clear()
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table id="some">
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>

To delete a particular attribute:

soup = BeautifulSoup(html, "lxml")

td_tag = soup.table.td
del td_tag['class']
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table id="some">
   <tr>
    <td>
    </td>
   </tr>
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>
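
The per-tag snippets above can be generalized to the whole document. What follows is only a minimal sketch, not a tuned implementation: the names `ALLOWED_TAGS`, `ALLOWED_ATTRS`, and `clean_html` are hypothetical, and the whitelist contents are an assumption about what the question wants to keep (in particular, keeping `href` on `<a>`). It uses the built-in `html.parser` backend so that no `<html>`/`<body>` wrapper is added.

```python
from bs4 import BeautifulSoup

# Hypothetical policy for illustration: tags to keep, and the attributes
# each kept tag may retain. Adjust both to your own needs.
ALLOWED_TAGS = {"a", "p", "table", "tr", "td"}
ALLOWED_ATTRS = {"a": {"href"}}


def clean_html(html):
    """Unwrap tags outside ALLOWED_TAGS and drop attributes outside ALLOWED_ATTRS."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) matches every tag; copy to a list so the tree can be
    # modified while looping.
    for tag in list(soup.find_all(True)):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()  # remove the tag itself but keep its children
            continue
        keep = ALLOWED_ATTRS.get(tag.name, set())
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in keep}
    return str(soup)


sample = (
    '<a href="some" class="some" onclick="return false;">anchor</a>'
    '<div><p class="" style="">content</p></div>'
)
print(clean_html(sample))
```

For the sample input this prints `<a href="some">anchor</a><p>content</p>`: the `div` is unwrapped, and every attribute except `href` on the link is dropped. To delete a tag together with its contents, `tag.decompose()` could be used in place of `tag.unwrap()`.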

edited Dec 20, 2014 at 18:04

answered Dec 20, 2014 at 16:54

Padraic Cunningham


4 Comments

Patrick Burns Over a year ago

In any case, BeautifulSoup is very slow; I use lxml instead. Also, in your example I see that I would have to set the attributes to None for each tag I want to clean.

Patrick Burns Over a year ago

I use re only for removing tags.

Padraic Cunningham Over a year ago

@AlexAntonov, how much time does it take?

Patrick Burns Over a year ago

The code I pasted above takes a few seconds to run.

frainfreeze · Accepted Answer · 2014-12-20 16:55:14Z

1

What you are looking for is called parsing.

BeautifulSoup is one of the most popular and widely used libraries for parsing HTML. You can use it to remove tags, and it is well documented.

If for some reason you cannot use BeautifulSoup, look into Python's re module.
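
If third-party libraries are not an option, the standard library's `html.parser` module is another route besides `re`, and it tends to be less brittle than regular expressions on real-world markup. The following is only a rough sketch under stated assumptions: `TagFilter`, `strip_tags_except`, `ALLOWED_TAGS`, and `ALLOWED_ATTRS` are hypothetical names, the whitelist is an example policy, and special cases such as `<script>` or `<style>` content are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration.
ALLOWED_TAGS = {"a", "p", "table", "tr", "td"}
ALLOWED_ATTRS = {"a": {"href"}}


class TagFilter(HTMLParser):
    """Rebuild the markup, keeping only whitelisted tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag; its text still arrives via handle_data
        keep = ALLOWED_ATTRS.get(tag, set())
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in keep
        )
        self.parts.append("<%s%s>" % (tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        # The parser has already decoded character references, so re-escape.
        self.parts.append(escape(data, quote=False))


def strip_tags_except(html):
    """Return html with non-whitelisted tags and attributes removed."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_tags_except('<div><a href="x" class="c">hi &amp; bye</a><img src="i.png"></div>'))
```

For the sample input this prints `<a href="x">hi &amp; bye</a>`. The design is a single streaming pass that never builds a tree, which may help when speed is the main concern, though that should be measured on your own data.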

answered Dec 20, 2014 at 16:55

frainfreeze


1 Comment

Patrick Burns Over a year ago

For cleaning and parsing HTML I use only lxml and re, because BS4 is very slow.
