For example, I have HTML that contains markup like this:

<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>

I want to remove all tag attributes and keep only some tags (for example, remove the table, tr, and th tags), so that I get something like this:

<a href="some">anchor</a>
<table>
    <tr>
        <td>

        </td>
    </tr>
</table>
<p>content</p>

I do this with a for loop, but my code retrieves each tag and cleans it one at a time, and I think this approach is slow.

What would you suggest? Thanks.

Update #1

In my solution I use this code to remove tags (taken from Django):

import re

def remove_tags(html, tags):
    """Returns the given HTML with given tags removed."""
    # `tags` is a space-separated string of tag names, e.g. "table tr td".
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html

And this code to clean the HTML attributes:

# Note: this code doesn't remove empty tags (tags without content, etc.) such as `<div><img></div>`
import lxml.html.clean

html = 'Some html code'

safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)
  • If you want feedback, share your code. Commented Dec 20, 2014 at 16:20
  • @user590028, I added my code to the question. Commented Dec 20, 2014 at 17:00
  • You say "i want remove all tags attributes" and then your example output starts with <a href="some"> where href of course is a tag attribute. This makes your request contradictory and thus impossible to satisfy. Please edit the question to remove the contradiction. Commented Dec 21, 2014 at 1:45

2 Answers

Use BeautifulSoup.

html = """
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>
"""

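from bs4 import BeautifulSoup

# Name the parser explicitly; the <html>/<body> wrapper in the output below is what lxml produces.
soup = BeautifulSoup(html, "lxml")

# Remove every attribute from the <td> and <table> tags.
# (Clearing the dict is safer than `del tag.attrs`, which removes the attribute dict itself.)
soup.table.tr.td.attrs.clear()
soup.table.attrs.clear()
print(soup.prettify())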

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table>
   <tr>
    <td>
    </td>
   </tr>
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>

To clear a tag's contents:

soup = BeautifulSoup(html, "lxml")

soup.table.clear()
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table id="some">
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>

To delete a particular attribute:

soup = BeautifulSoup(html, "lxml")

td_tag = soup.table.td
del td_tag['class']
print(soup.prettify())

<html>
 <body>
  <a class="some" href="some" onclick="return false;">
   anchor
  </a>
  <table id="some">
   <tr>
    <td>
    </td>
   </tr>
  </table>
  <p class="" style="">
   content
  </p>
 </body>
</html>
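
The per-tag deletes above can be generalized into a single pass over the whole document. What follows is only a sketch under stated assumptions, not code from the original answer: the `clean` helper and its `keep_attrs` and `drop_tags` parameters are names invented for illustration, and it uses the built-in "html.parser" backend so that no <html>/<body> wrapper is added to the fragment.

```python
from bs4 import BeautifulSoup

html = """
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
    <tr>
        <td class="some">
        </td>
    </tr>
</table>
<p class="" style="">content</p>
"""


def clean(html, keep_attrs=("href",), drop_tags=()):
    """Hypothetical helper: keep only whitelisted attributes and
    remove the markup of unwanted tags (their children are kept)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of every tag, so it is safe
    # to modify the tree while looping over it.
    for tag in soup.find_all(True):
        if tag.name in drop_tags:
            # unwrap() replaces the tag with its own contents.
            tag.unwrap()
        else:
            tag.attrs = {
                name: value
                for name, value in tag.attrs.items()
                if name in keep_attrs
            }
    return str(soup)


# Keep all tags, strip every attribute except href.
cleaned = clean(html)
print(cleaned)

# Additionally drop the table markup.
without_tables = clean(html, drop_tags=("table", "tr", "td", "th"))
print(without_tables)
```

If the children of a dropped tag should disappear as well, `tag.decompose()` could be used in place of `tag.unwrap()`. Whether this is fast enough depends on the size of the input, so it is worth timing on real data.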

4 Comments

In any case, BeautifulSoup is very slow, so I use lxml instead. And in your example I see that for each tag element that I want to clean, I must set it to None.
I use re only for removing tags.
@AlexAntonov, how much time does it take?
The code that I pasted above takes a few seconds to execute.

What you are looking for is called parsing.

BeautifulSoup is one of the most popular libraries for parsing HTML. You can use it to remove tags, and it is well documented.

If for some reason you cannot use BeautifulSoup, look into Python's re module.
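
If a third-party parser is not an option, the standard library's html.parser module is another way to do the parsing without resorting to regular expressions. The following is a minimal sketch, not a definitive implementation: the `AttributeStripper` class, the `strip_html` function, and the `keep_attrs` and `drop_tags` parameters are hypothetical names chosen for this example. It re-emits the markup as it is parsed, and it silently drops comments and doctype declarations.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit HTML, keeping only whitelisted attributes and
    skipping the markup of unwanted tags (their text is kept)."""

    def __init__(self, keep_attrs=("href",), drop_tags=()):
        # convert_charrefs=False passes entities such as &amp; through
        # to handle_entityref/handle_charref instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.keep_attrs = set(keep_attrs)
        self.drop_tags = set(drop_tags)
        self.out = []

    def _emit_open(self, tag, attrs, closing=">"):
        if tag in self.drop_tags:
            return
        # Attribute values arrive unescaped, so escape them again.
        kept = "".join(
            ' %s="%s"' % (name, escape(value or "", quote=True))
            for name, value in attrs
            if name in self.keep_attrs
        )
        self.out.append("<%s%s%s" % (tag, kept, closing))

    def handle_starttag(self, tag, attrs):
        self._emit_open(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>.
        self._emit_open(tag, attrs, closing="/>")

    def handle_endtag(self, tag):
        if tag not in self.drop_tags:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def strip_html(html, **options):
    parser = AttributeStripper(**options)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_html('<a href="some" class="some" onclick="return false;">anchor</a>'))
print(strip_html('<table id="some"><tr><td class="some">cell</td></tr></table>',
                 drop_tags=("table", "tr", "td", "th")))
```

Because the parser handles the whole document in one pass, there is no per-tag lookup loop; how it compares in speed to lxml would need to be measured on real input.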

1 Comment

For cleaning and parsing HTML I use only lxml and re, because BS4 is very slow.
