For example, I have HTML that contains markup like this:
<a href="some" class="some" onclick="return false;">anchor</a>
<table id="some">
<tr>
<td class="some">
</td>
</tr>
</table>
<p class="" style="">content</p>
I want to remove all tag attributes and keep only some tags (for example, remove the table, tr, th tags), so I want to get something like this:
<a href="some">anchor</a>
<table>
<tr>
<td>
</td>
</tr>
</table>
<p>content</p>
I currently do this with a for loop, but my code retrieves each tag and cleans it one at a time, and I think this approach is slow.
What would you suggest? Thanks.
Update #1
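In my solution I use this code for removing tags (taken from Django):
import re

def remove_tags(html, tags):
    """Returns the given HTML with given tags removed."""
    # tags is a space-separated string of tag names, e.g. "table tr th"
    tags = [re.escape(tag) for tag in tags.split()]
    tags_re = '(%s)' % '|'.join(tags)
    # Matches opening tags, with or without attributes, and self-closing tags
    starttag_re = re.compile(r'<%s(/?>|(\s+[^>]*>))' % tags_re, re.U)
    endtag_re = re.compile('</%s>' % tags_re)
    # Only the tags themselves are removed; the content between them is kept
    html = starttag_re.sub('', html)
    html = endtag_re.sub('', html)
    return html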
And this code to clean HTML attributes:
# But this code doesn't remove empty tags (tags without content, etc.) like `<div><img></div>`
import lxml.html.clean
html = 'Some html code'
safe_attrs = lxml.html.clean.defs.safe_attrs
cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
html = cleaner.clean_html(html)
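For comparison, here is a minimal sketch of doing both steps (dropping unwanted tags and stripping attributes) in a single pass over the document, using only the standard library's html.parser. The names TagFilter, clean_html and the allowed mapping are hypothetical and not part of Django or lxml. It assumes that the text inside a dropped tag should be kept, as remove_tags above does; note that this also keeps the text inside script and style tags, and that comments and the doctype are discarded.

```python
from html import escape
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Re-emit HTML, dropping disallowed tags and attributes in one pass."""

    def __init__(self, allowed):
        # allowed (hypothetical name) maps a tag name to the set of
        # attribute names that may stay on that tag
        super().__init__(convert_charrefs=False)
        self.allowed = allowed
        self.out = []

    def _start(self, tag, attrs, closing=""):
        if tag not in self.allowed:
            return  # drop the tag itself; its text content is still emitted
        parts = []
        for name, value in attrs:
            if name not in self.allowed[tag]:
                continue
            if value is None:  # attribute without a value, e.g. "disabled"
                parts.append(" " + name)
            else:  # the parser unescaped the value, so escape it again
                parts.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s%s>" % (tag, "".join(parts), closing))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, " /")

    def handle_endtag(self, tag):
        if tag in self.allowed:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)  # pass entities through unchanged

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def clean_html(html, allowed):
    parser = TagFilter(allowed)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(clean_html('<a href="some" class="some" onclick="return false;">anchor</a>',
                 {"a": {"href"}, "p": set()}))
# prints: <a href="some">anchor</a>
```

Because the whitelist is per tag, keeping href on a while stripping every attribute from p is a matter of what goes into the mapping. Whether this is faster than the regex plus lxml combination would need to be measured on real input.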
<a href="some"> where href of course is a tag attribute. This makes your request contradictory and thus impossible to satisfy. Please edit the question to remove the contradiction.