Escape special HTML characters in Python

Question

I have a string where special characters like ' or " or & (...) can appear. In the string:

string = """ Hello "XYZ" this 'is' a test & so on """

how can I automatically escape every special character, so that I get this:

string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "

kennytm · Accepted Answer · 2013-01-13 17:28:28Z

57

In Python 3.2 and later, you can use the html.escape function, e.g.

>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
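
As a small illustration of the call above, here is a sketch of how the quote flag changes the result and how html.unescape (available since Python 3.4) reverses it. The variable names are only for illustration.

```python
import html

s = """ Hello "XYZ" this 'is' a test & so on """

# quote=True is the default, so " and ' are escaped as well.
escaped = html.escape(s)

# With quote=False only &, < and > are escaped.
text_only = html.escape(s, quote=False)

# html.unescape turns the entities back into characters.
restored = html.unescape(escaped)

print(escaped)
print(text_only)
print(restored == s)
```

quote=False is enough for text between tags; keep the default when the result ends up inside an attribute value.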

For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:

The cgi module that comes with Python has an escape() function:
import cgi

s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"
However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".

Here's a small snippet that will let you escape quotes and apostrophes as well:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c, c) for c in text)
You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.
from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
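
To see the saxutils pair in action, here is a minimal round-trip sketch. The table names quote_entities and unquote_entities are made up for this example; escape() and unescape() are the real functions from xml.sax.saxutils.

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements on top of the built-in &, < and > handling.
quote_entities = {'"': "&quot;", "'": "&apos;"}
# Reverse mapping for decoding.
unquote_entities = {v: k for k, v in quote_entities.items()}

original = """ Hello "XYZ" this 'is' a test & so on """
escaped = escape(original, quote_entities)
restored = unescape(escaped, unquote_entities)

print(escaped)
print(restored == original)
```

Note that this emits &apos;, which is an XML entity; if the output must be valid HTML 4, map "'" to "&#39;" instead.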

edited Jan 13, 2013 at 17:28

answered Jan 16, 2010 at 12:30

kennytm


3 Comments

leetNightshade Over a year ago

Note, a number of your replacements aren't HTML compliant. One for example: w3.org/TR/xhtml1/#C_16 Instead of &apos;, use &#39;. I guess a few others were added to the HTML4 standard, but that one wasn't.

Дмитро Олександрович Over a year ago

I came here in search of a way to unescape special characters, and I found that the html module has the unescape() function :) html.unescape('some &#39;single quotation marks&#39;')

Kaki In Dec 16, 2024 at 19:37

Quotes are replaced by default; they are left alone only if you set the second argument (quote) to False (docs.python.org/3/library/html.html#html.escape)

Robert Christie · Accepted Answer · 2010-01-16 12:34:34Z

5

The cgi.escape function will convert special characters to valid HTML entities:

import cgi
original_string = 'Hello "XYZ" this \'is\' a test & so on '
escaped_string = cgi.escape(original_string, True)  # True: also escape double quotes
print(original_string)
print(escaped_string)

will result in

Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on

The optional second parameter of cgi.escape escapes double quotes. By default, they are not escaped.
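
Since cgi.escape was removed in Python 3.8, code that has to run on both old and new interpreters might wrap the call. The following is only a sketch: the wrapper name escape_html is made up, and the fallback branch adds the single-quote replacement by hand so both branches give the same output.

```python
try:
    # Python 3.2+: html.escape handles &, <, >, " and '.
    from html import escape as _escape

    def escape_html(text):
        return _escape(text, quote=True)
except ImportError:
    # Older Pythons: cgi.escape handles &, <, > and (with True) ".
    from cgi import escape as _escape

    def escape_html(text):
        # Single quotes are not covered by cgi.escape, so add them here.
        return _escape(text, True).replace("'", "&#x27;")

print(escape_html('Hello "XYZ" this \'is\' a test & so on '))
```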

answered Jan 16, 2010 at 12:34

Robert Christie


3 Comments

Ned Batchelder Over a year ago

I don't understand why cgi.escape is so squeamish about converting quotes, and ignores single quotes entirely.

Mike DeSimone Over a year ago

Because quotes do not need to be escaped in PCDATA, they do need to be escaped in attributes (which, far more often than not, use double quotes for delimiters), and the former case is far more common than the latter. In general, it's a textbook 90% solution (more like >99%). If you have to save every last byte and want it to dynamically figure out which type of quoting does so, use xml.sax.saxutils.quoteattr().

nigh_anxiety Over a year ago

Just a note that cgi is deprecated as of Python 3.11 and will be removed in Python 3.13. PEP 594
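
For the xml.sax.saxutils.quoteattr() function mentioned in the comments above, here is a short sketch of how it picks the delimiter. The sample strings are invented for illustration.

```python
from xml.sax.saxutils import quoteattr

samples = [
    'plain',            # no quotes: wrapped in double quotes
    'say "hi"',         # only double quotes: wrapped in single quotes
    "it's",             # only a single quote: wrapped in double quotes
    'say "hi", it\'s',  # both: double quotes become &quot;
]

for value in samples:
    # The result already includes the surrounding quote characters.
    print('<input value=%s>' % quoteattr(value))
```

Because the return value carries its own delimiters, it is dropped into the tag without adding quotes around it.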

Ned Batchelder · Accepted Answer · 2010-01-16 13:10:04Z

4

A simple string function will do it:

def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

Other answers in this thread have minor problems: The cgi.escape method for some reason ignores single-quotes, and you need to explicitly ask it to do double-quotes. The wiki page linked does all five, but uses the XML entity &apos;, which isn't an HTML entity.

This function does all five all the time, using HTML-standard entities.
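
One detail worth spelling out about the function above is that the "&" replacement has to come first. The sketch below repeats the function and compares it with a deliberately broken variant (escape_wrong_order is a made-up name) that handles "&" last.

```python
def escape(t):
    """Same replacements as above: '&' is handled first."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

def escape_wrong_order(t):
    """Broken on purpose: '&' is handled after '<'."""
    return t.replace("<", "&lt;").replace("&", "&amp;")

print(escape("<b> & 'x'"))        # &lt;b&gt; &amp; &#39;x&#39;
print(escape_wrong_order("<b>"))  # &amp;lt;b>  (the new & got escaped again)
```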

answered Jan 16, 2010 at 13:10

Ned Batchelder


Brōtsyorfuzthrāx · Accepted Answer · 2014-06-23 19:51:59Z

The other answers here will help with characters such as the ones you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you. You'll want to do something like the following, which uses html.entities.entitydefs, which is just a dictionary. (The following code is written for Python 3.x, but there's a partial attempt at making it compatible with 2.x to give you an idea):

# -*- coding: utf-8 -*-

import sys

if sys.version_info[0]>2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs

text = ";\"áèïøæỳ"  # This is your string variable containing the stuff you want to convert.
# Convert literal semicolons first, so the semicolons that terminate the
# entity names added below are not converted themselves.
text = text.replace(";", "&semi;")

# .items() works in both Python 2 and 3, so one loop is enough.
# "amp" is skipped so the "&" characters already inserted are not escaped again.
for k, v in entitydefs.items():
    if k not in {"semi", "amp"}:
        text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text=text.replace("ŷ", "&ycirc;")
text=text.replace("Ŷ", "&Ycirc;")
text=text.replace("ŵ", "&wcirc;")
text=text.replace("Ŵ", "&Wcirc;")
text=text.replace("ỳ", "&#7923;")
text=text.replace("Ỳ", "&#7922;")
text=text.replace("ẃ", "&wacute;")
text=text.replace("Ẃ", "&Wacute;")
text=text.replace("ẁ", "&#7809;")
text=text.replace("Ẁ", "&#7808;")

print(text)
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
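
As an alternative sketch (Python 3 only, and not the code above), the same idea can be done in a single pass with html.entities.codepoint2name, which avoids the placeholder step and also escapes "&" safely. The function name escape_named is made up, and falling back to numeric references for characters without an HTML 4 entity name is a choice made for this example.

```python
from html.entities import codepoint2name

def escape_named(text):
    """Replace characters with named entities where HTML 4 defines one."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append("&" + name + ";")   # e.g. & -> &amp;
        elif ord(ch) > 127:
            out.append("&#%d;" % ord(ch))  # no name: numeric reference
        else:
            out.append(ch)                 # plain ASCII stays as-is
    return "".join(out)

print(escape_named("caf\u00e9 & \u1ef3"))  # caf&eacute; &amp; &#7923;
```

Because each input character is looked at exactly once, text that was just produced is never re-escaped.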
