29

I have a string where special characters like ' or " or & (...) can appear. In the string:

string = """ Hello "XYZ" this 'is' a test & so on """

how can I automatically escape every special character, so that I get this:

string = " Hello "XYZ" this 'is' a test & so on "

4 Answers 4

57

In Python 3.2, you could use the html.escape function, e.g.

>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello "XYZ" this 'is' a test & so on '

For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:

The cgi module that comes with Python has an escape() function:

import cgi

s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"

However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".


Here's a small snippet that will let you escape quotes and apostrophes as well:

 html_escape_table = {
     "&": "&amp;",
     '"': "&quot;",
     "'": "&apos;",
     ">": "&gt;",
     "<": "&lt;",
     }

 def html_escape(text):
     """Produce entities within text."""
     return "".join(html_escape_table.get(c,c) for c in text)

You can also use escape() from xml.sax.saxutils to escape html. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.

from xml.sax.saxutils import escape, unescape
# escape() and unescape() takes care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
Sign up to request clarification or add additional context in comments.

3 Comments

Note, a number of your replacements aren't HTML compliant. One for example: w3.org/TR/xhtml1/#C_16 Instead of &apos;, use &#39; I guess a few others were added to the HTML4 standard, but that one wasn't.
I came here in search of a way to unescape special characters, and I found that the HTML module has the unescape() method :) html.unescape('some &#x27;single quotation marks&#x27;')
Quotes are not replaced if you omit to set a second argument to True (docs.python.org/3/library/html.html#html.escape)
5

The cgi.escape method will convert special charecters to valid html tags

 import cgi
 original_string = 'Hello "XYZ" this \'is\' a test & so on '
 escaped_string = cgi.escape(original_string, True)
 print original_string
 print escaped_string

will result in

Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on 

The optional second paramter on cgi.escape escapes quotes. By default, they are not escaped

3 Comments

I don't understand why cgi.escape is so squeamish about converting quotes, and ignores single quotes entirely.
Because quotes do not need to be escaped in PCDATA, they do need to be escaped in attributes (which, far more often than not, use double quotes for delimiters), and the former case is far more common than the latter. In general, it's a textbook 90% solution (more like >99%). If you have to save every last byte and want it to dynamically figure out which type of quoting does so, use xml.sax.saxutils.quoteattr().
Just a note that cgi is deprecated as of Python 3.11 and will be removed in Python 3.13. PEP 594
4

A simple string function will do it:

def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

Other answers in this thread have minor problems: The cgi.escape method for some reason ignores single-quotes, and you need to explicitly ask it to do double-quotes. The wiki page linked does all five, but uses the XML entity &apos;, which isn't an HTML entity.

This code function does all five all the time, using HTML-standard entities.

Comments

2

The other answers here will help with such as the characters you listed and a few others. However, if you also want to convert everything else to entity names, too, you'll have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you there. You'll want to do something like this that uses html.entities.entitydefs, which is just a dictionary. (The following code is made for Python 3.x, but there's a partial attempt at making it compatible with 2.x to give you an idea):

# -*- coding: utf-8 -*-

import sys

if sys.version_info[0]>2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs

text=";\"áèïøæỳ" #This is your string variable containing the stuff you want to convert
text=text.replace(";", "$ஸ$") #$ஸ$ is just something random the user isn't likely to have in the document. We're converting it so it doesn't convert the semi-colons in the entity name into entity names.
text=text.replace("$ஸ$", "&semi;") #Converting semi-colons to entity names

if sys.version_info[0]>2: #Using appropriate code for each Python version.
    for k,v in entitydefs.items():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.
else:
    for k,v in entitydefs.iteritems():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.

#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text=text.replace("ŷ", "&ycirc;")
text=text.replace("Ŷ", "&Ycirc;")
text=text.replace("ŵ", "&wcirc;")
text=text.replace("Ŵ", "&Wcirc;")
text=text.replace("ỳ", "&#7923;")
text=text.replace("Ỳ", "&#7922;")
text=text.replace("ẃ", "&wacute;")
text=text.replace("Ẃ", "&Wacute;")
text=text.replace("ẁ", "&#7809;")
text=text.replace("Ẁ", "&#7808;")

print(text)
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.