How to convert a Python string

Question

How would I convert this string

'\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '

into

'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '

keeping in mind that I would also like to do this for '\t' and all other escaped characters? The code for the reverse direction is:

def fix_string(s):
    r""" takes the string and replaces any `\n` with `\\n` so that the read file will be recognized """
    # escape chars = \t , \b , \n , \r , \f , \' , \" , \\
    new_s = ''
    for i in s:
        if i == '\t':
            new_s += '\\t'
        elif i == '\b':
            new_s += '\\b'
        elif i == '\n':
            new_s += '\\n'
        elif i == '\r':
            new_s += '\\r'
        elif i == '\f':
            new_s += '\\f'
        elif i == '\'':
            new_s += "\\'"
        elif i == '\"':
            new_s += '\\"'
        else:
            new_s += i
    return new_s

Would I need to look at the actual numeric values of the characters and check the next character, say, when I find a ('\', 92) character followed by an ('n', 110)?

@TheSoundDefense - no. I am just giving an example of how I would do this the reverse way. – baallezx, Commented Jul 16, 2014 at 17:38
Does your string actually contain the three characters '\\n'? Or is it appearing in some escaped form? – Jonathon Reinhart, Commented Jul 16, 2014 at 17:39

Martijn Pieters · Accepted Answer · 2014-07-16 18:30:34Z

4

Don't reinvent the wheel here. Python has your back. Besides, handling escape syntax correctly is harder than it looks.

The correct way to handle this

In Python 2, use the str-to-str string_escape codec:

string.decode('string_escape')

This interprets any Python-recognized string escape sequences for you, including \n and \t.

Demo:

>>> string = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
>>> string.decode('string_escape')
'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
>>> print string.decode('string_escape')

    this is a docstring for
    the main function.
    a,
    b,
    c

>>> '\\t\\n\\r\\xa0\\040'.decode('string_escape')
'\t\n\r\xa0 '

In Python 3, use codecs.decode() with the unicode_escape codec:

codecs.decode(string, 'unicode_escape')

as there is no str.decode() method and this is not a str -> bytes conversion.

Demo:

>>> import codecs
>>> string = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
>>> codecs.decode(string, 'unicode_escape')
'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
>>> print(codecs.decode(string, 'unicode_escape'))

    this is a docstring for
    the main function.
    a,
    b,
    c

>>> codecs.decode('\\t\\n\\r\\xa0\\040', 'unicode_escape')
'\t\n\r\xa0 '
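One caveat in Python 3: when codecs.decode() is given a str, the text is first encoded as UTF-8 and the resulting bytes are then read as Latin-1 by unicode_escape, so non-ASCII characters come out garbled. A minimal sketch of a common workaround follows; the helper name unescape_py3 is hypothetical, and the approach assumes the input holds ordinary text plus backslash escapes:

```python
import codecs


def unescape_py3(s):
    """Interpret backslash escapes in a str while keeping non-ASCII text intact."""
    # Characters outside Latin-1 are written as \uXXXX / \UXXXXXXXX escapes by
    # backslashreplace; the unicode_escape decoder then restores them.
    return s.encode('latin-1', 'backslashreplace').decode('unicode_escape')


sample = 'caf\u00e9 \u20ac\\n\\tdone'  # non-ASCII text plus literal \n and \t

print(repr(codecs.decode(sample, 'unicode_escape')))  # non-ASCII text is garbled
print(repr(unescape_py3(sample)))                     # non-ASCII text survives
```

For pure-ASCII input such as the string in the question, both calls give the same result.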

Why straightforward `str.replace()` won't cut it

You could try to do this yourself with str.replace(), but then you also need to implement proper escape parsing. Take \\\\n, for example: this is \\n, escaped. If you naively apply str.replace() calls in sequence, you end up with \n or \\\n instead:

>>> '\\\\n'.decode('string_escape')
'\\n'
>>> '\\\\n'.replace('\\n', '\n').replace('\\\\', '\\')
'\\\n'
>>> '\\\\n'.replace('\\\\', '\\').replace('\\n', '\n')
'\n'

The \\ pair should be replaced by just one \ character, leaving the n uninterpreted. But chained replace calls either replace the trailing \ together with the n with a newline character, or first replace \\ with \ and then replace that \ and the n with a newline. Either way, you end up with the wrong output.
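The same comparison can be reproduced on Python 3, where str.decode() does not exist. This is only a sketch; the variable names are made up for illustration:

```python
import codecs

escaped = '\\\\n'  # three characters: backslash, backslash, n

# Reference result from the codec: one backslash followed by the letter n.
expected = codecs.decode(escaped, 'unicode_escape')

# Newline first: the second backslash and the n are turned into a newline.
newline_first = escaped.replace('\\n', '\n').replace('\\\\', '\\')

# Backslash first: the collapsed pair plus the n is then read as a newline.
backslash_first = escaped.replace('\\\\', '\\').replace('\\n', '\n')

print(repr(expected), repr(newline_first), repr(backslash_first))
```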

The slow way to handle this, manually

You'll have to process the characters one by one instead, pulling in more characters as needed:

_map = {
    '\\\\': '\\',
    "\\'": "'",
    '\\"': '"',
    '\\a': '\a',
    '\\b': '\b',
    '\\f': '\f',
    '\\n': '\n',
    '\\r': '\r',
    '\\t': '\t',
}

def unescape_string(s):
    output = []
    i = 0
    while i < len(s):
        c = s[i]
        i += 1
        if c != '\\' or i == len(s):
            # ordinary character, or a lone trailing backslash: keep as-is
            output.append(c)
            continue
        c += s[i]
        i += 1
        if c in _map:
            output.append(_map[c])
            continue
        if c == '\\x' and i + 2 <= len(s):  # hex escape: two hex digits follow
            point = int(s[i:i + 2], 16)
            i += 2
            output.append(chr(point))
            continue
        if c[1] in '01234567':  # octal escape: one to three octal digits
            while len(c) < 4 and i < len(s) and s[i] in '01234567':
                c += s[i]
                i += 1
            point = int(c[1:], 8)
            output.append(chr(point))
            continue
        # unrecognized escape: keep the backslash and the character unchanged
        output.append(c)
    return ''.join(output)

This handles the \xhh escapes, the standard one-letter escapes, and octal escapes such as \040, but not \uhhhh Unicode code points or \N{name} Unicode name references, nor does it handle malformed escapes in quite the same way as Python would.

But it does handle the escaped escape properly:

>>> unescape_string(string)
'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
>>> unescape_string('\\\\n')
'\\n'

Be aware that this is far slower than using the built-in codec.
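If only the one-letter escapes matter, a single left-to-right pass can also be written with re.sub(). This is a sketch rather than a replacement for the codec: the names _SIMPLE_ESCAPES and unescape_simple are hypothetical, and hex, octal, and Unicode escapes are deliberately left untouched:

```python
import re

# One-letter escapes only; keys are the character that follows the backslash.
_SIMPLE_ESCAPES = {
    '\\': '\\', "'": "'", '"': '"',
    'a': '\a', 'b': '\b', 'f': '\f',
    'n': '\n', 'r': '\r', 't': '\t', 'v': '\v',
}


def unescape_simple(s):
    """Replace one-letter backslash escapes in a single pass over the string."""
    def _replace(match):
        # Unknown escapes are returned unchanged.
        return _SIMPLE_ESCAPES.get(match.group(1), match.group(0))
    # Each backslash consumes the character after it, so an escaped backslash
    # can never be combined with a following n. DOTALL lets "." match a newline.
    return re.sub(r'\\(.)', _replace, s, flags=re.DOTALL)


print(repr(unescape_simple('\\\\n')))
print(repr(unescape_simple('\\n\\tdone')))
```

Because every match is consumed before scanning continues, the ordering problem of chained str.replace() calls does not arise.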

edited Jul 16, 2014 at 18:30

answered Jul 16, 2014 at 17:40

Martijn Pieters


11 Comments

wnnmaw Over a year ago

Not my downvote, but it's probably because the argument for string.decode() is wrong

Eugene K Over a year ago

I'd say it's also over engineered and removes the ability for the asker to learn the fundamentals of something like a "string replace".

Martijn Pieters Over a year ago

@EugeneK: How is this overengineered? The codec exists for just this purpose.

Martijn Pieters Over a year ago

@EugeneK: that's like saying that using a dictionary is over-engineered when the user really should learn how to build a hash table.

Martijn Pieters Over a year ago

@EugeneK: There, added the proper manual way too. Not using str.replace() however.


baallezx · Answer · 2014-07-16 17:49:21Z

0

The simplest solution to this is just to use a str.replace() call:

s = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
s1 = s.replace('\\n','\n')
s1

output

'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
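The value above is the repr that the interactive prompt echoes, which is why the newlines still appear as \n. A small sketch to confirm that the string really contains newline characters (same s and s1 as above):

```python
s = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
s1 = s.replace('\\n', '\n')

print(repr(s1))  # what the interactive prompt echoes: newlines shown as \n
print(s1)        # the text itself, spread over several lines
```

Note that this only covers \n, and chaining further replace() calls runs into trouble once the input contains escaped backslashes.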

edited Jul 16, 2014 at 17:49

answered Jul 16, 2014 at 17:42

baallezx


1 Comment

wnnmaw Over a year ago

Is that the actual output? Because that makes it look like your solution didn't actually work. (they should be on separate lines)

f.rodrigues · Answer · 2014-07-16 17:57:11Z

0

def convert_text(text):
    return text.replace("\\n","\n").replace("\\t","\t")


text = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
print(convert_text(text))

output:

    this is a docstring for
    the main function.
    a,
    b,
    c

answered Jul 16, 2014 at 17:57

f.rodrigues

