How would I convert this string

'\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '

into

'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '

keeping in mind that I would also like to do this for '\t' and all other escape sequences? The code for the reverse direction is:

def fix_string(s):
    """ takes the string and replaces any `\n` with `\\n` so that the read file will be recognized """
    # escape chars = \t , \b , \n , \r , \f , \' , \" , \\
    new_s = ''
    for i in s:
        if i == '\t':
            new_s += '\\t'
        elif i == '\b':
            new_s += '\\b'
        elif i == '\n':
            new_s += '\\n'
        elif i == '\r':
            new_s += '\\r'
        elif i == '\f':
            new_s += '\\f'
        elif i == '\'':
            new_s += "\\'"
        elif i == '\"':
            new_s += '\\"'
        else:
            new_s += i
    return new_s

Would I need to look at the actual numeric values of the characters and check the next character, for example when I find a '\' (92) followed by an 'n' (110)?

  • Do you have the order of the two strings backwards? Commented Jul 16, 2014 at 17:37
  • consider using str.replace. Commented Jul 16, 2014 at 17:38
  • @TheSoundDefense - no. I am just giving an example of how I would do this the reverse way. Commented Jul 16, 2014 at 17:38
  • Does your string actually contain the three characters '\\n'? Or is it appearing in some escaped form? Commented Jul 16, 2014 at 17:39
  • @hughdbrown, that throws an error Commented Jul 16, 2014 at 17:40

3 Answers

Don't reinvent the wheel here. Python has your back. Besides, handling escape syntax correctly is harder than it looks.

The correct way to handle this

In Python 2, use the str-to-str string_escape codec:

string.decode('string_escape')

This interprets any Python-recognized string escape sequences for you, including \n and \t.

Demo:

>>> string = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
>>> string.decode('string_escape')
'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
>>> print string.decode('string_escape')

    this is a docstring for
    the main function.
    a,
    b,
    c

>>> '\\t\\n\\r\\xa0\\040'.decode('string_escape')
'\t\n\r\xa0 '

In Python 3, you'd have to use codecs.decode() with the unicode_escape codec:

codecs.decode(string, 'unicode_escape')

as there is no str.decode() method and this is not a str -> bytes conversion.

Demo:

>>> import codecs
>>> string = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
>>> codecs.decode(string, 'unicode_escape')
'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
>>> print(codecs.decode(string, 'unicode_escape'))

    this is a docstring for
    the main function.
    a,
    b,
    c

>>> codecs.decode('\\t\\n\\r\\xa0\\040', 'unicode_escape')
'\t\n\r\xa0 '
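
One caveat with the Python 3 form: when the string already contains non-ASCII characters, unicode_escape tends to garble them, because (as far as I know) the codec reads the string's UTF-8 bytes as Latin-1. A commonly used workaround is to encode with the backslashreplace error handler first. The sketch below assumes the input is a str; the helper name unescape_text is made up for this example and is not part of the codecs module:

```python
def unescape_text(s):
    """Interpret backslash escapes in s while keeping non-ASCII text intact.

    Sketch only: characters outside Latin-1 are first rewritten as \\uXXXX
    escapes, so a literal backslash placed directly in front of such a
    character is an edge case this does not handle.
    """
    # Latin-1 characters become single bytes; everything else becomes a
    # \uXXXX (or \UXXXXXXXX) escape that unicode_escape can read back.
    raw = s.encode('latin-1', 'backslashreplace')
    return raw.decode('unicode_escape')


print(repr(unescape_text('caf\u00e9\\n')))
```

For pure ASCII input this should behave the same as the plain codecs.decode() call above.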

Why straightforward str.replace() won't cut it

You could try to do this yourself with str.replace(), but then you also need to implement proper escape parsing. Take '\\\\n' for example (written as a Python literal: a backslash, another backslash and an n); this is '\\n', escaped. If you naively apply str.replace() in sequence, you end up with '\n' or '\\\n' instead:

>>> '\\\\n'.decode('string_escape')
'\\n'
>>> '\\\\n'.replace('\\n', '\n').replace('\\\\', '\\')
'\\\n'
>>> '\\\\n'.replace('\\\\', '\\').replace('\\n', '\n')
'\n'

The \\ pair should be replaced by just one \ character, leaving the n uninterpreted. But with chained replace calls, either the trailing \ is replaced together with the n by a newline character, or the \\ is replaced by \ first and then that \ and the n are replaced by a newline. Either way, you end up with the wrong output.
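
The ordering problem goes away when every escape is resolved in a single left-to-right pass, so that each backslash is consumed exactly once. Here is a minimal Python 3 sketch of that idea using re.sub(); the names _SIMPLE_ESCAPES and unescape_simple are made up for illustration, and only the listed one-letter escapes are covered (no \x, octal or \u forms):

```python
import re

# Hypothetical lookup table: the character after the backslash -> its meaning.
_SIMPLE_ESCAPES = {
    '\\': '\\',
    "'": "'",
    '"': '"',
    'n': '\n',
    'r': '\r',
    't': '\t',
}


def unescape_simple(s):
    """Resolve one-letter escapes in a single pass over s."""
    def replace(match):
        # Unknown escapes are left untouched (backslash included).
        return _SIMPLE_ESCAPES.get(match.group(1), match.group(0))

    # Each match consumes the backslash and the character after it, so an
    # escaped backslash can never pair up with a following 'n'.
    return re.sub(r'\\(.)', replace, s, flags=re.DOTALL)


print(repr(unescape_simple('\\\\n')))
```

This is still no substitute for the codec; it only shows why a single pass gets the escaped backslash right where chained replace() calls do not.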

The slow way to handle this, manually

You'll have to process the characters one by one instead, pulling in more characters as needed:

_map = {
    '\\\\': '\\',
    "\\'": "'",
    '\\"': '"',
    '\\a': '\a',
    '\\b': '\b',
    '\\f': '\f',
    '\\n': '\n',
    '\\r': '\r',
    '\\t': '\t',
}

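def unescape_string(s):
    output = []
    i = 0
    while i < len(s):
        c = s[i]
        i += 1
        if c != '\\' or i >= len(s):
            # ordinary character, or a lone backslash at the very end
            output.append(c)
            continue
        c += s[i]
        i += 1
        if c in _map:
            output.append(_map[c])
            continue
        if c == '\\x' and i <= len(s) - 2:  # hex escape, needs two more characters
            point = int(s[i] + s[i + 1], 16)
            i += 2
            output.append(chr(point))
            continue
        if c == '\\0':  # octal escape starting with 0, at most three digits
            while len(c) < 4 and i < len(s) and s[i] in '01234567':
                c += s[i]
                i += 1
            point = int(c[1:], 8)
            output.append(chr(point))
            continue
        # unrecognised escape: keep the backslash and the character as-is
        output.append(c)
    return ''.join(output)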

This can now handle the standard one-letter escapes, \xhh hex escapes, and octal escapes that start with \0, but not other octal escapes such as \101, \uhhhh Unicode code points, or \N{name} Unicode name references, nor does it handle malformed escapes in quite the same way as Python would.

But it does handle the escaped escape properly:

>>> unescape_string(string)
'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '
>>> unescape_string('\\\\n')
'\\n'

Be aware that this is far slower than using the built-in codec.
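
To get a feel for the difference on your own machine, a rough timing sketch along these lines can help. It is written for Python 3; unescape_loop is a deliberately cut-down stand-in for a character-by-character parser (it is not the function above), and SAMPLE is made-up ASCII test data. Actual numbers will vary by interpreter and input:

```python
import codecs
import timeit

# Made-up ASCII sample: the docstring fragment repeated many times.
SAMPLE = '\\n    this is a docstring for\\n    the main function.\\n' * 100


def unescape_loop(s):
    """Cut-down pure-Python loop; handles only \\n, \\t and \\\\."""
    table = {'n': '\n', 't': '\t', '\\': '\\'}
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if c == '\\' and i + 1 < len(s) and s[i + 1] in table:
            out.append(table[s[i + 1]])
            i += 2
        else:
            out.append(c)
            i += 1
    return ''.join(out)


def unescape_codec(s):
    """The built-in route: let the unicode_escape codec do the parsing."""
    return codecs.decode(s, 'unicode_escape')


if __name__ == '__main__':
    for fn in (unescape_loop, unescape_codec):
        seconds = timeit.timeit(lambda: fn(SAMPLE), number=50)
        print(fn.__name__, round(seconds, 4))
```

Both functions return the same result for this sample, so the comparison is only about speed.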

11 Comments

Not my downvote, but it's probably because the argument for string.decode() is wrong
I'd say it's also over engineered and removes the ability for the asker to learn the fundamentals of something like a "string replace".
@EugeneK: How is this overengineered? The codec exists for just this purpose.
@EugeneK: that's like saying that using a dictionary is over-engineered when the user really should learn how to build a hash table.
@EugeneK: There, added the proper manual way too. Not using str.replace() however.

The simplest solution to this is just to use a str.replace() call:

s = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
s1 = s.replace('\\n','\n')
s1

output

'\n    this is a docstring for\n    the main function.\n    a,\n    b,\n    c\n    '

1 Comment

Is that the actual output? Because that makes it look like your solution didn't actually work. (they should be on separate lines)

def convert_text(text):
    return text.replace("\\n","\n").replace("\\t","\t")


text = '\\n    this is a docstring for\\n    the main function.\\n    a,\\n    b,\\n    c\\n    '
print convert_text(text)

output:

    this is a docstring for
    the main function.
    a,
    b,
    c
