
I'm having trouble getting a replace() to work

I've tried my_string.replace('\\', '') and re.sub('\\', '', my_string), but neither one works.

I thought \ was the escape code for backslash, am I wrong?

The string in question looks like

'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'

or, as printed with print my_string:

<2011315123.04C6DACE618A7C2763810@???ꂩ?猩???邾?낤>

Yes, it's supposed to look like garbage, but I'd rather get '<2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>'

  • Related: stackoverflow.com/questions/92438/… Commented Apr 24, 2011 at 0:56
  • That doesn't really help. I want my string to contain only ASCII characters, but I don't want to completely strip out the non-ASCII characters, just turn them into ASCII literals. Commented Apr 24, 2011 at 1:04
  • I want the ASCII because it GREATLY simplifies the regex search string I can use. I can check for \@[\w\.]+\ and be done with it, because I know that if I get a ']', '>', ' ' or anything of the sort, my domain name is finished. Commented Apr 25, 2011 at 7:54

2 Answers


You don't have any backslashes in your string. What you don't have, you can't remove.

Consider what you are showing as '\x82' ... this is a one-byte string.

>>> s = '\x82'
>>> len(s)
1
>>> ord(s)
130
>>> hex(ord(s))
'0x82'
>>> print s
é # my sys.stdout.encoding is 'cp850'
>>> print repr(s)
'\x82'
>>>

What you'd "rather get" ('x82') is meaningless.
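The point can be checked directly (Python 3 syntax here; the escape behaves the same way):

```python
# '\x82' is a single character with code 130; the backslash only exists in repr().
s = '\x82'
print(len(s))                    # 1: one character, not four
print('\\' in s)                 # False: there is no literal backslash to remove
print('\\' in repr(s))           # True: repr() is what introduces the backslash
print(s.replace('\\', '') == s)  # True: replace() finds nothing and returns s unchanged
```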

Update: The "non-ASCII" part of the string (bounded by @ and >) is actually Japanese text, written mostly in hiragana and encoded using shift_jis. Transcript of an IDLE session:

>>> y = '\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'
>>> print y.decode('shift_jis')
これから見えるだろう

Google Translate produces "Can not you see the future" as the English translation.
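The same decode works in Python 3 if the raw bytes are written as a bytes literal (a sketch of the transcript above):

```python
# The bytes between '@' and '>' in the question, decoded as shift_jis.
y = b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'
text = y.decode('shift_jis')
print(text)  # これから見えるだろう
```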

In a comment on another answer, you say:

I just need ascii

and

What I'm doing with it is seeing how far apart the two strings are using nltk.edit_distance(), so this will give me a multiple of the true distance. Which is good enough for me.

Why do you think you need ASCII? Edit distance is defined quite independently of any alphabet.

For a start, doing nonsensical transformations of your strings won't give you a consistent or predictable multiple of the true distance. Secondly, out of the following:

x
repr(x)
repr(x).replace('\\', '')
repr(x).replace('\\x', '') # if \ is noise, so is x
x.decode(whatever_the_encoding_is)

why do you choose the third?

Update 2, in response to comments:

(1) You still haven't said why you think you need "ascii". nltk.edit_distance doesn't require "ascii" -- the args are said to be "strings" (whatever that means) but the code will work with any 2 sequences of objects for which != works. In other words, why not just use the first of the above 5 options?

(2) Accepting up to 100% inflation of the edit distance is somewhat astonishing. Note that your currently chosen method will use 4 symbols (hex digits) per Japanese character. repr(x) uses 8 symbols per character. x (the first option) uses 2.
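The symbol counts in (2) can be tallied per two-byte character. (In Python 2, the repr body of '\x82\xb1' is the eight symbols \x82\xb1; the escaped string is spelled out below so the sketch also runs in Python 3.)

```python
x = '\x82\xb1'                                # one Japanese character, 2 raw bytes
escaped = '\\x82\\xb1'                        # its Python-2 repr body: 8 symbols
no_backslash = escaped.replace('\\', '')      # 'x82xb1': 6 symbols
no_backslash_x = escaped.replace('\\x', '')   # '82b1': 4 symbols (hex digits only)
print(len(x), len(escaped), len(no_backslash), len(no_backslash_x))  # 2 8 6 4
```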

(3) You can mitigate the inflation effect by normalising your edit distance. Instead of comparing distance(s1, s2) with a number_of_symbols threshold, compare distance(s1, s2) / float(max(len(s1), len(s2))) with a fraction threshold. Note normalisation is usually used anyway ... the rationale being that the dissimilarity between 20-symbol strings with an edit distance of 4 is about the same as that between 10-symbol strings with an edit distance of 2, not twice as much.
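A sketch of that normalisation, with dist standing in for a raw edit distance already computed by whatever function you use:

```python
def normalized(dist, s1, s2):
    """Scale a raw edit distance by the longer string's length."""
    longest = max(len(s1), len(s2))
    return dist / float(longest) if longest else 0.0

# Distance 4 over 20-symbol strings vs distance 2 over 10-symbol strings:
# equally dissimilar once normalised.
print(normalized(4, 'a' * 20, 'b' * 20))  # 0.2
print(normalized(2, 'a' * 10, 'b' * 10))  # 0.2
```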

(4) nltk.edit_distance is the most shockingly inefficient pure-Python implementation of edit_distance that I've ever seen. This implementation by Magnus Lie Hetland is much better, but still capable of improvement.
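A standard two-row dynamic-programming Levenshtein, in the spirit of the Hetland version linked above (a sketch, not his exact code). Note it works on any two sequences whose elements support !=, per point (1):

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of O(min(len(a), len(b))) space."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence as the inner dimension
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance('kitten', 'sitting'))  # 3
```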


8 Comments

Yeah, I figured that out after pulling it up in a texteditor. I was getting repr and print representations of the character. Thanks.
@Joshua Olson: The first edition of my answer answered your question correctly. The fact that you want to do something else has nothing to do with whether you should accept my answer.
The problem is that I don't know what the encoding is (spam messages, which are the source of these strings, often aren't well formed), and I need some representation of them to compare in edit_distance. (Yes, the x is garbage too; I ended up stripping out both the \ and the x, keeping just the hex of each character — your 4th example.) If I have a string of hex digits, I can compare their distance just as well as if I were using the decoded string. If you know of a way of identifying the encoding from a handful of characters that's as straightforward as repr(x).replace('\\x', ''), then I'd use it.
I've accepted your answer since it now covers the explanation and what I was looking for. I wish there were a better solution, but without knowing the encoding I'm stuck with doing it this way. Some of my data doesn't even have a domain name and that's causing me all kinds of other headaches as far as how to handle it without throwing my numbers off completely.
Taking the hex values (minus the \x) of the characters should give me an edit distance between 1.0 and 2.0 times the true edit distance, especially when both strings are transformed in this way. Yes, using '\\' instead of '\\x' wouldn't make as much sense, but it wouldn't do much harm either, since both strings would be transformed in the same way.

This works, I think, if you really want to just strip the "\":

>>> a = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
>>> repr(a).replace("\\","")[1:-1]
'<2011315123.04C6DACE618A7C2763810@x82xb1x82xeax82xa9x82xe7x8cxa9x82xa6x82xe9x82xbex82xebx82xa4>'
>>> 

But as the answer above explains, what you get is pretty much meaningless.
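If the goal really is the plain-hex rendering shown in the question ('82b1…', with no backslashes or x's), a sketch that skips repr() entirely and hex-encodes only the non-ASCII characters (shortened string for illustration):

```python
a = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea>'
# Keep ASCII characters as-is; replace each non-ASCII character
# with its two-digit hex code.
hexed = ''.join(c if ord(c) < 128 else format(ord(c), '02x') for c in a)
print(hexed)  # <2011315123.04C6DACE618A7C2763810@82b182ea>
```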

3 Comments

Well, sometimes there is a good reason someone wants to do something that I can't come up with. I just offered a solution with a warning...
Wait. That might be the exact solution I'm looking for. I know it's nonsense, but I just need ascii that I can parse in a consistent way with another part of the same string (From and Message-ID fields of spam messages). What I'm doing with it is seeing how far apart the two strings are using nltk.edit_distance(), so this will give me a multiple of the true distance. Which is good enough for me.
What is "this" that amuses you so much?
