I have a really crappy file full of unicode bytes that I'm trying to clean up. Some examples from the file are as follows:

ブラック
roler coaster
digital social party
big bellie
cornacopia
\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f \xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0

Now, what I'd like to do is convert all those ugly escaped bytes into real unicode text. So, the above would be output as:

ブラック
roler coaster
digital social party
big bellie
cornacopia
зубная щетка

I've been banging my head against how to do this in Perl for like an hour now, and I'm out of good ideas. If you have one, I'd love to hear it.

  • What do you mean by "unicode bytes"? Does the line following "cornacopia" (it's spelled "cornucopia", BTW) actually contain backslash characters? What kind of "real unicode text" do you want to produce (UTF-8? UTF-16? Something else?) Commented Jan 20, 2012 at 21:25
  • Yes, it has the backslashes. I pasted exactly what's in the file. That's also why "cornacopia" is misspelled. I just want to convert it to UTF-8. Commented Jan 20, 2012 at 21:30
  • Encode::Escape, String::Escape - stackoverflow.com/questions/8740106/… stackoverflow.com/questions/2660123/… Commented Jan 21, 2012 at 9:06

1 Answer

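It's UTF-8. Each \xNN escape names one byte of UTF-8-encoded text, so turn every escape into the byte it names, then decode the resulting bytes as UTF-8: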

$ perl -E'
    use open ":std", ":locale";    # encode output for the terminal locale
    use Encode qw( decode );
    $_ = q{\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f }.
         q{\xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0};
    s/\\x([0-9A-Fa-f]{2})/chr hex $1/eg;    # each \xNN escape becomes one byte
    $_ = decode("UTF-8", $_);               # bytes to characters
    say;
'
зубная щетка
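
To run the same idea over a whole file where some lines are already plain UTF-8 and others are escaped, something like the following might work. It is only a sketch: `unescape_utf8` is a made-up helper name, and the `@raw` list stands in for lines read from a file without any decoding layer. It assumes the non-escaped lines really are valid UTF-8; with `decode`'s default settings, malformed bytes come out as U+FFFD rather than raising an error.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: replace literal \xNN escapes in a byte string with
# the bytes they name, then decode the whole string as UTF-8.
sub unescape_utf8 {
    my ($bytes) = @_;
    $bytes =~ s/\\x([0-9A-Fa-f]{2})/chr hex $1/eg;
    return decode("UTF-8", $bytes);
}

# Without "use utf8", these literals hold raw UTF-8 bytes, which mimics
# lines read from a file handle that has no decoding layer.
my @raw = (
    "ブラック",
    "roler coaster",
    '\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f '
      . '\xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0',
);

# Encode characters as UTF-8 on the way out.
binmode STDOUT, ":encoding(UTF-8)";
print unescape_utf8($_), "\n" for @raw;
```

For a real file, the lines would come from a handle opened with `"<:raw"` instead of `@raw`, so that the bytes already in the file and the bytes produced from the escapes are decoded together in one step.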