How to Make Perl Properly Interpret Unicode Bytes?

Question

I have a really crappy file full of escaped Unicode bytes that I'm trying to clean up. Some examples from the file are as follows:

ブラック
roler coaster
digital social party
big bellie
cornacopia
\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f \xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0

Now, what I'd like to do is convert all those ugly \xNN byte escapes into real Unicode text. So, the above would be output as:

ブラック
roler coaster
digital social party
big bellie
cornacopia
зубная щетка

I've been banging my head against how to do this in Perl for like an hour now, and I'm out of good ideas. If you have one, I'd love to hear it.

What do you mean by "unicode bytes"? Does the line following "cornacopia" (it's spelled "cornucopia", BTW) actually contain backslash characters? What kind of "real unicode text" do you want to produce (UTF-8? UTF-16? Something else?)
– Keith Thompson, Commented Jan 20, 2012 at 21:25
Yes, it has the backslashes. I pasted exactly what's in the file. That's also why "cornacopia" is misspelled. I just want to convert it to UTF-8.
– Eli, Commented Jan 20, 2012 at 21:30
Encode::Escape, String::Escape - stackoverflow.com/questions/8740106/… stackoverflow.com/questions/2660123/…
– daxim, Commented Jan 21, 2012 at 9:06

ikegami · Accepted Answer · 2012-01-20 21:29:54Z

The escaped bytes are UTF-8. Convert each \xNN escape to the byte it names, then decode the result as UTF-8:

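$ perl -E'
    use open ":std", ":locale";   # encode output for the terminal locale
    use Encode qw( decode );
    $_ = q{\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f }.
         q{\xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0};
    # Turn each literal \xNN escape into the byte it names.
    s/\\x([0-9A-Fa-f]{2})/chr hex $1/eg;
    # Decode the resulting UTF-8 bytes into characters.
    $_ = decode("UTF-8", $_);
    say;
'
зубная щетка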
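
To apply the same two steps to more than one line, they can be wrapped in a small function. The following is only a sketch: the name `unescape_utf8_line` is hypothetical and not part of the answer above, and it assumes every `\xNN` sequence in the input is an escaped UTF-8 byte rather than literal text.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper, not part of the answer above: takes one line of raw
# bytes and returns a decoded character string.
sub unescape_utf8_line {
    my ($bytes) = @_;
    # Replace each literal \xNN (backslash, "x", two hex digits) with the
    # single byte it names.
    $bytes =~ s/\\x([0-9A-Fa-f]{2})/chr hex $1/ge;
    # The line now holds plain UTF-8 bytes; decode them into characters.
    return decode('UTF-8', $bytes);
}

# Encode characters back to UTF-8 on output.
binmode STDOUT, ':encoding(UTF-8)';

# Two sample lines from the question, single-quoted so the backslashes
# stay literal.
my @samples = (
    'roler coaster',
    '\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f \xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0',
);
print unescape_utf8_line($_), "\n" for @samples;
```

For a real file, the same function can be called on each line read from a filehandle opened without an input encoding layer, so the lines arrive as raw bytes. Lines that already contain UTF-8 text, such as ブラック, are then decoded together with the unescaped bytes and come out unchanged.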
