Perl strings encode utf 8

Question

I am reading about Perl's Encode and utf8.

The doc says:

$octets = encode_utf8($string);

Equivalent to

$octets = encode("utf8", $string)

The characters in $string are encoded in Perl's internal format, and the result is returned as a sequence of octets.

I have no idea what this means. Isn't a string in Perl a sequence of octets (i.e. bytes) anyway?

So what is the difference between $string and $octets?

mob · Accepted Answer · 2013-06-20 15:10:34Z

4

No, a string in Perl is a sequence of characters, not necessarily octets. The chr and ord functions (for transforming between integers and single characters), to name two, can deal with integer values larger than 255. For example:

$string = "\x{0421}\x{041F}";
print ord($_)," " for split //, $string;

outputs

1057 1055
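The same point can be seen from the other direction. The following is a minimal sketch (the variable name is only illustrative) that builds the string from integers with chr and checks that length counts characters rather than bytes:

```perl
use strict;
use warnings;

# Build the same two-character string from integer code points with chr.
my $string = chr(0x0421) . chr(0x041F);

# length counts characters, not bytes.
print length($string), "\n";    # 2

# The result is identical to the \x{...} literal used above.
print $string eq "\x{0421}\x{041F}" ? "same\n" : "different\n";    # same
```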

When a string is written to a terminal, file, or other output stream, however, the device receiving it usually expects bytes, and this is where encoding comes in. UTF-8 is a scheme for encoding each Unicode code point in the range 0-0x10FFFF as a sequence of one to four bytes; code points above 0x7F take more than one byte.

$octets = Encode::encode("utf-8", "\x{0421}\x{041F}");
print ord($_)," " for split //, $octets;

Now the output is

208 161 208 159

and the result is a sequence of bytes, suitable to be stored on a filesystem.
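To make the difference between $string and $octets concrete, here is a small sketch using Encode's encode and decode. The variable names are illustrative, and "UTF-8" is used as the encoding name in both directions:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\x{0421}\x{041F}";          # two characters
my $octets = encode("UTF-8", $string);    # the same text as four bytes

print length($string), " characters\n";   # 2 characters
print length($octets), " octets\n";       # 4 octets

# decode is the inverse operation: bytes in, characters out.
my $decoded = decode("UTF-8", $octets);
print $decoded eq $string ? "round trip ok\n" : "mismatch\n";    # round trip ok
```

In practice the encoding step is often delegated to an I/O layer, for example by opening a file with '>:encoding(UTF-8)', so that character strings are encoded as they are written.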

Internally, perl (in all lower case, the name refers to the executable that implements Perl, the programming language) often uses UTF-8 to represent strings with "wide" characters, but this is not something you would normally ever have to worry about.
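As a rough illustration of why the internal format can be ignored, the sketch below uses the built-in utf8::upgrade and utf8::is_utf8 functions to store the same string in two internal forms. The variable names are assumptions made for the example:

```perl
use strict;
use warnings;

my $narrow = "caf\x{E9}";    # every code point is below 256
my $wide   = $narrow;
utf8::upgrade($wide);        # switch this copy to the internal UTF-8 form

# The internal storage differs between the two copies...
my $narrow_flag = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_flag   = utf8::is_utf8($wide)   ? 1 : 0;
print "$narrow_flag $wide_flag\n";    # 0 1

# ...but both hold the same sequence of characters.
print $narrow eq $wide ? "equal\n" : "different\n";    # equal
print length($wide), "\n";                             # 4
```

String operations such as eq and length give the same answers for both copies, which is the sense in which the internal representation is an implementation detail.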


7 Comments

Jim Jim Over a year ago

What does \x{0421} mean?

Dave Cross Over a year ago

\x{0421} is the character whose code point is the hex number 0x421. That's apparently 'CYRILLIC CAPITAL LETTER ES' (see fileformat.info/info/unicode/char/421)

Jim Jim Over a year ago

So why not decode instead of encode here to decode into Perl characters?

mob Over a year ago

"UTF-8 characters" isn't a thing. There are Unicode code points (from 0-0x10FFFF) and there is a UTF-8 encoding that represents all Unicode code points as one or more octets (bytes). When you use substr, chop, split, regular expressions, or any other feature of Perl that acts on a string, you are dealing with characters.

innaM Over a year ago

... and you don't have to care whether those characters originally came from a UTF-8 or latin-1 or whatever source.
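A short sketch of that last point, assuming the input bytes are known to be Latin-1 in one case and UTF-8 in the other (the variable names are illustrative). Once decoded, the two inputs are the same character string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, e-acute (U+00E9), as bytes in two different encodings.
my $latin1_bytes = "\xE9";
my $utf8_bytes   = "\xC3\xA9";

my $from_latin1 = decode("iso-8859-1", $latin1_bytes);
my $from_utf8   = decode("UTF-8",      $utf8_bytes);

# After decoding, the original encoding no longer matters.
print $from_latin1 eq $from_utf8 ? "same character string\n" : "different\n";
printf "U+%04X\n", ord($from_utf8);    # U+00E9
```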
