I am reading about Perl's Encode and utf8.

The doc says:

$octets = encode_utf8($string);

Equivalent to

$octets = encode("utf8", $string) .

The characters in $string are encoded in Perl's internal format, and the result is returned as a sequence of octets.

I have no idea what this means. Isn't a string in Perl a sequence of octets (i.e. bytes) anyway?

So what is the difference between:

$string and $octets?

1 Answer

No, a string in Perl is a sequence of characters, not necessarily octets. The chr and ord functions (which convert between integers and single characters), to name two, can handle integer values larger than 255. For example:

$string = "\x{0421}\x{041F}";
print ord($_)," " for split //, $string;

outputs

1057 1055

When a string is written to a terminal, file, or other output stream, however, the receiving device usually expects bytes, and this is where encoding comes in. As you have seen, UTF-8 is a scheme for encoding a single value in the range 0x80-0x10FFFF into multiple bytes (values in the range 0x00-0x7F are encoded as a single byte).
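To make the byte counts concrete, here is a minimal sketch that encodes one code point from each UTF-8 length class and prints the resulting bytes. The four code points and the %byte_count hash are just picked for illustration; they are not from the Encode documentation.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative code points, one per UTF-8 length class (1 to 4 bytes).
my @code_points = (0x41, 0x421, 0x20AC, 0x1F600);

my %byte_count;    # code point => number of bytes in its UTF-8 encoding
for my $cp (@code_points) {
    my $octets = encode("UTF-8", chr($cp));
    $byte_count{$cp} = length($octets);
    printf "U+%04X -> %d byte(s): %s\n", $cp, length($octets),
        join(" ", map { sprintf "%02X", ord($_) } split //, $octets);
}
```

Only hex digits are printed, so the script does not depend on how the terminal itself is configured.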

use Encode;  # loads the module that provides Encode::encode
$octets = Encode::encode("utf-8", "\x{0421}\x{041F}");
print ord($_)," " for split //, $octets;

Now the output is

208 161 208 159

and the result is suitable for writing to a file or any other byte stream.
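Going the other way is what decode is for: it turns octets read from the outside world back into Perl characters. A small sketch of the round trip, with variable names of my own choosing:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $string = "\x{0421}\x{041F}";          # two characters
my $octets = encode("UTF-8", $string);    # four bytes, ready for output
my $back   = decode("UTF-8", $octets);    # two characters again

printf "characters: %d, octets: %d, round trip ok: %s\n",
    length($string), length($octets), ($back eq $string ? "yes" : "no");
```

So encode is applied on the way out of the program and decode on the way in; in between, string functions such as length count characters.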

Internally, perl (all lower case; this refers to the executable that implements Perl, the programming language) often uses UTF-8 to represent strings with "wide" characters, but this is not something you would ever normally have to worry about.
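If you are curious, the following sketch shows why the internal format rarely matters: the same string stored both ways still compares equal and has the same length. It uses the built-in utf8::upgrade, utf8::downgrade and utf8::is_utf8 functions purely for inspection; ordinary code should not need them, and the variable names are mine.

```perl
use strict;
use warnings;

my $native = "caf\x{E9}";      # 4 characters, all below 0x100
utf8::downgrade($native);      # make sure it is stored as one byte per character

my $upgraded = $native;
utf8::upgrade($upgraded);      # same characters, now stored internally as UTF-8

printf "equal: %s, lengths: %d and %d\n",
    ($native eq $upgraded ? "yes" : "no"), length($native), length($upgraded);
printf "internal UTF-8 flag: %d and %d\n",
    (utf8::is_utf8($native) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
```

The flag differs between the two copies, but nothing that operates on characters can tell them apart.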

Comments

What does \x{0421} mean?
\x{0421} is the character whose code point is the hex number 0x421. That's apparently 'CYRILLIC CAPITAL LETTER ES' (see fileformat.info/info/unicode/char/421)
So why not use decode instead of encode here, to decode into Perl characters?
"UTF-8 characters" isn't a thing. There are Unicode code points (from 0-0x10FFFF) and there is a UTF-8 encoding that represents all Unicode code points as one or more octets (bytes). When you use substr, chop, split, regular expressions, or any other feature of Perl that acts on a string, you are dealing with characters.
... and you don't have to care whether those characters originally came from a UTF-8 or latin-1 or whatever source.
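A quick way to see this, as a sketch with byte values I picked for illustration: the same character arrives as different bytes depending on the source encoding, but after decode the two strings are identical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# U+00E9 (e with acute) as two different encodings would store it.
my $latin1_bytes = "\xE9";        # ISO-8859-1: one byte
my $utf8_bytes   = "\xC3\xA9";    # UTF-8: two bytes

my $from_latin1 = decode("ISO-8859-1", $latin1_bytes);
my $from_utf8   = decode("UTF-8",      $utf8_bytes);

printf "equal: %s, code point: U+%04X\n",
    ($from_latin1 eq $from_utf8 ? "yes" : "no"), ord($from_utf8);
```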