-1

In https://stackoverflow.com/a/24542608/6629672 I found that Rust uses variable-length encoding for strings, which is why you can't easily index them. Python, on the other hand, uses a fixed length for its strings. Does this mean that Rust can use less memory when reading a file with non-ASCII UTF-8 characters? Do you have an example?

Relevant: https://rushter.com/blog/python-strings-and-memory/

1
  • I guess yes, but is that your only question? Commented Sep 4, 2023 at 17:59

2 Answers

3

TL;DR: UTF-8 can, and likely will, take up less space than Python's auto-detected Latin-1/UCS-2/UCS-4 encoding for mixed-script text or text containing emojis, but it can also take up more for text that is mostly in non-Latin scripts.


UTF-8 is a variable-width encoding, which puts it at a disadvantage against fixed-width encodings because it must spend some bits marking how long each character is. Here's how some characters compare:

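| Character | Code point | UTF-8 size (bytes) | Python size (bytes) |
|-----------|------------|--------------------|---------------------|
| a         | U+0061     | 1                  | 1                   |
| ö         | U+00F6     | 2                  | 1                   |
| ߿         | U+07FF     | 2                  | 2                   |
| 你        | U+4F60     | 3                  | 2                   |
| 🐍        | U+1F40D    | 4                  | 4                   |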
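The per-character figures above can be reproduced with a short sketch. It assumes CPython 3.3 or later, where the storage width of a string is chosen from its highest code point (PEP 393). The helper names `utf8_len` and `pep393_width` are made up for this illustration, and `pep393_width` only models the width rule rather than inspecting the interpreter.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def pep393_width(text: str) -> int:
    """Bytes per code point CPython is assumed to pick for the whole string.

    Models the PEP 393 rule: Latin-1 (1), UCS-2 (2) or UCS-4 (4),
    decided by the highest code point present.
    """
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        return 1
    if top <= 0xFFFF:
        return 2
    return 4


chars = ["a", "ö", "\u07ff", "你", "🐍"]
for ch in chars:
    print(f"U+{ord(ch):04X}  utf8={utf8_len(ch)}  python={pep393_width(ch)}")
```

Because the width is decided per string, the "Python size" of a character only applies when that character is the widest one in the string.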

However, because the encoding is variable-width, UTF-8 only spends extra bytes on the characters that need them. This makes it superior to a fixed-width encoding for text that doesn't contain many high code points:

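| Text                      | UTF-8 size (bytes) | Python size (bytes) |
|---------------------------|--------------------|---------------------|
| long string all ascii     | 21                 | 21                  |
| long string with latin ö  | 25                 | 48                  |
| long string with n'ko ߿   | 24                 | 74                  |
| long string with kanji 你 | 26                 | 74                  |
| long string with emoji 🐍 | 27                 | 126                 |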
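A comparison like this can be measured with the sketch below. It assumes CPython, and it uses `sys.getsizeof`, which reports the whole string object including a header whose size varies between interpreter versions. The absolute "object" numbers it prints may therefore differ from the table, while the UTF-8 byte counts should not. The helper names are hypothetical.

```python
import sys

samples = [
    "long string all ascii",
    "long string with latin ö",
    "long string with n'ko \u07ff",
    "long string with kanji 你",
    "long string with emoji 🐍",
]


def utf8_size(text: str) -> int:
    """Bytes needed to store the text as UTF-8."""
    return len(text.encode("utf-8"))


def object_size(text: str) -> int:
    """Size of the str object as reported by the interpreter (header included)."""
    return sys.getsizeof(text)


rows = [(text, utf8_size(text), object_size(text)) for text in samples]
for text, utf8_bytes, obj_bytes in rows:
    print(f"{utf8_bytes:>3} UTF-8 bytes, {obj_bytes:>4} object bytes: {text!r}")
```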

So it really depends:

  • if you're primarily working with ASCII or Latin-based text with some likelihood of anything else, then UTF-8 is almost certainly going to be smaller
  • if you're primarily working with languages whose characters fall in the U+0800 - U+FFFF range (see the set of Unicode blocks for those languages), then UCS-2 may be better
  • if you have any emojis at all, though, Python inflates the whole string to UCS-4, at least doubling its memory use regardless of what else it contains, so UTF-8 would always be better

So from that, UTF-8 is generally better for unknown text since it handles mixed-script text decently well. You would only prefer a fixed-width encoding if you are confident about what the text will contain.
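The three cases can be illustrated with a rough sketch. It assumes the fixed-width payload is simply the character count times 1, 2 or 4 bytes, chosen from the highest code point as CPython 3.3+ does, and it ignores any per-object header. `fixed_width_bytes` is a hypothetical helper, not a standard function.

```python
def fixed_width_bytes(text: str) -> int:
    """Payload size if every character uses the width of the widest one.

    Assumption: models CPython's Latin-1 / UCS-2 / UCS-4 choice only;
    object headers are ignored.
    """
    top = max(map(ord, text), default=0)
    width = 1 if top <= 0xFF else 2 if top <= 0xFFFF else 4
    return len(text) * width


cases = {
    "ascii + one emoji": "a" * 1000 + "🐍",
    "only kanji": "你" * 1000,
    "kanji + one emoji": "你" * 1000 + "🐍",
}

results = {}
for name, text in cases.items():
    results[name] = (len(text.encode("utf-8")), fixed_width_bytes(text))
    print(name, "utf-8:", results[name][0], "fixed-width:", results[name][1])
```

In this sketch a single emoji makes the fixed-width form the larger one in both mixed cases, while the pure kanji text is the one case where the 2-byte form wins.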

As a final note, there are more ways to read a file in Python than just into a string; see "Unicode (UTF-8) reading and writing to files in Python" for example.
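As one hedged example of such an alternative, a file can be read in binary mode so the data stays as UTF-8 bytes and is only decoded when a `str` is actually needed. The file name and sample text below are made up for the illustration.

```python
import os
import tempfile

text = "long string with emoji 🐍"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the sample as UTF-8 so there is something to read back.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: no decoding, the data stays as raw UTF-8 bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: the bytes are decoded into a Python str.
    with open(path, encoding="utf-8") as f:
        decoded = f.read()

print(len(raw), "bytes on disk and in the bytes object")
print(len(decoded), "code points in the decoded str")
```

Keeping the `bytes` object avoids the fixed-width expansion, at the cost of losing direct indexing by character.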

I'm using "emojis" as a stand-in for anything at or above U+10000; there are many more scripts and symbols in that range than just emojis.


1

It depends on the ratio of high code points to low code points. If most code points in your strings fall in the ranges [U+0000, U+07FF] and [U+10000, U+10FFFF], then variable-length encoding is likely to save space compared to a 2-byte encoding. Variable-length encoding never takes more space than a 4-byte encoding, and takes less as soon as any code point is below U+10000. As a rule of thumb, a 2-byte encoding only saves space when more than 50% of code points are in the range [U+0800, U+FFFF].
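The 50% rule of thumb can be checked with a small sketch. It uses Python's built-in `utf-16-le` and `utf-32-le` codecs as stand-ins for the 2-byte and 4-byte encodings. That is an assumption, since UTF-16 itself switches to 4 bytes for code points at or above U+10000. The sample texts are invented, and `encoded_sizes` is a hypothetical helper.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the text under each encoding (no BOM)."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


# 40% of code points in [U+0800, U+FFFF]: UTF-8 should be smaller.
mostly_ascii = "a" * 60 + "你" * 40
# 60% of code points in [U+0800, U+FFFF]: the 2-byte form should be smaller.
mostly_kanji = "a" * 40 + "你" * 60

for label, text in [("40% kanji", mostly_ascii), ("60% kanji", mostly_kanji)]:
    print(label, encoded_sizes(text))
```

The threshold sits at 50% here only because the remaining code points are ASCII. Code points in [U+0080, U+07FF] cost 2 bytes either way and do not shift the balance.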

7 Comments

Python has not used fixed-length encoding since Python 3.3 or so, it uses a variable-fixed length: the string will be encoded in latin1, UCS2, or UCS4 depending on requirements. This rather changes the computation, as low codepoint strings will be quite efficient (especially western texts which always fit in latin1) but a single astral codepoint will blow up memory use (say, an emoji).
@Masklinn what you describe is fixed length encoding.
No. By definition a fixed length encoding has a fixed length for every input symbol. That is not the case of cpython with FSR (and pypy uses utf8 with indexes).
@Masklinn CPython since 3.3 uses latin1, UCS2, or UCS4 depending on the requirement. As you said. All three are fixed-length encodings. The length of each codepoint is 1, 2 or 4 bytes respectively, and is fixed for a single string.
@user2722968 mate I literally referred to the feature multiple times, I know what cpython does. "the length of each codepoint is 1, 2 or 4 bytes" does that sound fixed to you?