TL;DR: A UTF-8 encoding can, and likely will, take up less space than Python's auto-selected Latin-1/UCS-2/UCS-4 internal representation for mixed-script text and/or text with emoji, but it can also take up more for text mostly in non-Latin scripts.
UTF-8 is a variable-width encoding, which means it has a disadvantage over fixed-width encodings since it must use some bits to express the variability. Here's how some characters compare:
| character | code | UTF-8 size (bytes) | Python size (bytes/char) |
|---|---|---|---|
| a | U+0061 | 1 | 1 |
| ö | U+00F6 | 2 | 1 |
| ߿ | U+07FF | 2 | 2 |
| 你 | U+4F60 | 3 | 2 |
| 🐍 | U+1F40D | 4 | 4 |
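These per-character numbers are easy to check. A minimal sketch, assuming CPython's flexible string representation (PEP 393), which picks the narrowest per-character width that fits every code point in the string:

```python
# Per-character comparison: UTF-8 bytes vs. CPython's internal width.
# PEP 393 widths: Latin-1 (1 byte) up to U+00FF, UCS-2 (2 bytes) up
# to U+FFFF, UCS-4 (4 bytes) beyond that.

def utf8_size(ch: str) -> int:
    return len(ch.encode("utf-8"))

def cpython_width(ch: str) -> int:
    cp = ord(ch)
    if cp <= 0xFF:
        return 1
    if cp <= 0xFFFF:
        return 2
    return 4

for ch in "aö\u07ff\u4f60\U0001f40d":
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_size(ch)}, Python {cpython_width(ch)}")
```

Note that the "Python size" column is a per-character storage width, and it is a CPython implementation detail, not something guaranteed by the language.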
However, because the encoding is variable-width, UTF-8 only spends extra bytes on the characters that need them. This makes it superior to a fixed-width encoding for text without many high-code-point characters:
| text | UTF-8 size | Python size |
|---|---|---|
| long string all ascii | 21 | 21 |
| long string with latin ö | 25 | 48 |
| long string with n'ko ߿ | 24 | 74 |
| long string with kanji 你 | 26 | 74 |
| long string with emoji 🐍 | 27 | 126 |
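A sketch of how measurements like these can be reproduced. Be aware that `sys.getsizeof` counts the whole string object, header included, and that overhead varies across CPython versions, so the absolute "Python size" numbers on your interpreter may differ from the table:

```python
import sys

samples = [
    "long string all ascii",
    "long string with latin ö",
    "long string with n'ko \u07ff",
    "long string with kanji 你",
    "long string with emoji 🐍",
]

for s in samples:
    utf8 = len(s.encode("utf-8"))
    # getsizeof reports the full object size, per-object overhead included
    print(f"{s!r}: UTF-8 {utf8}, Python {sys.getsizeof(s)}")
```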
So it really depends:
- if you're primarily working with ASCII or Latin-based text, with only an occasional character from anywhere else, then UTF-8 is almost certainly going to be smaller
- if you're primarily working with languages in the U+0800 – U+FFFF range (see the list of Unicode blocks for those languages), then UCS-2 may be better
- if you have any emoji at all, though, they will force the internal representation up to UCS-4 in Python and double the memory size (relative to UCS-2) regardless of what else is there, so UTF-8 would always be better
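To illustrate that last point, a small sketch of how one emoji inflates CPython's internal storage of an otherwise ASCII string, while the UTF-8 encoding grows by only the emoji's four bytes (again, `sys.getsizeof` values include interpreter-dependent overhead):

```python
import sys

ascii_only = "a" * 100
with_emoji = ascii_only + "🐍"

# UTF-8 grows by exactly the 4 bytes the emoji needs: 100 -> 104
print(len(ascii_only.encode("utf-8")), len(with_emoji.encode("utf-8")))

# ...but the single emoji forces the whole string into UCS-4 internally,
# so every character now takes 4 bytes instead of 1.
print(sys.getsizeof(ascii_only), sys.getsizeof(with_emoji))
```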
So, from that, UTF-8 is generally better for unknown text, since it handles mixed-script content decently well. You would only prefer a fixed-width encoding if you are confident about what the text will contain.
As a final note, there are more ways to read a file in Python than just into a string; see Unicode (UTF-8) reading and writing to files in Python, for example.
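For completeness, a minimal sketch of reading and writing a file with an explicit UTF-8 encoding (the file path here is a throwaway, just for illustration):

```python
import tempfile
from pathlib import Path

# A hypothetical throwaway file, purely for demonstration
path = Path(tempfile.mkdtemp()) / "example.txt"

# Pass encoding= explicitly rather than relying on the platform default
path.write_text("snake: 🐍\n", encoding="utf-8")
text = path.read_text(encoding="utf-8")
print(text)
```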
I'm using "emojis" as a stand-in for anything at or above U+10000; there are many more scripts up there beyond just emoji.