TL;DR: A UTF-8 encoding can, and likely will, take up less space than Python's auto-selected Latin-1/UCS-2/UCS-4 internal representation for mixed-script text and/or text with emoji, but it can also take up more for text mostly in non-Latin scripts.
UTF-8 is a variable-width encoding, which means it has a disadvantage over fixed-width encodings since it must use some bits to express the variability. Here's how some characters compare:
| character | code | UTF-8 size (bytes) | Python size (bytes/char) |
|---|---|---|---|
| a | U+0061 | 1 | 1 |
| ö | U+00F6 | 2 | 1 |
| ߿ | U+07FF | 2 | 2 |
| 你 | U+4F60 | 3 | 2 |
| 🐍 | U+1F40D | 4 | 4 |
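These per-character numbers are easy to check. A minimal sketch, assuming CPython's flexible string representation (PEP 393), which picks the narrowest per-character width that fits every code point in the string:

```python
# Per-character comparison: UTF-8 bytes vs. CPython's internal width.
# PEP 393 widths: Latin-1 (1 byte) up to U+00FF, UCS-2 (2 bytes) up
# to U+FFFF, UCS-4 (4 bytes) beyond that.

def utf8_size(ch: str) -> int:
    return len(ch.encode("utf-8"))

def cpython_width(ch: str) -> int:
    cp = ord(ch)
    if cp <= 0xFF:
        return 1
    if cp <= 0xFFFF:
        return 2
    return 4

for ch in "aö\u07ff\u4f60\U0001f40d":
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_size(ch)}, Python {cpython_width(ch)}")
```

Note that the "Python size" column is a per-character storage width, and it is a CPython implementation detail, not something guaranteed by the language.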
However, because the encoding is variable-width, UTF-8 only spends extra bytes on the characters that need them. This makes it superior to a fixed-width encoding for text without many high-code-point characters:
| text | UTF-8 size | Python size |
|---|---|---|
| long string all ascii | 21 | 21 |
| long string with latin ö | 25 | 48 |
| long string with n'ko ߿ | 24 | 74 |
| long string with kanji 你 | 26 | 74 |
| long string with emoji 🐍 | 27 | 126 |
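A sketch of how measurements like these can be reproduced. Be aware that `sys.getsizeof` counts the whole string object, header included, and that overhead varies across CPython versions, so the absolute "Python size" numbers on your interpreter may differ from the table:

```python
import sys

samples = [
    "long string all ascii",
    "long string with latin ö",
    "long string with n'ko \u07ff",
    "long string with kanji 你",
    "long string with emoji 🐍",
]

for s in samples:
    utf8 = len(s.encode("utf-8"))
    # getsizeof reports the full object size, per-object overhead included
    print(f"{s!r}: UTF-8 {utf8}, Python {sys.getsizeof(s)}")
```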
So it really depends:
- if you're primarily working with ASCII or Latin-based text, with only an occasional character from anywhere else, then UTF-8 is almost certainly going to be smaller
- if you're primarily working with languages in the U+0800 – U+FFFF range (see the list of Unicode blocks for those languages), then UCS-2 may be better
- if you have any emoji at all, though, they will force the internal representation up to UCS-4 in Python and double the memory size (relative to UCS-2) regardless of what else is there, so UTF-8 would always be better
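To illustrate that last point, a small sketch of how one emoji inflates CPython's internal storage of an otherwise ASCII string, while the UTF-8 encoding grows by only the emoji's four bytes (again, `sys.getsizeof` values include interpreter-dependent overhead):

```python
import sys

ascii_only = "a" * 100
with_emoji = ascii_only + "🐍"

# UTF-8 grows by exactly the 4 bytes the emoji needs: 100 -> 104
print(len(ascii_only.encode("utf-8")), len(with_emoji.encode("utf-8")))

# ...but the single emoji forces the whole string into UCS-4 internally,
# so every character now takes 4 bytes instead of 1.
print(sys.getsizeof(ascii_only), sys.getsizeof(with_emoji))
```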
So, from that, UTF-8 is generally better for unknown text, since it handles mixed-script content decently well. You would only prefer a fixed-width encoding if you are confident about what the text will contain.
As a final note, there are more ways to read a file in Python than just into a string; see Unicode (UTF-8) reading and writing to files in Python, for example.
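For completeness, a minimal sketch of reading and writing a file with an explicit UTF-8 encoding (the file path here is a throwaway, just for illustration):

```python
import tempfile
from pathlib import Path

# A hypothetical throwaway file, purely for demonstration
path = Path(tempfile.mkdtemp()) / "example.txt"

# Pass encoding= explicitly rather than relying on the platform default
path.write_text("snake: 🐍\n", encoding="utf-8")
text = path.read_text(encoding="utf-8")
print(text)
```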
I'm using "emojis" as a stand-in for anything at or above U+10000; there are many more scripts up there beyond just emoji.