29

I am working with Amazon S3 uploads and am having trouble with key names being too long. S3 limits key length by bytes, not characters.

From the docs:

The name for a key is a sequence of Unicode characters whose UTF-8 encoding is at most 1024 bytes long.

I also embed metadata in the file name, so I need to calculate the byte length of the string in Python to make sure the metadata does not push the key over the limit (in which case I would have to fall back to a separate metadata file).

How can I determine the byte length of a UTF-8-encoded string? Again, I am not interested in the character length, but in the actual number of bytes used to store the string.
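
To make the problem concrete: len() reports characters, not the bytes S3 counts against the 1024-byte limit. The key below is just a made-up example of the kind of thing I am building:

key = u"reports/2014/José-Ω;rev=3.csv"  # made-up key with embedded metadata
print(len(key))  # 29 -- the character count, not the byte count S3 enforces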

3 Answers

44
def utf8len(s):
    return len(s.encode('utf-8'))

Works fine in Python 2 and 3.
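
Applied to the S3 limit from the question, a quick check might look like this (the 1024-byte figure is the one quoted from the docs above; the key is just a made-up example):

MAX_S3_KEY_BYTES = 1024  # limit quoted from the S3 docs

def key_fits(key):
    # True if the key's UTF-8 encoding fits within the limit
    return utf8len(key) <= MAX_S3_KEY_BYTES

print(key_fits(u"reports/2024/José-Ω.csv"))  # True: well under 1024 bytes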


1 Comment

Thanks. I also found a page that shows how to do it in several languages: rosettacode.org/wiki/String_length#Byte_Length
13

Use the string's encode() method to convert the character string to a byte string, then use len() as usual:

>>> s = u"¡Hola, mundo!"                                                      
>>> len(s)                                                                    
13 # characters                                                                             
>>> len(s.encode('utf-8'))   
14 # bytes

1 Comment

Please don't use str as a variable name! It will cause no end of grief.
8

Encoding the string and calling len on the result works great, as the other answers show. It does build a throwaway copy of the string, which might matter for very large strings (though I wouldn't call 1024 bytes large). The structure of UTF-8 also lets you compute the length of each character from its code point without encoding anything, although encoding one character at a time is arguably simpler. I present both methods below; they should give the same result.

def utf8_char_len_1(c):
    # UTF-8 byte length from the code point alone: 1 byte up to U+007F,
    # 2 up to U+07FF, 3 up to U+FFFF, 4 up to U+10FFFF.
    codepoint = ord(c)
    if codepoint <= 0x7f:
        return 1
    if codepoint <= 0x7ff:
        return 2
    if codepoint <= 0xffff:
        return 3
    if codepoint <= 0x10ffff:
        return 4
    raise ValueError('Invalid Unicode character: ' + hex(codepoint))

def utf8_char_len_2(c):
    # Encode a single character at a time and count its bytes.
    return len(c.encode('utf-8'))

utf8_char_len = utf8_char_len_1

def utf8len(s):
    # Sum the per-character byte lengths; no full encoded copy is built.
    return sum(utf8_char_len(c) for c in s)
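
If performance matters, you can time both approaches yourself; a rough sketch (the test string and repeat count here are arbitrary choices):

import timeit

s = u"\u00e9" * 1000  # 1000 two-byte characters, about 2 KB when encoded

# compare encode-and-len against the per-character sum defined above
print(timeit.timeit(lambda: len(s.encode('utf-8')), number=10000))
print(timeit.timeit(lambda: utf8len(s), number=10000))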

3 Comments

Note that in exchange for not making a copy, this takes about 180x as long as len(s.encode('utf-8')), at least on my Python 3.3.2 with a string of 1000 UTF-8 characters generated from the code here. (It would presumably be of comparable speed if you wrote the same algorithm in C.)
@Dougal, thanks for running the test. That's useful information, essential for evaluating possible solutions. I had a feeling it might be slower, but didn't know the magnitude. Did you try both versions?
The version with utf8_char_len_2 is about 1.5x slower than utf8_char_len_1. Of course, we're talking about under a millisecond in every case, so if you're just doing it a few times it doesn't matter at all: 2 µs / 375 µs / 600 µs. That said, copying 1 KB of memory is also unlikely to matter either. :)
