Python 3.x makes a clear distinction between the types:
str = '...' literals = a sequence of characters. A “character” is a basic unit of text: a letter, digit, punctuation mark, symbol, space, or “control character” (like tab or backspace). The Unicode standard assigns each character to an integer code point between 0 and 0x10FFFF. (Well, more or less. Unicode includes ligatures and combining characters, so a string might not have the same number of code points as user-perceived characters.) Internally, str uses a flexible string representation that can use either 1, 2, or 4 bytes per code point.
bytes = b'...' literals = a sequence of bytes. A “byte” is the smallest integer type addressable on a computer, which is nearly universally an octet, or 8-bit unit, thus allowing numbers between 0 and 255.
If you're familiar with:
- Java or C#, think of
str as String and bytes as byte[];
- SQL, think of
str as NVARCHAR and bytes as BINARY or BLOB;
- Windows registry, think of
str as REG_SZ and bytes as REG_BINARY.
If you're familiar with C(++), then forget everything you've learned about char and strings, because a character is not a byte. That idea is long obsolete.
You use str when you want to represent text.
print('שלום עולם')
You use bytes when you want to represent low-level binary data like structs.
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
You can encode a str to a bytes object.
>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'
And you can decode a bytes into a str.
>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
But you can't freely mix the two types.
>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
The b'...' notation is somewhat confusing in that it allows the printable range of ASCII characters i.e. character code 32 - 126 (space " " up to tilde "~") to be used as a shorthand directly instead of their hex values (0x20 up to 0x7E)
>>> b'A' == b'\x41'
True
But I must emphasize, a character is not a byte.
>>> 'A' == b'A'
False
In Python 2.x
Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there was:
unicode = u'...' literals = sequence of Unicode characters = 3.x str
str = '...' literals = sequences of confounded bytes/characters
- Usually text, encoded in some unspecified encoding.
- But also used to represent binary data like
struct.pack output.
In order to ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6, in order to allow distinguishing binary strings (which should be bytes in 3.x) from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert it to a Unicode string in 3.x.
So yes, b'...' literals in Python have the same purpose that they do in PHP.
Also, just out of curiosity, are there
more symbols than the b and u that do
other things?
The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
The f prefix (introduced in Python 3.6) creates a “formatted string literal” which can reference Python variables. For example, f'My name is {name}.' is shorthand for 'My name is {0}.'.format(name).
stringprefix::= "r" | "u" | "R" | "U" | "f" | "F" | "fr" | "Fr" | "fR" | "FR" | "rf" | "rF" | "Rf" | "RF"bytesprefix::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB" Documentation: String and Bytes literals