PyUnicode_SubtypeFromData draft

encukou · August 22, 2025, 4:44pm

As identified in previous discussion, the remaining use case for PyUnicodeObject internals is initializing subclasses.
Let’s solve this, so we can eventually deprecate – and then change – the internal layout.

My design largely converged to @vstinner’s PEP 756 – PyUnicode_Export() and PyUnicode_Import(), so it’ll be familiar if you’ve read that.

PyUnicode_Import – here, PyUnicode_SubtypeFromData – will be O(1) if you feed it data in just the right format, which means its performance should be comparable to poking PyUnicodeObject directly.
But if you feed it data in the wrong format (e.g. CPython just changed internals, or you’re running on PyPy), the call will be slower but still correct.
(Optimizing for non-CPython interpreters should be possible, though harder.)
PyUnicode_Export is based directly on Victor’s PEP, with flags changed
to match the import, and a fast “failure” case (no exception allocation).

I’ve also included flags for embedded NUL and embedded surrogates, properties we might want to cache.

String “import”

This function is designed for “zero-copy” operation if the provided buffer is in the exact format that PyUnicode internals use.

CPython implementation detail: if type is the exact PyUnicode_Type, data will be copied (the function will create a “compact” string).

int PyUnicode_SubtypeFromData(
    PyTypeObject *type,
    PyObject **result,
    void *data,
    Py_ssize_t nbytes,
    int32_t format,
    int32_t flags);

Allocate an instance of type, which must be a subtype of PyUnicode_Type, and initialize its contents.
Instance memory that does not belong to PyUnicode is not initialized (so if type->tp_alloc is the recommended PyType_GenericAlloc(), it will be zeroed).

On success, set *result to the new object, and return 0 (or other non-negative value as documented below).

On error, set *result to NULL and return -1 with an exception set.

data must point to a buffer of nbytes bytes, in the format given by format. The data will be copied to the object’s internal storage (unless flags specify otherwise as documented below).

result must not be NULL.
data may only be NULL if nbytes is zero.

format must be exactly one of the following:

#define PyUnicode_FORMAT_UCS1  0x01   // Py_UCS1 *data
#define PyUnicode_FORMAT_UCS2  0x02   // Py_UCS2 *data
#define PyUnicode_FORMAT_UCS4  0x04   // Py_UCS4 *data
#define PyUnicode_FORMAT_UTF8  0x08   // char *data

(Future versions of CPython, and other Python implementations, may allow additional values.)
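
To illustrate the "exactly one" rule, here is a minimal caller-side sketch. The helper name `is_valid_format` and the `KNOWN_FORMATS` mask are hypothetical, and the macro values are simply copied from the draft above; none of this exists in a released CPython header.

```c
#include <stdint.h>

/* Values copied from the draft; not part of any released header. */
#define PyUnicode_FORMAT_UCS1  0x01
#define PyUnicode_FORMAT_UCS2  0x02
#define PyUnicode_FORMAT_UCS4  0x04
#define PyUnicode_FORMAT_UTF8  0x08

/* Hypothetical mask of the four formats listed in the draft. */
#define KNOWN_FORMATS (PyUnicode_FORMAT_UCS1 | PyUnicode_FORMAT_UCS2 | \
                       PyUnicode_FORMAT_UCS4 | PyUnicode_FORMAT_UTF8)

/* Hypothetical helper: 1 if exactly one known format bit is set, else 0. */
static int
is_valid_format(int32_t format)
{
    if (format <= 0 || (format & ~KNOWN_FORMATS)) {
        return 0;
    }
    /* A value with a single bit set shares no bits with (value - 1). */
    return (format & (format - 1)) == 0;
}
```

A hard-coded mask like this would reject values that a future interpreter accepts, so it is only suitable as a debugging aid.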

On implementations that allow surrogates in str (including CPython), the data shall be decoded using the ‘surrogatepass’ error handler. (But see the “SURROGATES” flag below)

flags is a bitwise-OR combination of the following.
Unused bits must be unset.

The interpreter may safely ignore any flag.

Some flags are paired to encode assertions: there is a “yes” flag and a “no” flag. If neither is set, then the property in question is unknown.
If both are set, or if an asserted property isn’t actually true, behaviour is undefined.
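
Since setting both halves of a pair is undefined behaviour, a caller might want a debug-time check before passing flags in. The following is only a sketch: `flags_are_consistent` and `KNOWN_FLAGS` are hypothetical names, and the values are the ones listed below in this draft.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the draft; not part of any released header. */
#define PyUnicode_FLAG_CONSUME_BUFFER        0x0001
#define PyUnicode_FLAG_EXTRA_NUL_TERMINATOR  0x0002
#define PyUnicode_FLAG_EMBEDDED_NUL          0x0100
#define PyUnicode_FLAG_NO_EMBEDDED_NUL       0x0200
#define PyUnicode_FLAG_SURROGATES            0x0400
#define PyUnicode_FLAG_NO_SURROGATES         0x0800
#define PyUnicode_FLAG_TIGHT_FORMAT          0x1000
#define PyUnicode_FLAG_LARGE_FORMAT          0x2000
#define PyUnicode_FLAG_INVALID_UNICODE       0x4000
#define PyUnicode_FLAG_VALID_UNICODE         0x8000

/* Hypothetical mask of every flag bit the draft defines. */
#define KNOWN_FLAGS 0xFF03

/* Hypothetical helper: 1 if no unused bit is set and no yes/no pair
   has both halves set, else 0. */
static int
flags_are_consistent(int32_t flags)
{
    static const int32_t pairs[][2] = {
        {PyUnicode_FLAG_EMBEDDED_NUL,    PyUnicode_FLAG_NO_EMBEDDED_NUL},
        {PyUnicode_FLAG_SURROGATES,      PyUnicode_FLAG_NO_SURROGATES},
        {PyUnicode_FLAG_TIGHT_FORMAT,    PyUnicode_FLAG_LARGE_FORMAT},
        {PyUnicode_FLAG_INVALID_UNICODE, PyUnicode_FLAG_VALID_UNICODE},
    };
    if (flags & ~KNOWN_FLAGS) {
        return 0;  /* unused bit (or the reserved sign bit) is set */
    }
    for (size_t i = 0; i < sizeof(pairs) / sizeof(pairs[0]); i++) {
        if ((flags & pairs[i][0]) && (flags & pairs[i][1])) {
            return 0;  /* both "yes" and "no" asserted */
        }
    }
    return 1;
}
```

This cannot detect the other source of undefined behaviour, an asserted property that is not actually true; that needs a scan of the data.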

#define PyUnicode_FLAG_CONSUME_BUFFER 0x0001

The interpreter may take ownership of the buffer, which must have been
allocated using PyMem_Malloc.

If it does, PyUnicode_SubtypeFromData will return 1; in this case the
caller must not use the buffer further.

#define PyUnicode_FLAG_EXTRA_NUL_TERMINATOR 0x0002

There is an extra null terminator (1-4 bytes, depending on format) just
past the end of the buffer.

Note that the terminator is not included in nbytes. (This allows the flag
to be ignorable.)
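
A sketch of the buffer layout this implies: the terminator sits just past the nbytes of payload and is as wide as one code unit. The helper names are hypothetical, and the 1-byte terminator for UTF8 is my assumption. The sketch uses plain malloc so it stands alone, which means such a buffer must not be combined with PyUnicode_FLAG_CONSUME_BUFFER (that flag requires PyMem_Malloc).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Values copied from the draft; not part of any released header. */
#define PyUnicode_FORMAT_UCS1  0x01
#define PyUnicode_FORMAT_UCS2  0x02
#define PyUnicode_FORMAT_UCS4  0x04
#define PyUnicode_FORMAT_UTF8  0x08

/* Hypothetical helper: terminator width in bytes, or 0 if unknown. */
static size_t
terminator_size(int32_t format)
{
    switch (format) {
    case PyUnicode_FORMAT_UCS1:
    case PyUnicode_FORMAT_UTF8:  /* assumption: one zero byte for UTF-8 */
        return 1;
    case PyUnicode_FORMAT_UCS2:
        return 2;
    case PyUnicode_FORMAT_UCS4:
        return 4;
    default:
        return 0;
    }
}

/* Hypothetical helper: copy nbytes of data into a fresh buffer with a
   zeroed terminator just past the end. The caller still passes nbytes
   (without the terminator) as the length. Free the result with free(). */
static void *
copy_with_terminator(const void *data, size_t nbytes, int32_t format)
{
    size_t term = terminator_size(format);
    if (term == 0 || nbytes > SIZE_MAX - term) {
        return NULL;
    }
    unsigned char *buf = malloc(nbytes + term);
    if (buf == NULL) {
        return NULL;
    }
    if (nbytes > 0) {
        memcpy(buf, data, nbytes);
    }
    memset(buf + nbytes, 0, term);
    return buf;
}
```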

#define PyUnicode_FLAG_EMBEDDED_NUL              0x0100  // yes
#define PyUnicode_FLAG_NO_EMBEDDED_NUL           0x0200  // no

The string contains at least one null character.

#define PyUnicode_FLAG_SURROGATES                0x0400  // yes
#define PyUnicode_FLAG_NO_SURROGATES             0x0800  // no

The string contains at least one surrogate.
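
If a caller does not already know these properties, one scan can produce both pairs at once. A minimal sketch for UCS2 data follows; `scan_ucs2_flags` is a hypothetical helper, not part of the proposal. Leaving all four flags unset ("unknown") is equally valid and skips the scan.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from the draft; not part of any released header. */
#define PyUnicode_FLAG_EMBEDDED_NUL     0x0100
#define PyUnicode_FLAG_NO_EMBEDDED_NUL  0x0200
#define PyUnicode_FLAG_SURROGATES       0x0400
#define PyUnicode_FLAG_NO_SURROGATES    0x0800

/* Hypothetical helper: scan nchars UCS2 code units and return the
   matching "yes"/"no" flag for each of the two properties. */
static int32_t
scan_ucs2_flags(const uint16_t *data, size_t nchars)
{
    int has_nul = 0;
    int has_surrogate = 0;
    for (size_t i = 0; i < nchars; i++) {
        if (data[i] == 0) {
            has_nul = 1;
        }
        else if (data[i] >= 0xD800 && data[i] <= 0xDFFF) {
            has_surrogate = 1;
        }
    }
    return (has_nul ? PyUnicode_FLAG_EMBEDDED_NUL
                    : PyUnicode_FLAG_NO_EMBEDDED_NUL)
         | (has_surrogate ? PyUnicode_FLAG_SURROGATES
                          : PyUnicode_FLAG_NO_SURROGATES);
}
```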

```
#define PyUnicode_FLAG_TIGHT_FORMAT              0x1000  // yes
#define PyUnicode_FLAG_LARGE_FORMAT              0x2000  // no
```
The property asserted by the “yes” flag depends on format:
- UCS1: there’s at least one non-ASCII character (>127)
- UCS2: there’s at least one non-UCS1 character (>255)
- UCS4: there’s at least one non-UCS2 character (>65535)
- UTF8: unused; both flags must be zero. (Use UCS1 for known ASCII.)
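
A sketch of how a caller that already tracks the largest code point could derive this pair. It assumes that TIGHT_FORMAT ("yes") asserts the per-format property listed above and LARGE_FORMAT ("no") asserts its absence; `tight_format_flag` is a hypothetical helper.

```c
#include <stdint.h>

/* Values copied from the draft; not part of any released header. */
#define PyUnicode_FORMAT_UCS1  0x01
#define PyUnicode_FORMAT_UCS2  0x02
#define PyUnicode_FORMAT_UCS4  0x04
#define PyUnicode_FORMAT_UTF8  0x08
#define PyUnicode_FLAG_TIGHT_FORMAT  0x1000
#define PyUnicode_FLAG_LARGE_FORMAT  0x2000

/* Hypothetical helper: given the format and the largest code point in
   the string, return the flag to set (0 where both must stay unset). */
static int32_t
tight_format_flag(int32_t format, uint32_t max_char)
{
    uint32_t threshold;
    switch (format) {
    case PyUnicode_FORMAT_UCS1: threshold = 127;   break;
    case PyUnicode_FORMAT_UCS2: threshold = 255;   break;
    case PyUnicode_FORMAT_UCS4: threshold = 65535; break;
    default:
        return 0;  /* UTF8 (or an unknown format): leave both flags zero */
    }
    return max_char > threshold ? PyUnicode_FLAG_TIGHT_FORMAT
                                : PyUnicode_FLAG_LARGE_FORMAT;
}
```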
```
#define PyUnicode_FLAG_INVALID_UNICODE           0x4000  // yes
#define PyUnicode_FLAG_VALID_UNICODE             0x8000  // no
```
The buffer contains at least one invalid UTF-8 sequence (including overlong
encodings) or out-of-range value (one greater than 0x10ffff).

(CPython will reject invalid strings.)

The sign bit is reserved for errors.

Flag information

Inspired by PyLong_GetNativeLayout.

const PyUnicodeFlagInfo *PyUnicode_GetFlagInfo(int32_t format);

On success, return a pointer to a statically allocated PyUnicodeFlagInfo structure that describes PyUnicode_SubtypeFromData behaviour.
On error, return NULL with an exception set.

format may be zero, or a value accepted by PyUnicode_SubtypeFromData to
get information specific to a given format.

struct PyUnicodeFlagInfo has 4 fields:

int32_t recognized_formats: bit-mask of allowable bits for the format argument.
(PyUnicode_SubtypeFromData will fail if any other bits are set.)
int32_t preferred_formats: bit-mask of format bits that are preferred – for example, give the best performance.
int32_t recognized_flags: bit-mask of flags bits that the interpreter checks or sets.
int32_t preferred_flags: bit-mask of flags bits that are most likely to give you best performance (e.g. zero-copy).
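
To show how these masks might be consumed, here is a sketch that picks a format from the ones a caller can produce. `FlagInfoSketch` is a stand-in struct with the four fields above (the real layout and field order are not specified here), and `lowest_bit` / `pick_format` are hypothetical helpers.

```c
#include <stdint.h>

/* Stand-in for PyUnicodeFlagInfo; the field order is an assumption. */
typedef struct {
    int32_t recognized_formats;
    int32_t preferred_formats;
    int32_t recognized_flags;
    int32_t preferred_flags;
} FlagInfoSketch;

/* Isolate the lowest set bit (0 if none); done in unsigned arithmetic
   to avoid signed overflow. */
static int32_t
lowest_bit(int32_t bits)
{
    uint32_t u = (uint32_t)bits;
    return (int32_t)(u & (~u + 1u));
}

/* Hypothetical helper: among the formats the caller can produce
   (available), choose a preferred one if possible, otherwise any
   recognized one. Returns 0 if none of them is recognized. */
static int32_t
pick_format(const FlagInfoSketch *info, int32_t available)
{
    int32_t preferred = available & info->preferred_formats;
    if (preferred) {
        return lowest_bit(preferred);
    }
    return lowest_bit(available & info->recognized_formats);
}
```

Taking the lowest bit is an arbitrary tie-break; a real caller would more likely choose the format that is cheapest for it to produce.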

Export API

This doesn’t solve my motivation; if it’s controversial, the proposal
will be fine without it.

int32_t PyUnicode_Export(
    PyObject *unicode,
    int32_t formats,
    Py_buffer *view,
    int32_t *flags);

Export the contents of the unicode string in one of the requested formats (the bits set in formats).

On success, fill *view and *flags, and return a format (greater than 0).
If none of the requested formats is available, zero-fill *view and *flags and return 0.
On other error, zero-fill *view and *flags, and return -1 with an exception
set.

After a successful call to PyUnicode_Export(), the view buffer must be released by PyBuffer_Release(). The contents of the buffer are valid until they are released.

The resulting buffer is read-only and must not be modified.

flags may be NULL, in which case *flags is never set.
Otherwise, on success, *flags is set to optional flags that describe the buffer, as with PyUnicode_SubtypeFromData.
The interpreter is free to leave any flag bit unset even if the given feature is present.

Flag notes:

- PyUnicode_FLAG_CONSUME_BUFFER is never set.
- PyUnicode_FLAG_EXTRA_NUL_TERMINATOR signals a NUL terminator that is not counted in view->len.

The string data is not copied and no conversion is done.
There is no guarantee that PyUnicode_Export will succeed; buffer availability is an implementation detail that may change at any time (even at runtime for a given object). Same for flags.
If PyUnicode_Export returns 0, most extensions should fall back to a function like PyUnicode_AsUTF8AndSize.

Note that future versions of Python may introduce additional formats, including combinations with the existing flags.
This means that result should not be treated as a bit-field.
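
In practice that suggests comparing the result for equality, for example in a switch, rather than testing bits. A sketch of the three-way result handling follows; the enum and `classify_export_result` are hypothetical names.

```c
#include <stdint.h>

/* Values copied from the draft; not part of any released header. */
#define PyUnicode_FORMAT_UCS1  0x01
#define PyUnicode_FORMAT_UCS2  0x02
#define PyUnicode_FORMAT_UCS4  0x04
#define PyUnicode_FORMAT_UTF8  0x08

/* Hypothetical classification of a PyUnicode_Export() return value. */
typedef enum {
    EXPORT_ERROR,           /* -1: exception set, propagate it */
    EXPORT_FALLBACK,        /*  0: no requested format available */
    EXPORT_USE_BUFFER,      /* a format this code understands */
    EXPORT_UNKNOWN_FORMAT   /* positive, but not a value we know */
} ExportAction;

static ExportAction
classify_export_result(int32_t result)
{
    if (result < 0) {
        return EXPORT_ERROR;
    }
    switch (result) {  /* equality, not (result & SOME_FORMAT) */
    case 0:
        return EXPORT_FALLBACK;
    case PyUnicode_FORMAT_UCS1:
    case PyUnicode_FORMAT_UCS2:
    case PyUnicode_FORMAT_UCS4:
    case PyUnicode_FORMAT_UTF8:
        return EXPORT_USE_BUFFER;
    default:
        /* Still a successful export: the view must be released. */
        return EXPORT_UNKNOWN_FORMAT;
    }
}
```

A bit test like `result & PyUnicode_FORMAT_UCS2` would wrongly match a hypothetical future value such as 0x03, which the switch treats as unknown.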

Future extensions

High bits of format may be used for behaviour-changing flags: decode surrogates, strip BOM, specify byte order, initialize existing object, share a non-PyMem_Malloc’d buffer, lazy loading and similar.
CPython does not need to implement these if they only help other implementations.