Possible to use wide-character members in Python extension objects?

Question

It's simple to create a member for an object in a Python C extension with a base type of char *, using the T_STRING define in the PyMemberDef declaration.

Why does there not seem to be an equivalent for wchar_t *? And if there actually is one, what is it?

e.g.

struct object contains char *text

PyMemberDef array has {"text", T_STRING, offsetof(struct object, text), READONLY, "This is a normal character string."}

versus something like

struct object contains wchar_t *wtext

PyMemberDef array has {"wtext", T_WSTRING, offsetof(struct object, wtext), READONLY, "This is a wide character string"}

I understand that something like PyUnicode_AsUTF8String() and its related functions can be used to encode the data as UTF-8, store it in a plain char string, and decode it later. But doing it that way would require wrapping the generic getattr and setattr functions with ones that account for the encoded text. It's also not very useful when you want character arrays of fixed element size within a struct and don't want the number of characters that fit in them to vary.
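To make the fixed-size concern concrete, here is a minimal plain-C sketch (an editorial illustration, not the asker's code). The helper name `utf8_length` is hypothetical, and the code assumes each wchar_t holds one whole code point, so surrogate pairs on platforms with a 16-bit wchar_t are not handled.

```c
#include <stddef.h>
#include <wchar.h>

/* Number of bytes needed to store s as UTF-8, excluding the terminator.
   Assumption: each wchar_t element is one complete code point. */
static size_t utf8_length(const wchar_t *s)
{
    size_t bytes = 0;
    for (; *s != L'\0'; s++) {
        unsigned long cp = (unsigned long)*s;
        if (cp < 0x80UL)
            bytes += 1;   /* ASCII range */
        else if (cp < 0x800UL)
            bytes += 2;   /* e.g. U+00E9 */
        else if (cp < 0x10000UL)
            bytes += 3;   /* e.g. U+20AC */
        else
            bytes += 4;   /* code points beyond the BMP */
    }
    return bytes;
}
```

Four ASCII characters need 4 bytes, while four U+00E9 characters need 8, so a char buffer of a given size holds a varying number of characters once the text is UTF-8 encoded.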

I don't know if this answers your question, but: depending on how Python is compiled, Py_UNICODE might be wchar_t. Python can either use 2 bytes per unicode character (i.e. wchar), or 4. So C code needs to use the PyUnicode_* functions to handle unicode strings without assuming what format they're stored in.
– Thomas K, Commented May 31, 2011 at 20:48
@Thomas: wchar_t is either two or four bytes, depending on platform.
– Dietrich Epp, Commented Jun 1, 2011 at 4:50

samplebias · Accepted Answer · 2011-06-01 03:01:51Z

2

Using a wchar_t directly is not portable. Instead, Python defines the Py_UNICODE type as the storage unit for a Unicode character.

Depending on the platform, Py_UNICODE may be defined as wchar_t if available, or an unsigned short/integer/long, the width of which will vary depending on how Python is configured (UCS2 vs UCS4) and the architecture and C compiler used. You can find the relevant definitions in unicodeobject.h.
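The platform dependence can be seen in plain C without any Python headers. This is only a sketch under stated assumptions: the struct `record` and both helper names are hypothetical and do not come from the answer. A fixed-size wchar_t array always holds the same number of elements, but its size in bytes follows the platform's wchar_t.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical struct with a fixed-size wide-character field. */
struct record {
    wchar_t wtext[16];
};

/* Element capacity of the field: 16 on every platform. */
static size_t record_capacity(void)
{
    return sizeof(((struct record *)0)->wtext) / sizeof(wchar_t);
}

/* Byte size of the field: 32 where wchar_t is 2 bytes, 64 where it is 4. */
static size_t record_bytes(void)
{
    return sizeof(((struct record *)0)->wtext);
}
```

Code that serializes such a struct, or that copies it into a Py_UNICODE buffer, cannot assume the two element types have the same width.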

For your use case, your object can have an attribute that is a Unicode string, using T_OBJECT:

static struct PyMemberDef attr_members[] = {
    { "wtext", T_OBJECT, offsetof(PyAttrObject, wtext), READONLY, "wide string" },
    ...

You can perform type checking in the object's initializer:

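...
if (!PyUnicode_CheckExact(arg)) {
    /* A wrong argument type is conventionally reported as TypeError. */
    PyErr_SetString(PyExc_TypeError, "arg must be a unicode string");
    return NULL;  /* return -1 instead if this code is in a tp_init function */
}
Py_INCREF(arg);  /* the object keeps its own reference */
self->wtext = arg;
...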

If you ever need to iterate over the low-level characters in the Unicode string, the PyUnicode_AS_UNICODE macro returns a Py_UNICODE * pointing at them:

Py_ssize_t i;  /* same type as size, so the comparison below is safe */
Py_ssize_t size = PyUnicode_GetSize(self->wtext);
Py_UNICODE *chars = PyUnicode_AS_UNICODE(self->wtext);
for (i = 0; i < size; i++) {
    /* use chars[i] */
    ...
}

2 Comments

JAB Over a year ago

I see. If I'm not mistaken, though, the Python reference seems to recommend the use of T_OBJECT_EX over T_OBJECT due to how certain cases are handled.

samplebias Over a year ago

Yep, you could use T_OBJECT_EX instead. For a READONLY attribute (which cannot be deleted) a T_OBJECT should also work fine. Choice also depends on whether you want a NULL value for self->wtext to raise an error or just return None, which really depends on the behavior you want your object to exhibit.
