
It's simple to create a member for an object in a Python C extension with a base type of char *, using the T_STRING define in the PyMemberDef declaration.

Why does there not seem to be an equivalent for wchar_t *? And if there actually is one, what is it?

e.g.

struct object contains char *text

PyMemberDef array has {"text", T_STRING, offsetof(struct object, text), READONLY, "This is a normal character string."}

versus something like

struct object contains wchar_t *wtext

PyMemberDef array has {"wtext", T_WSTRING, offsetof(struct object, wtext), READONLY, "This is a wide character string"}

I understand that something like PyUnicode_AsUTF8String() and its related functions can be used to encode the data as UTF-8, store it in a plain char string, and decode it later. Doing it that way, though, would require wrapping the generic getattr and setattr functions with ones that account for the encoded text. It is also not very useful when you want character arrays of fixed element size within a struct and don't want the number of characters they can hold to vary.
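To make the fixed-size concern concrete, here is a minimal sketch in plain C (no Python headers). The struct record, the helper utf8_codepoints, and the function demo_utf8_capacity are hypothetical names made up for illustration. It shows that the same 8-byte char buffer holds a different number of characters depending on content, while a wchar_t array always has the same number of elements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical struct mirroring the question: two fixed-size buffers. */
struct record {
    char    text[8];   /* UTF-8 bytes: character capacity depends on content */
    wchar_t wtext[8];  /* always 8 wchar_t elements (7 plus the terminator) */
};

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes, which all have the bit pattern 10xxxxxx. */
static size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Illustration only: byte length and character count diverge for UTF-8. */
static void demo_utf8_capacity(void)
{
    struct record r;
    const char *ascii = "abcdefg";                    /* 7 bytes, 7 characters */
    const char *accented = "\xC3\xA9\xC3\xA9\xC3\xA9"; /* 6 bytes, 3 characters */

    assert(strlen(ascii) == 7 && utf8_codepoints(ascii) == 7);
    assert(strlen(accented) == 6 && utf8_codepoints(accented) == 3);
    assert(sizeof r.wtext / sizeof r.wtext[0] == 8);
}
```

Both example strings fit in text[8], yet one holds seven characters and the other only three, which is the variability I would like to avoid.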

  • I don't know if this answers your question, but: depending on how Python is compiled, Py_UNICODE might be wchar_t. Python can either use 2 bytes per unicode character (i.e. wchar), or 4. So C code needs to use the PyUnicode_* functions to handle unicode strings without assuming what format they're stored in. Commented May 31, 2011 at 20:48
  • @Thomas: wchar_t is either two or four bytes, depending on platform. Commented Jun 1, 2011 at 4:50

1 Answer


Using a wchar_t directly is not portable. Instead, Python defines the Py_UNICODE type as the storage unit for a Unicode character.

Depending on the platform, Py_UNICODE may be defined as wchar_t if available, or an unsigned short/integer/long, the width of which will vary depending on how Python is configured (UCS2 vs UCS4) and the architecture and C compiler used. You can find the relevant definitions in unicodeobject.h.
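As a rough illustration of why the width matters, the following plain C sketch (no Python headers) inspects wchar_t on the platform it is compiled for. The function names wchar_bits, wchar_holds_any_code_point, and demo_wchar_width are hypothetical and exist only for this example. It says nothing about how a given Python build defines Py_UNICODE.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

/* Width of wchar_t in bits on the platform this is compiled for. */
static unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}

/* Nonzero if a single wchar_t can represent every Unicode code point
   (the highest one is U+10FFFF); zero if some need more than one unit. */
static int wchar_holds_any_code_point(void)
{
    return (unsigned long)WCHAR_MAX >= 0x10FFFFUL;
}

/* Illustration only: a wchar_t of 32 bits or more always has enough room. */
static void demo_wchar_width(void)
{
    if (wchar_bits() >= 32) {
        assert(wchar_holds_any_code_point());
    }
}
```

On platforms with a 16-bit wchar_t the second function returns 0, so code that assumes one element per character would not carry over unchanged.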

For your use case, your object can have an attribute that is a Unicode string, using T_OBJECT:

static struct PyMemberDef attr_members[] = {
  { "wtext", T_OBJECT, offsetof(PyAttrObject, wtext), READONLY, "wide string" },
  ...

You can perform type checking in the object's initializer:

...
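if (!PyUnicode_CheckExact(arg)) {
    /* A wrong argument type is conventionally reported as TypeError. */
    PyErr_SetString(PyExc_TypeError, "arg must be a unicode string");
    return NULL;  /* return -1 instead if this code lives in a tp_init function */
}
Py_INCREF(arg);  /* the struct keeps its own reference to the string */
self->wtext = arg;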
...

If you ever need to iterate over the low-level characters in the Unicode string, the PyUnicode_AS_UNICODE macro returns a Py_UNICODE * (note that this legacy Py_UNICODE API was deprecated in Python 3.3 and removed in 3.12):

Py_ssize_t i;  /* same type as size, so the comparison cannot truncate */
Py_ssize_t size = PyUnicode_GetSize(self->wtext);
Py_UNICODE *chars = PyUnicode_AS_UNICODE(self->wtext);
for (i = 0; i < size; i++) {
    /* use chars[i] */
    ...
}

2 Comments

I see. If I'm not mistaken, though, the Python reference seems to recommend the use of T_OBJECT_EX over T_OBJECT due to how certain cases are handled.
Yep, you could use T_OBJECT_EX instead. For a READONLY attribute (which cannot be deleted), T_OBJECT should also work fine. The choice depends on what you want to happen when self->wtext is NULL: T_OBJECT returns None, while T_OBJECT_EX raises AttributeError. Which one is right depends on the behavior you want your object to exhibit.
