
I have a large buffer of strings (roughly 12 GB) from a C app.

I would like to create PyString objects in C for an embedded Python interpreter without copying the strings. Is this possible?

  • Anything is possible in computing, given enough time, money and computing resources. Is that really your question? Commented Jul 31, 2014 at 19:54
  • I would like to do this without rewriting PyString. Commented Jul 31, 2014 at 19:55
  • @RobertHarvey no, that example uses a copy. See docs.python.org/2/c-api/string.html#PyString_FromStringAndSize Commented Jul 31, 2014 at 20:12
  • The buffer protocol and NumPy work this way: you just give them the C pointer. I was hoping there is a way to do this with strings. Commented Jul 31, 2014 at 20:15
  • @Santa do you have an example of calling ctypes from C to an embedded Python interpreter? Commented Jul 31, 2014 at 20:34

2 Answers


I don't think that is possible, for the basic reason that a Python string's bytes are stored inside the string object itself. In other words, a PyStringObject is the PyObject_VAR_HEAD (plus a cached hash and an interning flag) followed directly by the bytes of the string. To avoid a copy, you would need room in memory to place that header immediately in front of the existing bytes.
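
The inline layout can be observed from Python itself. The following is a minimal sketch, assuming CPython 3, where bytes is the closest equivalent of the Python 2 PyString; the variable names are illustrative only. It shows that the reported object size grows by exactly the payload length, which is what you would expect if the data sits inside the object rather than behind a pointer.

```python
import sys

# CPython-specific: sys.getsizeof reports the object header plus the
# inline payload of a bytes object.
empty_size = sys.getsizeof(b"")
big_size = sys.getsizeof(b"x" * 1000)

# The difference equals the payload length, because the bytes live
# inside the object itself rather than behind a pointer.
payload_growth = big_size - empty_size
print(empty_size, big_size, payload_growth)
```

Because the header and the data form one allocation, an existing C buffer cannot simply be adopted as the payload of a string object.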


1 Comment

Can I just use numpy.str_? It seems these have problems comparing to other PyStrings though.

You can't create a PyString without a copy, but you can use ctypes. It turns out that ctypes.c_char_p behaves much like a string. For example, start with the following C code:

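/* Array of C strings owned by the C application. */
static char* names[7] = {"a", "b", "c", "d", "e", "f", "g"};
PyObject *pFunc, *pArgs, *pValue;
/* td_py_get_callable is the author's own helper, not part of the Python C API. */
pFunc = td_py_get_callable("my_func");
pArgs = PyTuple_New(2);
/* First argument: the address of the array as a plain integer. */
pValue = PyLong_FromSize_t((size_t) names);
PyTuple_SetItem(pArgs, 0, pValue);  /* steals the reference to pValue */
/* Second argument: the number of strings. */
pValue = PyLong_FromLong(7);
PyTuple_SetItem(pArgs, 1, pValue);
pValue = PyObject_CallObject(pFunc, pArgs);
Py_DECREF(pArgs);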

The address and the number of strings are then received by the following Python function my_func:

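import ctypes

def my_func(names_addr, num_strs):
    type_char_p = ctypes.POINTER(ctypes.c_char_p)
    # names_addr is the address of the first char* in the array, so cast
    # the integer to a char**. (from_address would instead treat the
    # memory at that address as the pointer variable itself.)
    names = ctypes.cast(names_addr, type_char_p)
    for idx in range(num_strs):
        print(names[idx])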
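
To try the Python side without an embedding C program, here is a minimal sketch in which a ctypes-owned array stands in for the C app's char*[]. The names read_names and c_names are hypothetical and not part of the original answer. Note that indexing a c_char_p pointer produces a Python bytes object for that one string on each access; the large underlying buffer itself is never duplicated.

```python
import ctypes

def read_names(names_addr, num_strs):
    """Read num_strs C strings from a char*[] located at names_addr."""
    type_char_p = ctypes.POINTER(ctypes.c_char_p)
    # cast() interprets the integer as the address of the first char* element.
    names = ctypes.cast(names_addr, type_char_p)
    return [names[idx] for idx in range(num_strs)]

# Stand-in for the C side: a char*[3] owned by ctypes instead of the C app.
c_names = (ctypes.c_char_p * 3)(b"alpha", b"beta", b"gamma")
result = read_names(ctypes.addressof(c_names), len(c_names))
print(result)
```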

Of course, nobody really wants to pass around an address and a length in Python. We can wrap the pointers in a NumPy array, pass that around, and cast back when we need the values:

import ctypes
import numpy

def my_func(names_addr, num_strs):
    type_char_p = ctypes.POINTER(ctypes.c_char_p)
    # Interpret the integer as a char** pointing at the first array element
    names = ctypes.cast(names_addr, type_char_p)
    # Cast to size_t pointers to be held by numpy
    p = ctypes.cast(names, ctypes.POINTER(ctypes.c_size_t))
    name_addrs = numpy.ctypeslib.as_array(p, shape=(num_strs,))
    # Pass to some numpy functions
    my_numpy_func(name_addrs)

The catch is that indexing the NumPy array only gives you an address, although the array shares its memory with the original C pointer array. We can cast back to ctypes.POINTER(ctypes.c_char_p) to access the values:

def my_numpy_func(name_addrs):
    names = name_addrs.ctypes.data_as(ctypes.POINTER(ctypes.c_char_p))
    for i in range(len(name_addrs)):
        print(names[i])

It's not perfect, since I can't use things like numpy.searchsorted to do a binary search at the NumPy level, but it passes char* around without a copy well enough.
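
If a binary search is needed anyway, one possible workaround is to do it at the Python level rather than the NumPy level. The sketch below assumes the C strings are already sorted in byte order; CStringArray is a hypothetical wrapper class, and the ctypes-owned array only stands in for the C app's buffer. It relies on the standard-library bisect module, which works on any object providing __len__ and __getitem__.

```python
import bisect
import ctypes

class CStringArray:
    """Hypothetical read-only sequence view over a sorted char*[] in C memory."""

    def __init__(self, addr, count):
        self._ptr = ctypes.cast(addr, ctypes.POINTER(ctypes.c_char_p))
        self._count = count

    def __len__(self):
        return self._count

    def __getitem__(self, idx):
        # Bounds check: a raw ctypes pointer would happily read past the end.
        if not 0 <= idx < self._count:
            raise IndexError(idx)
        return self._ptr[idx]

# Stand-in for the C side: a sorted char*[4] owned by ctypes.
c_names = (ctypes.c_char_p * 4)(b"a", b"c", b"e", b"g")
view = CStringArray(ctypes.addressof(c_names), len(c_names))

pos = bisect.bisect_left(view, b"e")      # index of an existing entry
missing = bisect.bisect_left(view, b"d")  # insertion point for an absent entry
found = pos < len(view) and view[pos] == b"e"
```

Each probe materializes only the one string being compared, so the search touches O(log n) strings instead of copying the whole buffer. It will be slower per comparison than a native NumPy search.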
