Given the following file bug.txt:

event "øat" not handled

I wrote the following Python C extension in the file fastfilewrapper.cpp:

#include <Python.h>
#include <cstdio>
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>  // std::string and std::getline

static PyObject* hello_world(PyObject *self, PyObject *args) {
    printf("Hello, world!\n");
    std::string retval;
    std::ifstream fileifstream;

    fileifstream.open("./bug.txt");
    std::getline( fileifstream, retval );
    fileifstream.close();
    std::cout << "retval " << retval << std::endl;
    return Py_BuildValue( "s", retval.c_str() );
}

static PyMethodDef hello_methods[] = { {
        "hello_world", hello_world, METH_NOARGS,
        "Print 'hello world' from a method defined in a C extension."
    },
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef hello_definition = {
    PyModuleDef_HEAD_INIT,
    "hello", "A Python module that prints 'hello world' from C code.",
    -1, hello_methods
};

PyMODINIT_FUNC PyInit_fastfilepackage(void) {
    Py_Initialize();
    return PyModule_Create(&hello_definition);
}

I built it with pip3 install . using this setup.py:

from distutils.core import setup, Extension

# https://bugs.python.org/issue35893
from distutils.command import build_ext

def get_export_symbols(self, ext):
    parts = ext.name.split(".")
    if parts[-1] == "__init__":
        initfunc_name = "PyInit_" + parts[-2]
    else:
        initfunc_name = "PyInit_" + parts[-1]

build_ext.build_ext.get_export_symbols = get_export_symbols

setup(name='fastfilepackage', version='1.0',  \
      ext_modules=[Extension('fastfilepackage', ['fastfilewrapper.cpp'])])

Then, I use this test.py script:

import fastfilepackage

iterable = fastfilepackage.hello_world()
print('iterable', iterable)

But Python throws this exception when I run test.py:

$ PYTHONIOENCODING=utf8 python3 test.py
Hello, world!
retval event "▒at" not handled
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    iterable = fastfilepackage.hello_world()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 7: invalid start byte

How can I recover from invalid Unicode characters, i.e., ignore these errors when binding C and Python?

When purely working with Python, I can use this:

file_in = open( './bug.txt', errors='replace' )
line = file_in.read()
print( "The input line was: {line}".format(line=line) )
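
For reference, here is a self-contained sketch of that pure-Python behaviour. It assumes bug.txt contains the single byte 0xF8 (which is "ø" in Latin-1 but not a valid UTF-8 start byte), recreates such a file in a temporary location, and passes encoding="utf-8" explicitly so the result does not depend on the locale:

```python
import os
import tempfile

# Assumption: bug.txt holds the Latin-1 byte 0xF8 where "ø" is displayed.
raw_line = b'event "\xf8at" not handled\n'

# Write the raw bytes to a temporary stand-in for bug.txt.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw_line)
    path = tmp.name

try:
    # errors="replace" substitutes U+FFFD for every undecodable byte
    # instead of raising UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="replace") as file_in:
        line = file_in.read()
finally:
    os.remove(path)

print("The input line was: {line}".format(line=line))
```

The invalid byte comes back as the replacement character U+FFFD, and the rest of the line is unchanged.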

What is the equivalent of errors='replace' in a Python C extension?

1 Answer

If you want 'replace' error-handling semantics, decode the string on the C side and return the resulting object to Python:

return PyUnicode_DecodeUTF8(retval.c_str(), retval.size(), "replace");

In our case this gives something like:

Hello, world!
retval event "?at" not handled
iterable event "�at" not handled
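
The errors argument of PyUnicode_DecodeUTF8 takes the same handler names as Python's own bytes.decode, so the effect can be previewed from Python without rebuilding the extension. The following is only a sketch: RAW is an assumed copy of the bytes in bug.txt, and decode_line is a hypothetical helper, not part of the extension.

```python
# Assumption: these are the bytes std::getline reads from bug.txt.
RAW = b'event "\xf8at" not handled'


def decode_line(raw: bytes, errors: str = "replace") -> str:
    """Decode raw as UTF-8 using the given error handler name."""
    return raw.decode("utf-8", errors)


# "strict" is the default handler and fails on the byte 0xF8.
try:
    decode_line(RAW, "strict")
except UnicodeDecodeError as exc:
    strict_error = str(exc)

replaced = decode_line(RAW, "replace")  # invalid byte becomes U+FFFD
ignored = decode_line(RAW, "ignore")    # invalid byte is dropped

print(strict_error)
print(replaced)
print(ignored)
```

"replace" keeps a visible marker where data was lost, while "ignore" drops the byte silently. Which one is appropriate depends on whether downstream code needs to know that the input was damaged.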

3 Comments

retval (a std::string) has a member called size() with constant complexity O(1), so there is no need to use strlen(str), which has complexity O(n).
retval.c_str() also has constant complexity O(1) (C++11 and later), so there is no need to declare a char* to store its value. The fix is simply to replace the line return Py_BuildValue( "s", retval.c_str() ); with return PyUnicode_DecodeUTF8(retval.c_str(), retval.size(), "replace"); since PyUnicode_DecodeUTF8 already returns a PyObject *, which is exactly what I need to return from hello_world.
I have updated the answer, as it is now (after the removal of strlen) a more concise solution. Regarding the assignment const char *cStr = retval.c_str();: there is no dynamic memory allocation for the string value; only the pointer to the string is assigned to a variable, which is negligible in terms of performance in any case.
