EDIT: It appears this functionality may be built into python; see comments. I'll leave this answer because it provides a minimal working example of a C library for Python that manipulates arrays, which I was not able to find elsewhere online.
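For reference, a sketch of the built-in approach the comments are presumably referring to: parse the string as a base-2 integer and pack it with `int.to_bytes`. (This assumes 64-character strings and big-endian byte order, matching the C code below.)

```python
# Pure Python, no extension module needed: parse the binary string as a
# base-2 integer, then pack it into 8 big-endian bytes.
s = "1111111111111111111111111111111111111111111111111111111100000000"
result = int(s, 2).to_bytes(8, byteorder="big")
print(result)  # bytearray-equivalent bytes: b'\xff\xff\xff\xff\xff\xff\xff\x00'
```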
I agree with many of the comments that something has clearly gone wrong if you have a bunch of binary strings in human-readable format sitting around in memory. However, if there are reasons outside your control that this can't be avoided, you could try writing the relevant functionality in C. Here's a straightforward example to start from:
    #define PY_SSIZE_T_CLEAN  /* required for the "s#" format since Python 3.10 */
    #include <Python.h>

    static PyObject * binary_string(PyObject * self, PyObject * args);

    static PyMethodDef PyBinaryString_methods[] =
    {
        { "binary_string", binary_string, METH_VARARGS, "binary string" },
        { NULL, NULL, 0, NULL }
    };

    static struct PyModuleDef PyBinaryString_module =
    {
        PyModuleDef_HEAD_INIT,
        "PyBinaryString",
        "Binary String",
        -1,
        PyBinaryString_methods
    };

    PyMODINIT_FUNC PyInit_PyBinaryString(void)
    {
        return PyModule_Create(&PyBinaryString_module);
    }

    static PyObject * binary_string(PyObject * self, PyObject * args)
    {
        const char * string;
        Py_ssize_t length;
        char buf[8];

        /* "s#" also yields the length, so we can reject ill-sized input
           instead of reading past the end of the caller's string. */
        if(!PyArg_ParseTuple(args, "s#", &string, &length)) { return NULL; }
        if(length != 64)
        {
            PyErr_SetString(PyExc_ValueError, "expected a 64-character string");
            return NULL;
        }

        for(int i = 0; i < 8; i++)
        {
            buf[i] = 0;
            for(int j = 0; j < 8; j++)
            {
                buf[i] |= (string[8 * i + j] & 1) << (7 - j);
            }
        }

        return PyByteArray_FromStringAndSize(buf, 8);
    }
Here I'm exploiting the fact that the string is going to consist of ASCII '0' and '1' characters exclusively, and that the ASCII code for the former is even whereas the ASCII code for the latter is odd.
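To make the trick concrete, here is the same loop written in pure Python (a slow reference version, not what you'd ship): the low bit of the ASCII code is 0 for '0' (0x30) and 1 for '1' (0x31), so `c & 1` yields the bit value directly.

```python
# The ASCII parity trick: '0' is 0x30 (even), '1' is 0x31 (odd),
# so masking with 1 extracts the bit without comparing characters.
assert ord("0") & 1 == 0
assert ord("1") & 1 == 1

def binary_string_py(s):
    """Pure-Python rendition of the C loop: 64 chars -> 8 bytes."""
    buf = bytearray(8)
    for i in range(8):
        for j in range(8):
            buf[i] |= (ord(s[8 * i + j]) & 1) << (7 - j)
    return bytes(buf)

print(binary_string_py("1" * 56 + "0" * 8))  # b'\xff\xff\xff\xff\xff\xff\xff\x00'
```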
On my system I can compile this via
cc -fPIC -shared -O3 -I/usr/include/python -o PyBinaryString.so PyBinaryString.c
and then use it in Python like so:
>>> from PyBinaryString import binary_string
>>> binary_string("1111111111111111111111111111111111111111111111111111111100000000")
bytearray(b'\xff\xff\xff\xff\xff\xff\xff\x00')
I'm not a Python programmer, so someone might be able to provide a better way of getting data in and out of the Python object formats. However, on my machine this runs about an order of magnitude faster than the native Python version.
If you know more about the layout in memory -- say if you know that all the strings of ASCII '0' and '1' characters are contiguous -- you could have the C code convert everything at once, which would probably speed things up further.
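As a sketch of what "convert everything at once" would look like (shown in pure Python for clarity; the C version would follow the same loop structure), assuming all the 64-character strings have been concatenated into one big buffer:

```python
# Hypothetical batch conversion: one contiguous text buffer of
# concatenated 64-character binary strings in, packed bytes out.
def convert_all(big_string):
    assert len(big_string) % 64 == 0
    out = bytearray(len(big_string) // 8)
    for k in range(len(out)):
        byte = 0
        for j in range(8):
            # Same ASCII parity trick as the single-string version.
            byte |= (ord(big_string[8 * k + j]) & 1) << (7 - j)
        out[k] = byte
    return bytes(out)
```

A single pass like this avoids the per-call overhead of crossing the Python/C boundary once per string, which is where much of the remaining time tends to go.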