
I have billions of lines of strings like this in memory: 1010101110100111100100101. I need to convert each into a list of binary integers, but it takes minutes, which seems too slow. My code:

def string2vec(binary_str):
    return [int(binary_str[i:i + 8], base=2) for i in range(0, 64, 8)]

result = [string2vec(l) for l in lines]  # this code is slow


Each binary_str has length 64, and every 8 binary characters should become one integer.

  • Have you tried parsing it 64 bits at a time? Then you can extract the 8-bit chunks from those “large” integers using bit-shift operations. Commented Nov 25, 2019 at 7:29
  • This is a perfect example of why you do not store numbers as binary (0/1) characters. How do these numbers get into memory? Read from a file? Then you should think about using awk or C to parse your data. Commented Nov 25, 2019 at 8:18
  • You may need to show more of your code; specifically, how you load it into memory from files. The bottleneck may not be in the conversion but in how you're "loading it into memory". Commented Nov 25, 2019 at 8:35
  • OK, that means it is almost 60,000,000 lines. Each line creates 8 ints, which means in total you are allocating at least 480 MB of data just for the integers. The bottleneck is not really caused by your string-to-int logic, then. A simple [[i, i+1, i+2, i+3, i+4, i+5, i+6, i+7] for i in range(50000000)] takes me forever to run. Instead of using a list comprehension and creating a whole list (which could cost you over 1 GB of memory), would you consider using a generator expression instead? Commented Nov 25, 2019 at 9:58
  • I have done an experiment with 64-bit int parsing + bitwise operations. It takes less than half the time. Could the OP advise in what case the result is not correct? Commented Nov 26, 2019 at 1:37
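The 64-bit-parse-plus-bitshift idea from the comments above can be sketched like this (a minimal example, not the commenter's actual code; the name string2vec_shift is illustrative, and it assumes the leftmost 8 characters should become the first integer, matching the slicing in the question):

```python
def string2vec_shift(binary_str):
    n = int(binary_str, 2)  # parse all 64 bits with one call
    # Extract the 8-bit chunks from most-significant to least-significant.
    return [(n >> shift) & 0xFF for shift in range(56, -8, -8)]
```

This replaces eight slice-and-parse operations per line with one parse plus eight cheap bit operations.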

3 Answers


Each binary_str has length 64, and every 8 binary characters should become one integer.

All that string slicing and Python looping is expensive. Use int(s, 2) to convert the whole binary string to an integer. Then use array to manage the integers as 64-bit integers and convert them to 8-bit integers. You can decide whether you want big- or little-endian results for the bytes:

import random
import time
import array

ints = [random.randrange(1<<64) for _ in range(1000)] # Make 1000 integers
strs = [f'{n:064b}' for n in ints]                    # Represent as binary strings
print(f'{ints[0]:016X} {strs[0]}')

start = time.perf_counter()
ints2 = [int(s,2) for s in strs]  # convert all the strings to integers
a = array.array('Q',ints2)        # Store the converted ints in an array.  Q = quadwords (64-bit ints)
a.byteswap()                      # Optional if you want the opposite endian-ness of your machine.
b = array.array('B')              # Another array of bytes
b.frombytes(a.tobytes())          # Populate byte array with the bytes from the quadword array.
print(time.perf_counter() - start)

assert ints == ints2
print([hex(n) for n in b[:8]])

Output:

1E27DFA21406A338 0001111000100111110111111010001000010100000001101010001100111000
0.0005346000000372442
['0x1e', '0x27', '0xdf', '0xa2', '0x14', '0x6', '0xa3', '0x38']

My machine is little-endian (most are). This converts one thousand 64-digit binary strings to integers, stores them in an array, byte-swaps them into big-endian order, then remaps the array's bytes to a byte array, all in 534.6 microseconds on my machine. I've displayed the first 64-character string with its hexadecimal representation, and the first 8 bytes of the final result. If you truly have "billions" of these strings, it'll take about 9 minutes per billion, but don't read them all into memory at once :)
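To avoid holding everything in memory at once, the same conversion can be applied one chunk of lines at a time. A minimal sketch, assuming the lines arrive from any iterable (such as a file object), and using sys.byteorder so the byte-swap only happens on little-endian machines:

```python
import array
import sys

def convert_chunk(lines):
    """Convert an iterable of 64-character binary strings to big-endian bytes."""
    a = array.array('Q', (int(s, 2) for s in lines))  # parse each line as a 64-bit int
    if sys.byteorder == 'little':
        a.byteswap()              # native order is little-endian; flip to big-endian
    b = array.array('B')          # byte (8-bit) array
    b.frombytes(a.tobytes())      # reinterpret the quadwords as individual bytes
    return b
```

Calling this per chunk (e.g. per 100,000 lines read from a file) keeps peak memory bounded while preserving the fast array-based conversion.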




EDIT: It appears this functionality may be built into python; see comments. I'll leave this answer because it provides a minimal working example of a C library for Python that manipulates arrays, which I was not able to find elsewhere online.

I agree with many of the comments that something has clearly gone wrong if you have a bunch of binary strings in human-readable format sitting around in memory. However, if there are reasons outside your control that this can't be avoided, you could try writing the relevant functionality in C. Here's a straightforward example to start from:

#include <Python.h>

static PyObject * binary_string(PyObject * self, PyObject * args);

static PyMethodDef PyBinaryString_methods[] =
{
  { "binary_string", binary_string, METH_VARARGS, "binary string" },
  { NULL, NULL, 0, NULL }
};

static struct PyModuleDef PyBinaryString_module =
{
  PyModuleDef_HEAD_INIT,
  "PyBinaryString",
  "Binary String",
  -1,
  PyBinaryString_methods
};

PyMODINIT_FUNC PyInit_PyBinaryString(void)
{
  return PyModule_Create(&PyBinaryString_module);
}

static PyObject * binary_string(PyObject * self, PyObject * args)
{
  const char * string;

  char buf[8];

  if(!PyArg_ParseTuple(args, "s", &string)) { return NULL; }

  for(int i = 0; i < 8; i++)
  {
    buf[i] = 0;

    for(int j = 0; j < 8; j++)
    {
      buf[i] |= (string[8 * i + j] & 1) << (7 - j);
    }
  }

  return PyByteArray_FromStringAndSize(buf, 8);
}

Here I'm exploiting the fact that the string is going to consist of ASCII '0' and '1' characters exclusively, and that the ASCII code for the former is even whereas the ASCII code for the latter is odd.
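For readers who want to see what the C loop is doing, the same ASCII least-significant-bit trick can be written in pure Python (illustrative only; this is slower than int(s, 2), since the point of the C version is to escape Python's interpreter overhead):

```python
def binary_string_py(s):
    """Pure-Python equivalent of the C binary_string function above."""
    buf = bytearray(8)
    for i in range(8):
        v = 0
        for j in range(8):
            # '0' is 0x30 (even) and '1' is 0x31 (odd), so the low bit
            # of the character code is the bit value itself.
            v |= (ord(s[8 * i + j]) & 1) << (7 - j)
        buf[i] = v
    return buf
```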

On my system I can compile this via

cc -fPIC -shared -O3 -I/usr/include/python -o PyBinaryString.so PyBinaryString.c

and then use it in Python like so:

>>> from PyBinaryString import binary_string
>>> binary_string("1111111111111111111111111111111111111111111111111111111100000000")
bytearray(b'\xff\xff\xff\xff\xff\xff\xff\x00')

I'm not a Python programmer, so someone might be able to provide a better way of getting data in/out of the python object formats. However on my machine this runs about an order of magnitude faster than the native python version.

If you know more about the layout in memory -- say if you know that all the strings of ASCII '0' and '1' characters are contiguous -- you could have the C code convert everything at once, which would probably speed things up further.
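A comparable "convert everything at once" effect is available in pure Python too: join many lines into one long string and make a single int conversion call, pushing all the per-character work into one C-level operation. A sketch, assuming big-endian byte order is wanted and the joined length is a multiple of 8 bits:

```python
def convert_all(lines):
    """Convert a list of 64-character binary strings to one bytes object."""
    joined = ''.join(lines)                     # one long run of '0'/'1' characters
    n = int(joined, 2)                          # single conversion call
    return n.to_bytes(len(joined) // 8, 'big')  # fixed length keeps leading zeros
```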

1 Comment

int(s,2).to_bytes(8,'big') does the same thing.
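For reference, that one-liner applied to the example string from this answer (big-endian output, matching the C version):

```python
s = "1111111111111111111111111111111111111111111111111111111100000000"
print(int(s, 2).to_bytes(8, 'big'))  # prints b'\xff\xff\xff\xff\xff\xff\xff\x00'
```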

Since there are only 2^8 = 256 possible values, you could construct a lookup table (in the form of a dict) with the 256 possible 8-character strings as keys and the corresponding integers as values.
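A minimal sketch of that idea (the names LOOKUP and string2vec_lookup are illustrative, not from the answer):

```python
# Precompute all 256 eight-character binary strings once.
LOOKUP = {format(v, '08b'): v for v in range(256)}

def string2vec_lookup(binary_str):
    # Replace int(chunk, 2) with a dict lookup per 8-character chunk.
    return [LOOKUP[binary_str[i:i + 8]] for i in range(0, 64, 8)]
```

As the comments below note, each lookup still hashes an 8-character string, so the saving over int(chunk, 2) is limited.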

2 Comments

Thank you. I have tried it, but it only improved things a little.
It is not free to calculate the hash of a string. I doubt the hash computation will be noticeably faster than int(str, base).
