
I ran into this performance issue when I tried to convert a huge 2D list of binary data to decimal numbers.

Given a list:

biglist = [
  [[1,0],[0,0],[1,1]],
  [[0,0],[1,1],[1,0]],
  #...
  #easily go to thousands of rows
]

In each row, I want to combine the first element of each column and convert the result to a decimal number:

E.g.

In row 0, I want int('101',2) i.e. 5

In row 1, I want int('011',2) i.e. 3

My final goal is to create a dictionary that counts how many times each integer appears. For the example data above, the final result should be a dictionary of {a_int : appearance_count} pairs like this:

{5: 1, 3: 1}

Now my solution is this:

result = {}
for k in biglist:
    num = int("".join(str(row[0]) for row in k), 2)
    #count
    if num not in result:
        result[num] = 1
    else:
        result[num] += 1

This loop is slow for a list of thousands of rows. Is there a better solution?

  • I guess if your code is working fine but you want to enhance it, you should post this in Code Review Stack Exchange Commented Oct 15, 2017 at 16:21
  • What is the range of possible integer values? Commented Oct 15, 2017 at 16:24
  • Are each of your "rows" the same amount of columns? ie - you're only dealing with 3 bits here... or could they be much larger? Commented Oct 15, 2017 at 16:46
  • @JonClements Yes their widths are all the same and they can be any positive number of bits. Commented Oct 15, 2017 at 22:10

3 Answers


Just accumulate the bits into an integer value instead of doing string-to-int conversions (pseudocode):

for every row:
    value = 0
    for every col:
        value = (value << 1) | biglist[row][col][0]  # shift left, OR in the next bit
        # equivalent operation:
        # value = value * 2 + biglist[row][col][0]
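A runnable sketch of this approach (function name mine), combining the bit-shift accumulation with the dict count from the question:

```python
def count_by_shift(biglist):
    """Count how often each row's first-bit value appears."""
    result = {}
    for row in biglist:
        value = 0
        for col in row:
            value = (value << 1) | col[0]  # shift left, append next bit
        result[value] = result.get(value, 0) + 1
    return result

biglist = [
    [[1, 0], [0, 0], [1, 1]],
    [[0, 0], [1, 1], [1, 0]],
]
print(count_by_shift(biglist))  # -> {5: 1, 3: 1}
```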

1 Comment

Thank you. This improved performance significantly compared with the string join, enough to meet my performance requirement without using an extra library.

First, string-to-int conversion with join is convenient but slow. Compute the value from the powers of 2 directly, using sum, enumerate, and a bit shift on the ones (skip the zeros).

Second, you should use collections.Counter for this

In one line:

result = collections.Counter(sum(v[0]<<(len(k)-i-1) for i,v in enumerate(k) if v[0]) for k in biglist)

This code runs about 30% faster than your original code on my machine.
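For completeness, here is the one-liner as a self-contained snippet with the import and the sample data from the question:

```python
import collections

biglist = [
    [[1, 0], [0, 0], [1, 1]],
    [[0, 0], [1, 1], [1, 0]],
]

# For each row k, add up 2**position for every set first bit, then count.
result = collections.Counter(
    sum(v[0] << (len(k) - i - 1) for i, v in enumerate(k) if v[0])
    for k in biglist
)
print(dict(result))  # -> {5: 1, 3: 1}
```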

1 Comment

Thank you. It is indeed a good idea to use collections.Counter to count. The performance improvement is remarkable.

If you need performance, you should use numpy or numba, whose low-level routines run at nearly C speed:

import numpy as np
bigarray=np.random.randint(0,2,10**4*3*2).reshape(10**4,3,2)
biglist=[[[e for e in B] for B in A] for A in bigarray]
# [[[1, 0], [0, 0], [1, 0]],
#  [[0, 0], [0, 1], [0, 1]], 
#  [[1, 0], [1, 0], [0, 0]], ...

def your_count(biglist):
    integers=[]
    for k in biglist:
        num = int("".join(str(row[0]) for row in k), 2)
        integers.append(num)
    return integers

def count_python(big):
    m=len(big)
    integers=np.empty(m,np.int32)
    for i in range(m):
        n=len(big[i])
        b=1
        s=0
        for j in range(n-1,-1,-1):
            s = s + big[i][j][0]*b
            b = b*2
        integers[i]=s
    return integers

def count_numpy(bigarray):
    integers = (bigarray[:,:,0]*[4,2,1]).sum(axis=1)
    return integers

from numba import njit    
count_numba = njit(count_python)
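Since the comments say the rows can be any positive number of bits wide, the hard-coded [4,2,1] in count_numpy only works for 3 columns. A sketch of a generalized version (function name mine) that derives the powers of two from the array shape:

```python
import numpy as np

def count_numpy_any_width(bigarray):
    """Like count_numpy, but works for any row width.

    bigarray has shape (rows, width, 2); we take the first
    element of each pair and weight it by its power of two.
    """
    width = bigarray.shape[1]
    weights = 1 << np.arange(width - 1, -1, -1)  # e.g. [4, 2, 1] for width 3
    return (bigarray[:, :, 0] * weights).sum(axis=1)

sample = np.array([
    [[1, 0], [0, 0], [1, 1]],
    [[0, 0], [1, 1], [1, 0]],
])
print(count_numpy_any_width(sample))  # -> [5 3]
```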

And some tests:

In [125]: %timeit your_count(biglist)
145 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [126]: %timeit count_python(biglist)
29.6 ms ± 1.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [127]: %timeit count_numpy(bigarray)
354 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [128]: %timeit count_numba(bigarray)
73 µs ± 938 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Numba allows you to compile a low-level version of some Python code (though not yours, because Numba doesn't handle strings and lists, only numpy arrays). Numpy gives you special syntax to do powerful things in one instruction, with good performance.

The Numba solution is here 2000x faster than yours.

The counts can then be efficiently computed with collections.Counter or np.unique:

In [150]: %timeit {k:v for k,v in zip(*np.unique(integers,return_counts=True))} 
46.4 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [151]: %timeit Counter(integers)
218 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

1 Comment

This works and it really helps me understand numpy. Thanks.
