0

I wrote the following method (in python 2.7) that generates a set of integers and transform them into binary representation. It takes self-explanatory two parameters: total_num_nodes and dim. It returns numpy matrix-like containing the binary representation of all these integers:

def generate(total_num_nodes, dim):

    # Generate random nodes from the range (0, dim-1) 
    nodes_matrix = [random.randint(0, 2 ** dim - 1) for _ in range(total_num_nodes)]

    # Removes duplicates
    nodes_matrix = list(set(nodes_matrix))

    # Transforms each node from decimal to string representation
    nodes_matrix = [('{0:0' + str(dim) + 'b}').format(x) for x in nodes_matrix]

    # Transforms each bit into an integer.
    nodes_matrix = np.asarray([list(map(int, list(x))) for x in nodes_matrix], dtype=np.uint8)

    return nodes_matrix

The problem is that when I pass very large values, say total_num_nodes= 10,000,000 and dim=128, the generation time takes really long time. A friend of mine hinted me that the following line is actually a bottleneck and it is likely to be responsible for the majority of computation time:

# Transforms each node from decimal to string representation
nodes_matrix = [('{0:0' + str(dim) + 'b}').format(x) for x in nodes_matrix]

I cannot think of other faster method that can replce this line so that I get to speedup the generation time when it is running on a single processor. Any suggestion from you is really really appreciated.

Thank you

5
  • Instead of guessing, why don't you try benchmarking the program?\ Commented Mar 12, 2018 at 1:58
  • 1
    Unfortunately your guess seems to be wrong. Commented Mar 12, 2018 at 2:06
  • @user202729 Thank you very much for this. I never thought about that. I am still a bit intermediate programmer. I am surprised that the mapping is line that takes the majority of time!!! Commented Mar 12, 2018 at 2:09
  • 1
    I will write a complete answer later. For now, a hint (verified on TIO): Use ord. Commented Mar 12, 2018 at 2:12
  • @user202729 Thank you very much. I cannot wait to see your answer. Commented Mar 12, 2018 at 6:44

1 Answer 1

1

Do it all in numpy and it will be faster.

The following generates total_num_nodes rows of dim columns of np.uint8 data, then keeps the unique rows by providing a view of the data suitable for np.unique, then translating back to a 2D array:

import numpy as np

def generate(total_num_nodes, dim):
    a = np.random.choice(np.array([0,1],dtype=np.uint8),size=(total_num_nodes,dim))
    dtype = a.dtype.descr * dim
    temp = a.view(dtype)
    uniq = np.unique(temp)
    return uniq.view(a.dtype).reshape(-1,dim)
Sign up to request clarification or add additional context in comments.

2 Comments

Impressive! Thank you so so much
My method is slower anyway (about 2.5x), so I won't post. ( @Lauren) (using a != '0' seems to be slightly faster than ord(a) - 48)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.