
I want to evaluate a binary function at M different locations. The binary function has dimension D, so the space of inputs is exponential in size, 2^D (i.e. its inputs are all {0,1}^D "strings"). I want its evaluation at M unique/different points that are evenly/randomly distributed over the input space. I know how to generate all 2^D arrays if I wanted to (essentially just tack a 0 or a 1 onto every array generated so far; (pseudo)code in the appendix). However, that's exponential, and I only need a fraction of them, M points, not all 2^D. Is there a quick way to do it? I thought of generating the binary arrays randomly and, if there is a collision, trying another value until it is unique. As in:

import random

hash_table = set()                # set of arrays generated so far
while len(hash_table) < M:        # assumes M and D are already defined
    b_D = tuple(random.getrandbits(1) for _ in range(D))  # each bit is 0 or 1 with probability 1/2
    if b_D not in hash_table:     # unique so far; on a collision we just loop and try again
        hash_table.add(b_D)

However, according to the answer I got here, that algorithm seems to be exponential. Is there a better way to do this, where the runtime of my algorithm depends only on M instead of on something exponential in D?

Note that it's not acceptable to generate the arrays in a fixed order until we have M points, because I need the points to be randomly distributed among all the binary arrays. I.e. starting from [0,0,...,0], then [1,0,...,0], and so on until we have M is not what I want, because most of my arrays would end in 0's, and I want an unbiased distribution over them.


Appendix: code to generate all 2^D arrays:

def _generate_input_data_full(D):
    '''
    "full" means generate all 2^D arrays.
    (Note: this version encodes the bits as -1/1 rather than 0/1.)
    '''
    prev_all_permutations = [ [-1], [1] ]
    current_all_permutations = prev_all_permutations
    for k in range(D - 1):
        # extend every array generated so far by one more bit
        current_all_permutations = []
        for prev_perm in prev_all_permutations:
            for b in [-1, 1]:
                current_all_permutations.append(prev_perm + [b])
        prev_all_permutations = current_all_permutations
    return current_all_permutations
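
As a quick sanity check (assuming the return statement above): for D = 3 the function should produce all 2^3 = 8 arrays.

all_arrays = _generate_input_data_full(3)
print(len(all_arrays))   # 8
print(all_arrays[0])     # [-1, -1, -1]
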
    You'd better tell us your D and M. Commented Mar 2, 2017 at 4:24
  • @StefanPochmann Why does that matter? They are variables. Initially it will be D=8, M=60,000, but I might increase it to D=256 and M=120,000. Commented Mar 2, 2017 at 4:25
  • It matters a lot. Your D=8, M=60,000 doesn't even work (you can't find 60000 different points when there are only 256). And with D=256 and M=120,000 you're practically guaranteed to not have any collisions so it's O(MD). It only degenerates to being really bad if you're somewhat close to exhausting the space. Commented Mar 2, 2017 at 4:29
  • Yea, I guess that's right XD, D and M matter. But since D leads to an exponential 2^D, I tend to forget that small values of D actually lead to reasonable numbers (I was thinking too much about the rate of growth). Commented Mar 3, 2017 at 16:23
  • Another demonstration that D and M matter, or rather that their relationship matters, is in the random.sample source code. It picks one of two different algorithms, depending on how many elements are wanted and on how large the population is. Commented Mar 3, 2017 at 17:08
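
A minimal sketch of the random.sample idea from the last comment, assuming 2^D is small enough that the whole index range can be sampled directly (sizes below are illustrative only):

import random

D, M = 8, 60                                  # illustrative; requires M <= 2**D
indices = random.sample(range(2 ** D), M)     # M distinct integers, uniform, without replacement
arrays = [[(i >> k) & 1 for k in range(D)] for i in indices]  # unpack each integer into its D bits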

1 Answer


As mentioned in the comments, if you use something like your D=256 and M=120,000, then your generate-random-and-filter-collisions approach is practically O(MD), because you won't have any collisions at all.

You pretty much only have a chance of getting collisions if you ask for a huge number of arrays. As you might know from the birthday paradox, you need about sqrt(|space|) elements for a 50% chance of having any collision at all. So with D=256, you'd need to ask for about M = 2^128 = 340282366920938463463374607431768211456 arrays just for a 50% chance of a single collision.

Relevant news: just a few days ago the first public SHA-1 collision was published. That's only 160 bits, and many very smart people spent a lot of time and computing power to intentionally find one collision. You really don't have any practical chance of accidentally creating collisions with 256 random bits.
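
For concreteness, a minimal sketch of that regime (sizes are illustrative): generate each array directly and keep a set only as a safety net; with 256 random bits per draw the retry branch essentially never runs, so the expected cost is O(MD).

import random

D, M = 256, 120_000
seen = set()
while len(seen) < M:
    seen.add(random.getrandbits(D))           # one random D-bit integer; a collision is astronomically unlikely
arrays = [[(x >> k) & 1 for k in range(D)] for x in seen]  # expand each integer into its D bits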


4 Comments

"You really don't have any practical chance to accidentally create collisions with 256 random bits." - To the extent that the bits are random, he doesn't get any choice about whether there's a collision or not. :-)
@AustinHastings Not sure what you're saying. Is there something wrong with what I said?
Do I really have to code the binary-array generation myself? It seems that this functionality must be implemented somewhere already!
@CharlieParker What do you mean "do I really have to..."? Did someone say that? Anyway, there's the random module, there's os.urandom, and NumPy also offers random data functionality. I think those are the best options.
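
For example, a minimal NumPy sketch (this does not enforce uniqueness, but for large D duplicate rows are vanishingly unlikely):

import numpy as np

D, M = 256, 120_000
bits = np.random.randint(0, 2, size=(M, D), dtype=np.uint8)   # M rows of D random 0/1 bits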
