3

I want to generate 200 random binary arrays of size 30, but I need all of them to be unique! I use to following to generate the arrays:

import numpy as np

parents = []
for i in range(200):
    parents.append(np.random.choice([0, 1], size=(30)))
parents = np.vstack(parents)

Are the arrays made in this way already unique (if so, how may I check that)? If not, how do I modify my code so I get unique arrays?

5
  • 1
    No, they are not, and there are 0 guarantees that they ever will be. Your best bet is to deterministically generate unique rows and then permute them. Commented May 2, 2018 at 23:26
  • you can generate them and then compare all arrays for non-unique occurrences to replace them with a new random array. Since 200<2^30 there should be plenty of cases. This way your arrays will be "guaranteed" to be unique Commented May 2, 2018 at 23:33
  • @cᴏʟᴅsᴘᴇᴇᴅ Thanks for your comment! if I use permute/shuffle command in a loop, it will automatically result in unique arrays? Even if this is true, the issue herein is that, regardless the arrays are uniqe, I will be generating arrays in which the number of 1s and 0s are exactly the same (like 5 ones and 25 zeroes). Do you have any idea how I can cope with that? Commented May 2, 2018 at 23:36
  • @anishtain4 True, but how I may find the similar arrays, and how I may replace them with some new arrays that are not already in my list is my question indeed!! Commented May 2, 2018 at 23:38
  • What is your uniqueness condition? Are they assumed to be sets or lists, i.e. are [0,1] and [1,0] the same arrays? Commented May 2, 2018 at 23:42

2 Answers 2

4

They are not unique, since if they were, then they would not be uniformly random each time.

You could perform a quick check each time you generate a new array, and make sure it is unique (using set lookups rather than array lookups for larger parent arrays):

parents = []
for i in range(200):
    unique_found = False
    while not unique_found:
        candidate_array = np.random.choice([0, 1], size=(30))
        if not any((candidate_array == x).all() for x in parents):
            unique_found = True
    parents.append(candidate_array)

However, since there are 1,073,741,824 unique binary arrays of length 30, the probability of getting 2 or more duplicates is:

1 - (1 - (1/1,073,741,824)) ^ (200 choose 2) = 0.0000185, or about 1 time of every 54,000.

So you might be ok with ignoring the problem.

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for your answer, but I get the following error when I run your code: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"
Oops, forgot we were using numpy arrays. It'd look something like: not any((candidate_array == x).all() for x in parents
thanks! Could you please explain how you calculated that probability too?! Or at least give me a hint/reference so I can find out how to do so if you do not mind?!
Sure! It's the same as the birthday problem, which is very well documented: en.wikipedia.org/wiki/Birthday_problem
2

Given that the length is large enough, the probability of generating two equal arrays is very low. Rejection sampling will be very fast:

import numpy as np

parents = set()
while len(parents) < 200:
    a = tuple(np.random.choice([0, 1], size=(30)))
    if a not in parents: parents.add(a)
parents = np.array([list(x) for x in parents])

And also, using set() for membership check is faster than arrays.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.