I want to subsample a numpy array (shape = (0,n)) so that the distribution of elements in train and test remains approximately the same, or at least so that each class has at least one element in both train and test. E.g.:

a = [1,2,3,1,3,3,2,1,2,1]
train = [1,1,2,2,3,3]
test = [1,1,2,3]

I want to subsample my parameters and outputs based on the outputs. For now, I am using np.random.choice to pick random indices. Is there a way to check the distribution in Python?

  • If your data set is large enough compared to the number of unique elements, np.random.choice should do the job. Commented Sep 11, 2017 at 8:48
  • It's small ~100 Commented Sep 11, 2017 at 8:57

1 Answer

You can use the collections module from Python's standard library.

>>> from collections import Counter
>>> a = [1,2,3,1,3,3,2,1,2,1]
>>> count_a = Counter(a)
>>> count_a
Counter({1: 4, 2: 3, 3: 3})

The Counter object works like a dictionary. From there, you can choose what fraction of each element goes into the training set while keeping at least one of each, e.g.:

>>> from itertools import chain
>>> train_fraction = 0.7
>>> train = list(chain.from_iterable([[i]*int(max(count_a[i]*train_fraction, 1)) for i in count_a.keys()]))
>>> train
[1, 1, 2, 2, 3, 3]
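
The counts above tell you how many of each class to take, but not which positions of the original array end up in train and which in test. As a minimal sketch of that step, assuming the labels are a 1-D array, the per-class split can be done on indices with numpy. The names stratified_split, train_fraction and seed are hypothetical and not part of the original answer, and the rounding rule is one choice among several.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices per class; classes with >= 2 samples land in both sets."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions that hold this class.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(len(idx) * train_fraction))
        # Keep at least one sample on each side when the class has two or more.
        n_train = min(max(n_train, 1), max(len(idx) - 1, 1))
        train_idx.extend(idx[:n_train].tolist())
        test_idx.extend(idx[n_train:].tolist())
    return (np.array(sorted(train_idx), dtype=int),
            np.array(sorted(test_idx), dtype=int))

a = np.array([1, 2, 3, 1, 3, 3, 2, 1, 2, 1])
train_idx, test_idx = stratified_split(a)

# Check the class distribution on each side of the split.
train_classes, train_counts = np.unique(a[train_idx], return_counts=True)
test_classes, test_counts = np.unique(a[test_idx], return_counts=True)
print(dict(zip(train_classes.tolist(), train_counts.tolist())))
print(dict(zip(test_classes.tolist(), test_counts.tolist())))
```

Because the function returns indices, the same train_idx and test_idx can be applied to both the parameters and the outputs. If scikit-learn is available, train_test_split with its stratify argument covers the same use case.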