1

I have a set of existing data, let's say:

sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]

From this sample data, I would like to generate a random set of data of a given length. The new values should not be drawn from the sample data itself, but from a distribution fitted to the sample data.

Expected output if I wanted 5 random points:

output_data = [3.4,2.3,1.5,5.2,1.3]

3
  • possible duplicate of: stackoverflow.com/questions/22741319/… Commented Feb 1, 2019 at 17:38
  • Provide expected output from above input. Commented Feb 1, 2019 at 18:11
  • provided expected output. Commented Feb 1, 2019 at 18:51

3 Answers

2

Use random.sample:

import random

sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]
# if you want to select 5 samples from above data
print(random.sample(sample_data, 5))

Output:

[3, 2, 2, 4, 2]

4 Comments

Hey - I don't want to select x samples from the data, but rather generate new data based on the existing data.
What's the difference between your former and latter sentences? Maybe you need to edit the question and elaborate further.
To clarify - I would like to fit a distribution to a data set, and then create a random set of data based on that distribution.
@BrianChen This is not what was asked in the question, please edit.
1
import numpy as np
length = 3
sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]

np.random.choice(sample_data, length, False) #Sampling without replacement
Out[287]: array([4, 4, 2])
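
For reference, sampling with replacement is the same as drawing from the empirical distribution of the data, so the requested length can exceed the length of the list. A minimal sketch, assuming NumPy's Generator API is available; the seed and the names values, probabilities and empirical_draws are only illustrative:

```python
import numpy as np

sample_data = [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
length = 30  # may exceed len(sample_data) because we sample with replacement

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Empirical distribution: each distinct value weighted by its relative frequency
values, counts = np.unique(sample_data, return_counts=True)
probabilities = counts / counts.sum()

# Draw `length` points from the empirical distribution
empirical_draws = rng.choice(values, size=length, replace=True, p=probabilities)
```

Note that this can only ever return values that already occur in sample_data (here 2, 3 and 4); it never produces new values such as 2.7.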

6 Comments

Hey - I don't want to select x samples from the data, but rather generate new data based on the existing data.
@BrianChen just remove False from the above code and run it with length set to 30, for example.
It is still just outputting values from the data set, not generating new data points based on a distribution.
What do you mean by generating new data points based on a distribution? Can you elaborate?
Hey, thanks for replying - I would like Python to determine which kind of distribution the data best fits (along with its parameters) and use that to generate x random data points from the fitted distribution. For example, if my data set best fits a normal distribution with parameters (10, 1), then use that normal distribution to generate 15 new data points.
1

There's an important premise of the question that needs to be settled first: what kind of distribution do you want? Humans can often classify a distribution by its shape when there is enough data, but a machine cannot do so on its own, and assigning a distribution type, say uniform or binomial, to a new input is arbitrary. Here I'll give a brief answer using the gold standard of statistics, the normal distribution (by the Central Limit Theorem, the distribution of sample means approaches a normal distribution as the sample size grows).

import numpy as np

sample_data = [2,2,2,2,2,2,3,3,3,3,4,4,4,4,4]
size = 5
new_samples = np.random.normal(np.mean(sample_data), np.std(sample_data), size)

>>> new_samples
array([ 2.01221231,  2.62772975,  1.79965428,  3.83601719,  2.44967777])

The new samples are drawn from a normal distribution that takes its mean and standard deviation from the original samples.
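
If you would rather let the code choose among a few distribution types instead of assuming normal, one possible approach is to fit each candidate with scipy.stats and compare them with a goodness-of-fit measure. This is only a sketch under stated assumptions: the shortlist in candidates is hypothetical, and the Kolmogorov-Smirnov statistic is used here as a rough ranking, not a rigorous test (the parameters are estimated from the same data, and the sample is small with many ties).

```python
import numpy as np
from scipy import stats

sample_data = [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4]
size = 5

# Hypothetical shortlist of candidate distributions; extend as needed
candidates = {"norm": stats.norm, "uniform": stats.uniform, "expon": stats.expon}

results = {}
for name, dist in candidates.items():
    params = dist.fit(sample_data)  # maximum-likelihood parameter estimates
    # Smaller KS statistic = fitted CDF is closer to the empirical CDF
    ks_stat = stats.kstest(sample_data, name, args=params).statistic
    results[name] = (ks_stat, params)

# Pick the candidate with the smallest KS statistic
best_name = min(results, key=lambda n: results[n][0])
best_params = results[best_name][1]

# Generate new points from the best-fitting distribution
new_samples = candidates[best_name].rvs(*best_params, size=size, random_state=0)
```

With only 15 points taking three distinct values, the ranking between candidates is not very reliable, so treat best_name as a suggestion and check it against a histogram of the data.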

1 Comment

Hey, thanks for replying - I would like Python to determine which kind of distribution the data best fits (along with its parameters) and use that to generate x random data points from the fitted distribution. For example, if my data set best fits a normal distribution with parameters (10, 1), then use that normal distribution to generate 15 new data points.
