I have written a small Python script that generates test sets for my project.
The script produces two datasets with the same dimensions n*m: one contains binary values (0 and 1), the other contains floats.
It runs fine and produces the output I need, but when I scale up to many dimensions, the for-loop in pick_random() dominates the computation time.
How can I get rid of that loop, perhaps with some vectorized NumPy operation?
What throws my reasoning off is the if-statement, because the sampling has to happen with a given probability.
Here is the relevant part of the script (with example values for n and m, and stub versions of the two range helpers so the excerpt runs on its own):

import random
import numpy as np
import pandas as pd

n, m = 100, 100  # example dimensions; the slowdown shows up when these get large

def get_10_20():
    # Stub for the helper defined earlier in my script: a float in [10, 20).
    return random.uniform(10, 20)

def get_20_30():
    # Stub for the helper defined earlier in my script: a float in [20, 30).
    return random.uniform(20, 30)

# Probabilities must sum to 1
AMOUNT1 = {0.6: get_10_20,
           0.4: get_20_30}
AMOUNT2 = {0.4: get_10_20,
           0.6: get_20_30}
OUTCOMES = [AMOUNT1, AMOUNT2]

def pick_random(prob_dict):
    '''
    Given a dictionary mapping probabilities to outcome functions,
    pick one function with its probability, call it and return its result.
    '''
    r, s = random.random(), 0
    for prob in prob_dict:
        s += prob
        if s >= r:  # cumulative probability has reached r -> take this outcome
            return prob_dict[prob]()

def compute_trade_amount(action):
    '''
    Select an amount with probabilities that depend on the action (0 or 1).
    '''
    return pick_random(OUTCOMES[action])

ACTIONS = pd.DataFrame(np.random.randint(2, size=(n, m)))  # binary dataset
AMOUNTS = ACTIONS.applymap(compute_trade_amount)           # float dataset; this is the slow part
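
For what it's worth, this is roughly the NumPy-only replacement I am picturing. It is only a rough sketch: it assumes the stub versions of get_10_20/get_20_30 above (plain uniform draws) and takes the probabilities from AMOUNT1/AMOUNT2. Is this the right direction?

# Sketch of a vectorized version: draw everything at once instead of looping per cell.
actions = np.random.randint(2, size=(n, m))       # binary dataset

# action 0 -> AMOUNT1 -> P(range 10-20) = 0.6; action 1 -> AMOUNT2 -> P(range 10-20) = 0.4
p_low = np.where(actions == 0, 0.6, 0.4)
use_low = np.random.random(size=(n, m)) < p_low   # True -> range 10-20, False -> range 20-30

low = np.random.uniform(10, 20, size=(n, m))
high = np.random.uniform(20, 30, size=(n, m))
amounts = np.where(use_low, low, high)

ACTIONS = pd.DataFrame(actions)
AMOUNTS = pd.DataFrame(amounts)

I am mainly unsure whether this reproduces the same distributions as the pick_random() loop.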