I have written a small Python script that generates test sets for my project.
The script produces two datasets with the same dimensions n*m: one contains binary values (0 and 1), the other contains floats.
It runs fine and produces the output I need, but when I scale up to many dimensions, the for-loop in pick_random() dominates the computation time.
How can I get rid of that loop, perhaps with some vectorized NumPy operation?
What throws my reasoning off is the if-statement, because the sampling has to happen with a given probability.
Here is the relevant part of the script (with example values for n and m, and stub versions of the two range helpers so the excerpt runs on its own):

import random
import numpy as np
import pandas as pd

n, m = 100, 100  # example dimensions; the slowdown shows up when these get large

def get_10_20():
    # Stub for the helper defined earlier in my script: a float in [10, 20).
    return random.uniform(10, 20)

def get_20_30():
    # Stub for the helper defined earlier in my script: a float in [20, 30).
    return random.uniform(20, 30)

# Probabilities must sum to 1
AMOUNT1 = {0.6: get_10_20,
           0.4: get_20_30}
AMOUNT2 = {0.4: get_10_20,
           0.6: get_20_30}
OUTCOMES = [AMOUNT1, AMOUNT2]

def pick_random(prob_dict):
    '''
    Given a dictionary mapping probabilities to outcome functions,
    pick one function with its probability, call it and return its result.
    '''
    r, s = random.random(), 0
    for prob in prob_dict:
        s += prob
        if s >= r:  # cumulative probability has reached r -> take this outcome
            return prob_dict[prob]()

def compute_trade_amount(action):
    '''
    Select an amount with probabilities that depend on the action (0 or 1).
    '''
    return pick_random(OUTCOMES[action])

ACTIONS = pd.DataFrame(np.random.randint(2, size=(n, m)))  # binary dataset
AMOUNTS = ACTIONS.applymap(compute_trade_amount)           # float dataset; this is the slow part
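
For what it's worth, this is roughly the NumPy-only replacement I am picturing. It is only a rough sketch: it assumes the stub versions of get_10_20/get_20_30 above (plain uniform draws) and takes the probabilities from AMOUNT1/AMOUNT2. Is this the right direction?

# Sketch of a vectorized version: draw everything at once instead of looping per cell.
actions = np.random.randint(2, size=(n, m))       # binary dataset

# action 0 -> AMOUNT1 -> P(range 10-20) = 0.6; action 1 -> AMOUNT2 -> P(range 10-20) = 0.4
p_low = np.where(actions == 0, 0.6, 0.4)
use_low = np.random.random(size=(n, m)) < p_low   # True -> range 10-20, False -> range 20-30

low = np.random.uniform(10, 20, size=(n, m))
high = np.random.uniform(20, 30, size=(n, m))
amounts = np.where(use_low, low, high)

ACTIONS = pd.DataFrame(actions)
AMOUNTS = pd.DataFrame(amounts)

I am mainly unsure whether this reproduces the same distributions as the pick_random() loop.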