Randomly sampling Pandas dataframe based on distribution of column

Question

Say I have a very large dataframe, which I want to sample to match the distribution of a column of the dataframe as closely as possible (in this case, the 'bias' column).

I run:

train['bias'].value_counts(normalize=True)

and see:

least           0.277220
left            0.250000
right           0.250000
left-center     0.141244
right-center    0.081536

If I want to take a sample of the train dataframe where the distribution of the sample's 'bias' column matches this distribution, what would be the best way to go about it?

Dani Mesejo · Accepted Answer · 2018-09-26 14:41:11Z

6

You can use sample, from the documentation:

Return a random sample of items from an axis of object.

The trick is to call sample on each group separately. A code example:

import pandas as pd

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000, "left-center": 0.141244, "right-center": 0.081536}
data = [['title-{}-{}'.format(i, position), position] for i in range(1000) for position in positions.keys()]
frame = pd.DataFrame(data=data, columns=['title', 'position'])
print(frame.shape)


def sample(obj, replace=False, total=1000):
    return obj.sample(n=int(positions[obj.name] * total), replace=replace)

result = frame.groupby('position', as_index=False).apply(sample).reset_index(drop=True)
print(result.groupby('position').agg('count'))

Output

(5000, 2)
              title
position           
least           277
left            250
left-center     141
right           250
right-center     81

In the above example I created a dataframe with 5000 rows and 2 columns; its shape is the first part of the output.

I am assuming you have a positions dictionary (to convert a DataFrame to a dictionary, see this) with the fraction to be sampled from each group, and a total parameter (i.e. the total number of rows to be sampled).
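If the target proportions are simply those of the dataframe itself, as in the question, the positions dictionary could be built directly from the column. A minimal sketch, assuming a small hypothetical `train` dataframe standing in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe.
train = pd.DataFrame({"bias": ["least"] * 5 + ["left"] * 3 + ["right"] * 2})

# Normalized value counts, converted to a plain {category: proportion} dict.
positions = train["bias"].value_counts(normalize=True).to_dict()
print(positions)
```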

In the second part of the output you can see there are 277 least rows out of 1000, and 277 / 1000 = 0.277, which approximates the required 0.277220; the same goes for the rest of the groups. There is a caveat, though: because int() truncates each group's count, the sample contains 999 rows instead of the intended 1000.
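One possible way to get exactly the requested total is to floor each group's count and then hand the leftover rows to the groups with the largest fractional parts (the largest-remainder method). This is a sketch rather than part of the original answer; `allocate_counts` is a hypothetical helper name.

```python
import math


def allocate_counts(proportions, total):
    """Split `total` into integer counts proportional to `proportions`."""
    weight_sum = sum(proportions.values())
    exact = {k: v / weight_sum * total for k, v in proportions.items()}
    counts = {k: math.floor(x) for k, x in exact.items()}
    leftover = total - sum(counts.values())
    # Give the remaining rows to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000,
             "left-center": 0.141244, "right-center": 0.081536}
counts = allocate_counts(positions, 1000)
print(counts, sum(counts.values()))
```

The resulting counts could then replace `int(positions[obj.name] * total)` in the sample function above, for example as `obj.sample(n=counts[obj.name])`.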

edited Sep 26, 2018 at 14:41

answered Sep 25, 2018 at 17:26

Dani Mesejo


9 Comments

Jonathan Miller Over a year ago

Thank you. I figured you would use pd.sample, but I was having difficulty figuring out the form weights wanted as input.

Jonathan Miller Over a year ago

On second thought, this doesn't seem to be working. My data has many observations, and the least, left, right probabilities are derived from taking the value counts of my data's bias column and normalizing it. I would like to sample my original dataframe so that the sample contains approximately 27.72% least observations, 25% right observations, etc

Dani Mesejo Over a year ago

Why doesn't it seem to be working? Could you be more specific?

Jonathan Miller Over a year ago

If I'm not mistaken, your code seems to be sampling your constructed 'frame', which only contains the position and biases column. My data consists of many more observations, which all have an associated bias value. I want to sample this dataframe so the sample contains distribution of bias values similar to the original dataframe.

Dani Mesejo Over a year ago

In the example above, frame is to be considered a stand-in for your original dataframe. Could you provide an example of your original dataframe? Note that sample could be applied to your original dataframe.


Kenan · Accepted Answer · 2019-05-20 20:55:52Z

1

Here is a one liner to sample based on a distribution

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000, "left-center": 0.141244, "right-center": 0.081536}
total = len(df)

df = pd.concat([df[df['position'] == k].sample(int(v * total), replace=False) for k, v in positions.items()])
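When the target distribution is just the dataframe's own distribution, as in the question, sampling the same fraction from every group preserves the proportions without any dictionary. A minimal sketch using `DataFrameGroupBy.sample` (available from pandas 1.1), with a hypothetical `train` dataframe standing in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe.
train = pd.DataFrame({
    "bias": ["least"] * 500 + ["left"] * 300 + ["right"] * 200,
    "text": [f"doc-{i}" for i in range(1000)],
})

# Draw 10% of the rows within each 'bias' group, so the proportions carry over.
sampled = train.groupby("bias").sample(frac=0.1, random_state=0)
print(sampled["bias"].value_counts(normalize=True))
```

Group sizes are rounded per group, so with small groups the proportions are only approximately preserved.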

answered May 20, 2019 at 20:55

Kenan


Abhishek Divekar · Accepted Answer · 2023-11-15 15:19:33Z

Target-matched sampling:

A general version of this problem is: "I have a source column, which I want to sample such that it matches the distribution of the target column", where both columns are discrete i.e. categories, not floating-point numbers.

The solution mentioned above requires grouping and sampling from each group.

Here's an alternative, which directly outputs the sampled indexes, and also allows specifying the total size of the resulting sample:

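import math
from collections import Counter
from typing import Any, Dict, List, Optional, Set, Tuple, Union

import numpy as np
import pandas as pd


def values_dist(vals: Union[List, Tuple, np.ndarray, pd.Series]) -> pd.Series:
    assert isinstance(vals, (list, tuple, np.ndarray, pd.Series))
    val_counts: pd.Series = pd.Series(Counter(vals))  ## Includes nan and None as keys.
    return val_counts / val_counts.sum()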


def sample_idxs_match_distribution(
        source: Union[List, Tuple, np.ndarray, pd.Series],
        target: Union[List, Tuple, np.ndarray, pd.Series],
        n: Optional[int] = None,
        seed: Optional[int] = None,
        shuffle: bool = True,
        target_is_dist: bool = False,
) -> np.ndarray:
    """
    Sample values from the source so that they follow the target distribution, and return the (optionally shuffled) indexes into the source.
    Selecting these indexes gives a sample of the source whose distribution matches that of the target.
    """
    if not target_is_dist:
        target_prob_dist: pd.Series = values_dist(target)
    else:
        target_prob_dist: pd.Series = target
    assert isinstance(target_prob_dist, pd.Series)
    assert abs(float(target_prob_dist.sum()) - 1.0) <= 1e-2  ## Sum of probs should be exactly or very close to 1.

    assert isinstance(source, (list, tuple, np.ndarray, pd.Series))
    source_vc: pd.Series = pd.Series(Counter(source))
    # print(f'\nsource_vc:\n{source_vc}')
    # print(f'\ntarget_prob_dist:\n{target_prob_dist}')
    missing_source_vals: Set = set(target_prob_dist.index) - set(source_vc.index)
    if len(missing_source_vals) > 0:
        raise ValueError(f'Cannot sample; the following values are missing in the source: {missing_source_vals}')

    n: int = get_default(n, len(source))
    max_n_sample: pd.Series = (source_vc / target_prob_dist).apply(
        lambda max_n_sample_category: min(max_n_sample_category, n),
    )
    # print(f'\n\nmax_n_sample:\n{max_n_sample}')
    max_n_sample: int = math.floor(min(max_n_sample.dropna()))
    # print(f'Max possible sample size: {max_n_sample}')
    source_value_wise_count_to_sample: pd.Series = (target_prob_dist * max_n_sample).round(0).astype(int)
    source_value_wise_count_to_sample: Dict[Any, int] = source_value_wise_count_to_sample.to_dict()
    ## Select random indexes:
    source_val_idxs: Dict[Any, List[int]] = {val: [] for val in source_vc.index}
    for idx, val in enumerate(source):
        if val in source_value_wise_count_to_sample:
            source_val_idxs[val].append(idx)
    sampled_idxs: np.ndarray = np.array(flatten1d([
        random_sample(source_val_idxs[val], n=req_source_val_count, seed=seed)
        for val, req_source_val_count in source_value_wise_count_to_sample.items()
    ]))
    if shuffle:
        sampled_idxs: np.ndarray = np.random.RandomState(seed).permutation(sampled_idxs)
    return sampled_idxs
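The function above calls three helpers that the answer does not define: get_default, flatten1d and random_sample. Their exact behavior is not shown, so the following is only a sketch of what they presumably do, inferred from how they are called; these implementations are assumptions.

```python
import random
from typing import Any, List, Optional, Sequence


def get_default(value: Optional[Any], default: Any) -> Any:
    """Return `value` unless it is None, in which case return `default`."""
    return default if value is None else value


def flatten1d(nested: Sequence[Sequence[Any]]) -> List[Any]:
    """Flatten one level of nesting: [[1, 2], [3]] -> [1, 2, 3]."""
    return [item for inner in nested for item in inner]


def random_sample(items: List[Any], n: int, seed: Optional[int] = None) -> List[Any]:
    """Draw `n` distinct items without replacement (assumed behavior)."""
    # A dedicated Random instance keeps the draw reproducible for a given seed.
    return random.Random(seed).sample(items, n)
```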

Usage: taking the largest sample-size possible:

For example:

bias_dist = pd.Series({
    "least": 0.277220,
    "left": 0.250000,
    "right": 0.250000,
    "left-center": 0.141244,
    "right-center": 0.081536,
})
source = pd.Series(flatten1d([
    ['least'] * 500, 
    ['left']*300, 
    ['right']*100, 
    ['left-center']*200, 
    ['right-center']*1000,
]))

idxs = sample_idxs_match_distribution(
    source,
    target=bias_dist,
    target_is_dist=True,
)
matched_source = source.iloc[idxs]
print(matched_source.value_counts(normalize=False))
print()
print(matched_source.value_counts(normalize=True))

Output:

least           111
left            100
right           100
left-center      56
right-center     33
dtype: int64

least           0.2775
left            0.2500
right           0.2500
left-center     0.1400
right-center    0.0825
dtype: float64

Usage: Restricting to a certain sample-size:

If you additionally pass n, you can restrict to a certain sample-size:

idxs = sample_idxs_match_distribution(
    source,
    target=bias_dist,
    target_is_dist=True,
    n=100,
)
matched_source = source.iloc[idxs]
print(matched_source.value_counts(normalize=False))
print()
print(matched_source.value_counts(normalize=True))

Output:

least           28
right           25
left            25
left-center     14
right-center     8
dtype: int64

least           0.28
right           0.25
left            0.25
left-center     0.14
right-center    0.08
dtype: float64
