Say I have a very large dataframe, which I want to sample to match the distribution of a column of the dataframe as closely as possible (in this case, the 'bias' column).

I run:

train['bias'].value_counts(normalize=True)

and see:

least           0.277220
left            0.250000
right           0.250000
left-center     0.141244
right-center    0.081536

If I want to take a sample of the train dataframe where the distribution of the sample's 'bias' column matches this distribution, what would be the best way to go about it?

3 Answers


You can use sample; from the documentation:

Return a random sample of items from an axis of object.

The trick is to use sample in each group, a code example:

import pandas as pd

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000, "left-center": 0.141244, "right-center": 0.081536}
data = [['title-{}-{}'.format(i, position), position] for i in range(1000) for position in positions.keys()]
frame = pd.DataFrame(data=data, columns=['title', 'position'])
print(frame.shape)


def sample(obj, replace=False, total=1000):
    return obj.sample(n=int(positions[obj.name] * total), replace=replace)

result = frame.groupby('position', as_index=False).apply(sample).reset_index(drop=True)
print(result.groupby('position').agg('count'))

Output

(5000, 2)
              title
position           
least           277
left            250
left-center     141
right           250
right-center     81

In the above example I created a dataframe with 5000 rows and 2 columns; its shape is the first part of the output.

I am assuming you have a positions dictionary (to convert a DataFrame to a dictionary, see this) with the percentage to be sampled from each group, and a total parameter (i.e. the total number of rows to sample).

In the second part of the output you can see there are 277 least rows out of 1000, and 277 / 1000 = 0.277. That is an approximation of the required distribution, and the same goes for the rest of the groups. There is a caveat, though: the count of the samples is 999 instead of the intended 1000, because int(positions[obj.name] * total) truncates each group's fractional count downward.
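If your pandas version is 1.1 or newer, DataFrameGroupBy.sample with frac sidesteps the per-group count arithmetic entirely: the same fraction is drawn from every group, so the sample's distribution matches the original's by construction (up to rounding). A minimal sketch, using a hypothetical frame whose counts mirror the question's distribution:

```python
import pandas as pd

# Hypothetical frame whose 'position' counts mirror the question's distribution.
counts = {"least": 277, "left": 250, "right": 250,
          "left-center": 141, "right-center": 82}
frame = pd.DataFrame(
    [[f"title-{i}-{pos}", pos] for pos, cnt in counts.items() for i in range(cnt)],
    columns=["title", "position"],
)

# frac samples the same fraction from every group, so each group's share of
# the sample equals its share of the original frame (up to rounding).
sample = frame.groupby("position", group_keys=False).sample(frac=0.5, random_state=0)
print(sample["position"].value_counts(normalize=True))
```

The trade-off: frac fixes the sampling rate rather than the total sample size, so you get "about half the rows" instead of "exactly 1000 rows".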


9 Comments

Thank you. I figured you would use pd.sample, but I was having difficulty figuring out the form weights wanted as input.
On second thought, this doesn't seem to be working. My data has many observations, and the least, left, right probabilities are derived from taking the value counts of my data's bias column and normalizing it. I would like to sample my original dataframe so that the sample contains approximately 27.72% least observations, 25% right observations, etc
Why doesn't it seem to be working? Could you be more specific?
If I'm not mistaken, your code seems to be sampling your constructed 'frame', which only contains the position and biases column. My data consists of many more observations, which all have an associated bias value. I want to sample this dataframe so the sample contains distribution of bias values similar to the original dataframe.
In the example above, frame is to be considered a replacement for your original dataframe. Could you provide an example of your original dataframe? Note that sample can be applied to your original dataframe.
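On the weights question raised in the comments above: a non-stratified sketch (an assumption on my part, not part of the answer) is to weight each row by its category's target probability divided by that category's count. The expected share of each category in the sample then equals its target probability, though the match is only approximate because of sampling noise:

```python
import numpy as np
import pandas as pd

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000,
             "left-center": 0.141244, "right-center": 0.081536}

# Hypothetical df whose 'bias' distribution differs from the target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"bias": rng.choice(list(positions), size=100_000,
                                      p=[0.4, 0.2, 0.1, 0.1, 0.2])})

# Weight each row by target_prob / category_count; in expectation, each
# category's share of the sample equals its target probability.
weights = df["bias"].map(pd.Series(positions) / df["bias"].value_counts())
sample = df.sample(n=1000, weights=weights, random_state=0)
print(sample["bias"].value_counts(normalize=True))
```

Unlike the grouped approach, this does not guarantee exact per-category counts; it is only correct in expectation.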

Here is a one-liner to sample based on a distribution:

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000, "left-center": 0.141244, "right-center": 0.081536}
total = len(df)

df = pd.concat([df[df['position'] == k].sample(int(v * total), replace=False) for k, v in positions.items()])
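A quick self-contained check of this pattern (the df below is hypothetical; note that total must be small enough that every group holds at least int(v * total) rows, since sampling is without replacement):

```python
import numpy as np
import pandas as pd

positions = {"least": 0.277220, "left": 0.250000, "right": 0.250000,
             "left-center": 0.141244, "right-center": 0.081536}

# Hypothetical df whose 'position' distribution differs from the target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"position": rng.choice(list(positions), size=10_000,
                                          p=[0.4, 0.2, 0.1, 0.1, 0.2])})

# total must satisfy int(v * total) <= group size for every group, because
# each per-group sample is drawn without replacement.
total = 2000
sample = pd.concat([
    df[df["position"] == k].sample(int(v * total), replace=False)
    for k, v in positions.items()
])
print(sample["position"].value_counts(normalize=True))
```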

Comments


Target-matched sampling:

A general version of this problem is: "I have a source column, which I want to sample such that it matches the distribution of the target column", where both columns are discrete i.e. categories, not floating-point numbers.

The solution mentioned above requires grouping and sampling from each group.

Here's an alternative, which directly outputs the sampled indexes, and also allows specifying the total size of the resulting sample:

import math
from collections import Counter
from typing import Any, Dict, List, Optional, Set, Tuple, Union

import numpy as np
import pandas as pd


## Small helpers assumed by the functions below:
def get_default(x, default):
    return x if x is not None else default


def flatten1d(list_of_lists: List[List]) -> List:
    return [item for sublist in list_of_lists for item in sublist]


def random_sample(vals: List, n: int, seed: Optional[int] = None) -> List:
    return list(np.random.RandomState(seed).choice(vals, size=n, replace=False))


def values_dist(vals: Union[List, Tuple, np.ndarray, pd.Series]) -> pd.Series:
    assert isinstance(vals, (list, tuple, np.ndarray, pd.Series))
    val_counts: pd.Series = pd.Series(Counter(vals))  ## Includes nan and None as keys.
    return val_counts / val_counts.sum()


def sample_idxs_match_distribution(
        source: Union[List, Tuple, np.ndarray, pd.Series],
        target: Union[List, Tuple, np.ndarray, pd.Series],
        n: Optional[int] = None,
        seed: Optional[int] = None,
        shuffle: bool = True,
        target_is_dist: bool = False,
) -> np.ndarray:
    """
    Sample values from the source based on the target's distribution, and return
    randomly-shuffled indexes from the source. Selecting these indexes gives a
    subset of the source whose distribution matches that of the target.
    """
    if not target_is_dist:
        target_prob_dist: pd.Series = values_dist(target)
    else:
        target_prob_dist: pd.Series = target
    assert isinstance(target_prob_dist, pd.Series)
    assert abs(float(target_prob_dist.sum()) - 1.0) <= 1e-2  ## Sum of probs should be exactly or very close to 1.

    assert isinstance(source, (list, tuple, np.ndarray, pd.Series))
    source_vc: pd.Series = pd.Series(Counter(source))
    missing_source_vals: Set = set(target_prob_dist.index) - set(source_vc.index)
    if len(missing_source_vals) > 0:
        raise ValueError(f'Cannot sample; the following values are missing in the source: {missing_source_vals}')

    n: int = get_default(n, len(source))
    ## For each category, source_vc / target_prob_dist is the largest total
    ## sample size that category can support without replacement:
    max_n_sample: pd.Series = (source_vc / target_prob_dist).apply(
        lambda max_n_sample_category: min(max_n_sample_category, n),
    )
    max_n_sample: int = math.floor(min(max_n_sample.dropna()))
    source_value_wise_count_to_sample: pd.Series = (target_prob_dist * max_n_sample).round(0).astype(int)
    source_value_wise_count_to_sample: Dict[Any, int] = source_value_wise_count_to_sample.to_dict()
    ## Select random indexes:
    source_val_idxs: Dict[Any, List[int]] = {val: [] for val in source_vc.index}
    for idx, val in enumerate(source):
        if val in source_value_wise_count_to_sample:
            source_val_idxs[val].append(idx)
    sampled_idxs: np.ndarray = np.array(flatten1d([
        random_sample(source_val_idxs[val], n=req_source_val_count, seed=seed)
        for val, req_source_val_count in source_value_wise_count_to_sample.items()
    ]))
    if shuffle:
        sampled_idxs: np.ndarray = np.random.RandomState(seed).permutation(sampled_idxs)
    return sampled_idxs

Usage: taking the largest sample-size possible:

For example:

bias_dist = pd.Series({
    "least": 0.277220,
    "left": 0.250000,
    "right": 0.250000,
    "left-center": 0.141244,
    "right-center": 0.081536,
})
source = pd.Series(flatten1d([
    ['least'] * 500, 
    ['left']*300, 
    ['right']*100, 
    ['left-center']*200, 
    ['right-center']*1000,
]))

idxs = sample_idxs_match_distribution(
    source,
    target=bias_dist,
    target_is_dist=True,
)
matched_source = source.iloc[idxs]
print(matched_source.value_counts(normalize=False))
print()
print(matched_source.value_counts(normalize=True))

Output:

least           111
left            100
right           100
left-center      56
right-center     33
dtype: int64

least           0.2775
left            0.2500
right           0.2500
left-center     0.1400
right-center    0.0825
dtype: float64

Usage: Restricting to a certain sample-size:

If you additionally pass n, you can restrict to a certain sample-size:

idxs = sample_idxs_match_distribution(
    source,
    target=bias_dist,
    target_is_dist=True,
    n=100,
)
matched_source = source.iloc[idxs]
print(matched_source.value_counts(normalize=False))
print()
print(matched_source.value_counts(normalize=True))

Output:

least           28
right           25
left            25
left-center     14
right-center     8
dtype: int64

least           0.28
right           0.25
left            0.25
left-center     0.14
right-center    0.08
dtype: float64

Comments
