Random sampling pandas based on column values

Question

I have files (A, B, C, etc.), each with 12,000 data points. I divided each file into batches of 1,000 points and computed a value for each batch, so each file now has 12 values, which are loaded into a pandas DataFrame (shown below).

    file    value_1     value_2
0   A           1           43
1   A           1           89
2   A           1           22
3   A           1           87
4   A           1           43
5   A           1           89
6   A           1           22
7   A           1           87
8   A           1           43
9   A           1           89
10  A           1           22
11  A           1           87
12  A           1           83
13  B           0           99
14  B           0           23
15  B           0           29
16  B           0           34
17  B           0           99
18  B           0           23
19  B           0           29
20  B           0           34
21  B           0           99
22  B           0           23
23  B           0           29
24  B           0           34
25  C           1           62
-   -           -           -
-   -           -           -

As the next step, I need to randomly select a file and, for that file, randomly select a sequence of 4 batches for value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select the file. I tried to make it work with np.random.choice(data['file'].unique()), but that doesn't seem correct.

Thanks for the help in advance. I'm pretty new to pandas and Python in general.

My original files are ASCII (.mat) files. I extract the values from the batches and save them to a pandas DataFrame similar to the one above.
– RnK, Commented Sep 3, 2017 at 22:27
Try data[data.file == np.random.choice(data['file'].unique())].sample(n=4). If that does not get you the desired output, then edit the question to add your expected output.
– Abdou, Commented Sep 3, 2017 at 22:32
Thanks. But the desired output should be a random sequence of 4 batches. The starting point of the sequence will be random, but the values will be 4 consecutive batches. I'll update the question with this; thanks for the suggestion.
– RnK, Commented Sep 3, 2017 at 22:41
Store all the DataFrames in a list, draw a random number, and use it to select a DataFrame.
– BENY, Commented Sep 4, 2017 at 2:56
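BENY's suggestion could look roughly like the sketch below. It assumes the per-file frames are built by splitting one combined frame with groupby; the toy data, the `frames` list, and the use of `np.random.default_rng` are illustrative choices, not part of the original comment.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question's frame (values are made up).
data = pd.DataFrame({
    "file": np.repeat(["A", "B", "C"], 12),
    "value_1": np.repeat([1, 0, 1], 12),
    "value_2": np.arange(36),
})

# One sub-frame per file, stored in a list.
frames = [group.reset_index(drop=True) for _, group in data.groupby("file")]

# A random position in the list selects the file.
rng = np.random.default_rng()
chosen = frames[rng.integers(len(frames))]
print(chosen["file"].iloc[0], len(chosen))
```

Every file has the same chance of being picked, regardless of how many rows it has.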

Abdou · Accepted Answer · 2017-09-04 13:12:33Z

5

If I understand what you are trying to get at, the following should be of help:

# Test dataframe
import numpy as np
import pandas as pd


data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                     'value_1': np.repeat([1,0,1],12),
                     'value_2': np.random.randint(20, 100, 36)})
# Select a file
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)

# Get a random index from data1
start_ix = np.random.choice(data1.index[:-3])

# Get a sequence of 4 consecutive rows starting at the random index.
# Slice data1 (not data): start_ix is a label in data1's reset index.
print(data1.loc[start_ix:start_ix+3])

edited Sep 4, 2017 at 13:12

answered Sep 3, 2017 at 22:59

Abdou


4 Comments

RnK Over a year ago

This is exactly what I need. One small problem is that I'm getting a KeyError for data.loc[start_ix:start_ix+3].

Abdou Over a year ago

@RnK, what is the value of start_ix when you encounter the KeyError? I just tested this 2000 times on a sample dataframe similar to the one in your question and got no KeyError.

Abdou Over a year ago

@RnK try resetting the index of data1 before using .loc. If this does not work, please make sure to add the data you're working with in your question. I added the dataframe I am working with.

RnK Over a year ago

Sorry, it was my mistake: I did not reset the index. Thanks for the help.
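One way to avoid depending on index labels at all is to slice by position with .iloc, which works whether or not the index has been reset. This is a minimal sketch on made-up data, not the code from the answer; the names `subset`, `n`, `start`, and `window` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question's frame (values are made up).
data = pd.DataFrame({
    "file": np.repeat(["A", "B", "C"], 12),
    "value_1": np.repeat([1, 0, 1], 12),
    "value_2": np.arange(36),
})

rng = np.random.default_rng()
chosen_file = rng.choice(data["file"].unique())
subset = data[data["file"] == chosen_file]  # index labels left untouched

n = 4
# Valid start positions are 0 .. len(subset) - n, inclusive
# (the upper bound of rng.integers is exclusive).
start = rng.integers(0, len(subset) - n + 1)
window = subset.iloc[start:start + n]  # positional slice, end exclusive
print(window)
```

Note that .iloc slices exclude the end position, whereas .loc slices include the end label, which is why the answer uses start_ix+3 and this sketch uses start + n.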

Jason · Accepted Answer · 2017-09-03 23:10:51Z

Here's a rather long-winded answer that has a lot of flexibility and uses some random data I generated. I also added a field to the dataframe to denote whether a row has been used.

Generating Data

import pandas as pd
from string import ascii_lowercase
import random

random.seed(44)

files = [ascii_lowercase[i] for i in range(4)]
value_1 = random.sample(range(1, 10), 8)

files_df = files*len(value_1)
value_1_df = value_1*len(files)
value_1_df.sort()
value_2_df = random.sample(range(100, 200), len(files_df))

df = pd.DataFrame({'file': files_df,
                   'value_1': value_1_df,
                   'value_2': value_2_df,
                   'used': 0})

Randomly Selecting Files

len_to_run = 3  # change to run for however long you'd like
batch_to_pull = 4
# Files that still have unused rows (as a plain list so random.choice works)
updated_files = df.loc[df.used == 0, 'file'].unique().tolist()

for i in range(len_to_run):  # not needed if you only want to run once
    file_to_pull = random.choice(updated_files)
    print('file ' + file_to_pull)
    for j in range(batch_to_pull):  # pulling 4 values
        # value_1 values of this file that have not been used yet
        updated_value_1 = df.loc[(df.used == 0) & (df.file == file_to_pull), 'value_1'].unique().tolist()
        value_1_to_pull = random.sample(updated_value_1, 1)  # one-element list
        print('value_1 ' + str(value_1_to_pull))
        # Flag the pulled rows so they are not drawn again
        df.loc[(df.file == file_to_pull) & (df.value_1 == value_1_to_pull[0]), 'used'] = 1

file a
value_1 [1]
value_1 [7]
value_1 [5]
value_1 [4]
file d
value_1 [3]
value_1 [2]
value_1 [1]
value_1 [5]
file d
value_1 [7]
value_1 [4]
value_1 [6]
value_1 [9]

Thanks for the help. This will be useful when I write a standalone function later on.
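A standalone function along those lines might look like the sketch below. It is only one possible shape, assuming a frame with the same 'file', 'value_1', and 'used' columns as in the answer; the function name `pull_batches` and its parameters `k` and `rng` are hypothetical.

```python
import random

import pandas as pd


def pull_batches(df, k=4, rng=random):
    """Pick a random file that still has unused rows, draw up to k distinct
    unused value_1 values from it, and flag the matching rows as used
    (the frame is modified in place)."""
    files = df.loc[df["used"] == 0, "file"].unique().tolist()
    if not files:
        raise ValueError("no unused rows left")
    file_to_pull = rng.choice(files)
    mask = (df["used"] == 0) & (df["file"] == file_to_pull)
    candidates = df.loc[mask, "value_1"].unique().tolist()
    # Draw without replacement; fewer than k if the file is nearly exhausted.
    picked = rng.sample(candidates, min(k, len(candidates)))
    df.loc[(df["file"] == file_to_pull) & df["value_1"].isin(picked), "used"] = 1
    return file_to_pull, picked


# Small made-up frame: two files, six value_1 values each.
df = pd.DataFrame({
    "file": ["a", "b"] * 6,
    "value_1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "used": 0,
})
file_name, values = pull_batches(df, k=4, rng=random.Random(0))
print(file_name, values)
```

Passing a `random.Random` instance keeps runs reproducible without touching the global seed. Drawing all k values in one `sample` call replaces the inner loop of the answer.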
