3

I have files (A, B, C, etc.), each with 12,000 data points. I divided each file into batches of 1000 points and computed a value for each batch, so each file now has 12 values, which are loaded into a pandas DataFrame (shown below).

    file    value_1     value_2
0   A           1           43
1   A           1           89
2   A           1           22
3   A           1           87
4   A           1           43
5   A           1           89
6   A           1           22
7   A           1           87
8   A           1           43
9   A           1           89
10  A           1           22
11  A           1           87
12  A           1           83
13  B           0           99
14  B           0           23
15  B           0           29
16  B           0           34
17  B           0           99
18  B           0           23
19  B           0           29
20  B           0           34
21  B           0           99
22  B           0           23
23  B           0           29
24  B           0           34
25  C           1           62
-   -           -           -
-   -           -           -

As the next step, I need to randomly select a file and then, for that file, randomly select a sequence of 4 batches for value_1. The latter, I believe, can be done with df.sample(), but I'm not sure how to randomly select the file. I tried np.random.choice(data['file'].unique()), but it doesn't seem correct.

Thanks for the help in advance. I'm pretty new to pandas and python in general.

5
  • Your files are a list of dataframes? Commented Sep 3, 2017 at 22:25
  • My original files are ASCII (.mat) files. I extract the values from the batches and save them to a pandas DataFrame similar to the one above. Commented Sep 3, 2017 at 22:27
  • Try data[data.file == np.random.choice(data['file'].unique())].sample(n=4). If that does not get you the desired output, then edit the question to add your expected output. Commented Sep 3, 2017 at 22:32
  • Thanks. But the desired output should be a random sequence of 4 batches. The starting point for the sequence will be random, but the values will be 4 consecutive batches. I'll update this on the question, thanks for the suggestion. Commented Sep 3, 2017 at 22:41
  • Store all the DataFrames in a list, draw a random number, and use it to select a DataFrame. Commented Sep 4, 2017 at 2:56
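
The suggestion in the last comment could be sketched roughly as follows, assuming the batches already sit in a single DataFrame shaped like the one in the question. The name `frames` and the toy values are illustrative assumptions, not part of the original post.

```python
import numpy as np
import pandas as pd

# Toy data shaped like the question's DataFrame (values are made up).
data = pd.DataFrame({
    "file": np.repeat(["A", "B", "C"], 12),
    "value_1": np.repeat([1, 0, 1], 12),
})

# One DataFrame per file, each with a fresh 0..11 index.
frames = [group.reset_index(drop=True) for _, group in data.groupby("file")]

# Draw a random list position, then take that file's DataFrame.
rng = np.random.default_rng()
chosen = frames[rng.integers(len(frames))]
print(chosen["file"].iloc[0], len(chosen))
```

Resetting the index on each piece means later label-based slicing starts from 0 for every file.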

2 Answers

5

If I understand what you are trying to get at, the following should be of help:

# Test dataframe
import numpy as np
import pandas as pd


data = pd.DataFrame({'file': np.repeat(['A', 'B', 'C'], 12),
                     'value_1': np.repeat([1,0,1],12),
                     'value_2': np.random.randint(20, 100, 36)})
# Select a file
data1 = data[data.file == np.random.choice(data['file'].unique())].reset_index(drop=True)

# Get a random index from data1
start_ix = np.random.choice(data1.index[:-3])

# Get a sequence starting at the random index from the previous step
print(data1.loc[start_ix:start_ix+3])
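
As a variant of the snippet above, here is a minimal sketch that slices by position with `.iloc`, so it does not rely on the index having been reset. The function name `random_window`, its parameters, and the toy data are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd


def random_window(df, n=4, rng=None):
    """Return n consecutive rows from one randomly chosen file."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick one file name at random.
    name = rng.choice(df["file"].unique().tolist())
    rows = df[df["file"] == name]
    if len(rows) < n:
        raise ValueError("file has fewer rows than the window length")
    # Valid start positions are 0 .. len(rows) - n inclusive.
    start = rng.integers(0, len(rows) - n + 1)
    # .iloc slices by position, so the original index labels do not matter.
    return rows.iloc[start:start + n]


data = pd.DataFrame({
    "file": np.repeat(["A", "B", "C"], 12),
    "value_1": np.repeat([1, 0, 1], 12),
    "value_2": np.arange(36),
})
window = random_window(data)
print(window)
```

Because the slice is positional, the returned rows keep their original index labels, which can be handy for tracing them back to the full DataFrame.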

4 Comments

This is exactly what I need. One small problem is that I'm getting a KeyError for data.loc[start_ix:start_ix+3].
@RnK, what is the value of start_ix when you encounter a KeyError exception. I just tested this 2000 times on a sample dataframe similar to the one in your question but I am getting no KeyError exception.
@RnK try resetting the index of data1 before using .loc. If this does not work, please make sure to add the data you're working with in your question. I added the dataframe I am working with.
Sorry it was my mistake as I did not reset the index. Thanks for the help.
3

Here's a rather long-winded answer that has a lot of flexibility and uses some random data I generated. I also added a field to the dataframe to denote whether a row has been used.

Generating Data

import pandas as pd
from string import ascii_lowercase
import random

random.seed(44)

files = [ascii_lowercase[i] for i in range(4)]
value_1 = random.sample(range(1, 10), 8)

files_df = files*len(value_1)
value_1_df = value_1*len(files)
value_1_df.sort()
value_2_df = random.sample(range(100, 200), len(files_df))

df = pd.DataFrame({'file': files_df,
                   'value_1': value_1_df,
                   'value_2': value_2_df,
                   'used': 0})

Randomly Selecting Files

len_to_run = 3  # change to run for however long you'd like
batch_to_pull = 4

for i in range(len_to_run):  # not needed if you only want to run once
    # Files that still have unused rows (recomputed on every pass)
    updated_files = df.loc[df.used == 0, 'file'].unique().tolist()
    file_to_pull = random.choice(updated_files)
    print('file ' + file_to_pull)
    for j in range(batch_to_pull):  # pulling 4 values
        unused = (df.used == 0) & (df.file == file_to_pull)
        updated_value_1 = df.loc[unused, 'value_1'].unique().tolist()
        value_1_to_pull = random.sample(updated_value_1, 1)
        print('value_1 ' + str(value_1_to_pull))
        # Mark the pulled row as used
        df.loc[(df.file == file_to_pull) & (df.value_1 == value_1_to_pull[0]), 'used'] = 1

file a
value_1 [1]
value_1 [7]
value_1 [5]
value_1 [4]
file d
value_1 [3]
value_1 [2]
value_1 [1]
value_1 [5]
file d
value_1 [7]
value_1 [4]
value_1 [6]
value_1 [9]
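
If this logic is later moved into a standalone function, one possible shape is sketched below. It is only an illustration: the name `pull_unused`, its parameters, and the toy DataFrame are assumptions, and it marks rows by index label rather than by `value_1`, which is equivalent only when each (file, value_1) pair is unique, as in the generated data.

```python
import random
import pandas as pd


def pull_unused(df, k=4, rng=random):
    """Pick a file that still has unused rows and mark up to k of them used."""
    # Files that still have at least one unused row.
    files = df.loc[df["used"] == 0, "file"].unique().tolist()
    if not files:
        return None, []
    name = rng.choice(files)
    # Index labels of that file's unused rows.
    labels = df.index[(df["used"] == 0) & (df["file"] == name)].tolist()
    picked = rng.sample(labels, min(k, len(labels)))
    df.loc[picked, "used"] = 1
    return name, df.loc[picked, "value_1"].tolist()


# Toy data: 4 files, 8 values each, every (file, value_1) pair unique.
df = pd.DataFrame({
    "file": list("abcd") * 8,
    "value_1": sorted(list(range(1, 9)) * 4),
    "value_2": range(100, 132),
    "used": 0,
})
name, values = pull_unused(df)
print("file", name, "value_1", values)
```

Calling the function repeatedly keeps drawing from whatever is still unused and returns `(None, [])` once every row has been consumed.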

1 Comment

Thanks for the help. This will be useful when I write a standalone function later on.
