How do I create test and train samples from one dataframe with pandas?

Question

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

o-90 · Accepted Answer · 2022-02-14 16:50:16Z

991

Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
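As a minimal sketch of the call above on a toy frame (the column names and `random_state=42` are illustrative assumptions, not part of the original answer), passing `random_state` should make the split reproducible, and both outputs stay pandas DataFrames:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows with hypothetical column names.
df = pd.DataFrame({"feature": np.arange(100), "label": np.arange(100) % 2})

# 80/20 split; random_state fixes the shuffle so reruns select the same rows.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(type(train).__name__, len(train), len(test))
```

The original row labels are kept in both parts, so `train.index` and `test.index` do not overlap.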

edited Feb 14, 2022 at 16:50

answered Jun 10, 2014 at 22:19

o-90



7 Comments

Bar Over a year ago

This will return numpy arrays and not Pandas Dataframes however

Julien Marrec Over a year ago

Btw, it does return a Pandas Dataframe now (just tested on Sklearn 0.16.1)

Mark Over a year ago

In new versions (0.18, maybe earlier), import as from sklearn.model_selection import train_test_split instead.

horseshoe Over a year ago

In the newest SciKit version you need to call it now as: from sklearn.cross_validation import train_test_split

Kingz Over a year ago

@horseshoe the cv module is deprecated:

DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.


Andy Hayden · Accepted Answer · 2014-06-11 00:30:42Z

488

I would just use numpy's rand to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79

edited Jun 11, 2014 at 0:30

answered Jun 10, 2014 at 17:29

Andy Hayden


16 Comments

unutbu Over a year ago

Sorry, my mistake. As long as msk is of dtype bool, df[msk], df.iloc[msk] and df.loc[msk] always return the same result.

R. Max Over a year ago

I think you should use rand for the < 0.8 comparison to make sense, because it returns uniformly distributed random numbers between 0 and 1.

kuatroka Over a year ago

Can someone explain purely in python terms what exactly happens in lines in[12], in[13], in[14]? I want to understand the python code itself here

So S Over a year ago

The answer using sklearn from gobrewers14 is the better one. It's less complex and easier to debug. I recommend using the answer below.

Kentzo Over a year ago

@kuatroka np.random.rand(len(df)) is an array of size len(df) with uniformly distributed random float values in the range [0, 1). The < 0.8 comparison is applied element-wise and produces a new boolean array: values < 0.8 become True and values >= 0.8 become False.
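The mask steps can be spelled out one at a time. This is a sketch on a 10-row toy frame; the seeded `default_rng` generator is an assumption used here for repeatability, whereas the answer calls `np.random.rand` directly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)      # seeded generator (assumption, for repeatability)
df = pd.DataFrame({"x": range(10)})

draws = rng.random(len(df))         # one uniform float in [0, 1) per row
msk = draws < 0.8                   # boolean array, True with probability 0.8

train = df[msk]                     # rows where the mask is True
test = df[~msk]                     # ~ flips the mask: rows where it is False

# The same selection written as a plain Python loop over row positions.
train_rows = [i for i, keep in enumerate(msk) if keep]
```

Because each row is assigned independently, the resulting sizes are only approximately 80% and 20%.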


RajV · Accepted Answer · 2022-11-17 20:40:37Z

448

Pandas random sample will also work

train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
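A small sketch of that repeatability, using a toy 50-row frame (the data and variable names are assumptions for illustration): two calls with the same `random_state` select the same rows, and dropping the sampled index leaves the complementary test set. Note that `drop(train.index)` relies on the index labels being unique.

```python
import pandas as pd

df = pd.DataFrame({"x": range(50)})

# Same seed, same sampled rows.
train_a = df.sample(frac=0.8, random_state=200)
train_b = df.sample(frac=0.8, random_state=200)

# Everything that was not sampled becomes the test set.
test_a = df.drop(train_a.index)

print(train_a.index.equals(train_b.index))
print(len(train_a), len(test_a))
```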

edited Nov 17, 2022 at 20:40

RajV


answered Feb 21, 2016 at 1:28

PagMax


10 Comments

Rishabh Agrahari Over a year ago

what is random_state arg doing?

MikeL Over a year ago

@RishabhAgrahari Without random_state, sample draws a different random split on every call, sized according to the frac arg. If you want to control the randomness you can pass your own seed, as in the example.

RajV Over a year ago

This seems to work well and a more elegant solution than bringing in sklearn. Is there a reason why this shouldn't be a better accepted answer?

peer Over a year ago

@RajV in its current form test will be randomly selected but rows will be in their original order. The sklearn approach shuffles both train and test.

Alok Lal Over a year ago

@peer that limitation is easily remedied if a shuffled test set is desired as pointed out here stackoverflow.com/questions/29576430/shuffle-dataframe-rows. test=df.drop(train.index).sample(frac=1.0)


Sudeepa Nadeeshan · Accepted Answer · 2020-04-20 22:31:00Z

42

I would use scikit-learn's own train_test_split, and generate the split from the index

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.loc[X_train] # X_train holds index labels, so .loc returns the training dataframe

edited Apr 20, 2020 at 22:31

Sudeepa Nadeeshan


answered May 26, 2015 at 9:33

Napitupulu Jon


2 Comments

Harry Over a year ago

The cross_validation module is now deprecated:

DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.

Sam OT Over a year ago

This gives an error when I do it with a df whose output column is strings. I get TypeError: '<' not supported between instances of 'str' and 'float'. It appears that y needs to be a DataFrame not a Series. Indeed, appending .to_frame() either the definition of y or the argument y in train_test_split works. If you're using stratify = y, you need to make sure that this y is a DataFrame too. If I instead define y = df[["output"]] and X = df.drop("output", axis = 1) then it works too; this is basically the same as appending .to_frame() to the definition of y.

Nosey · Accepted Answer · 2021-01-05 13:37:18Z

34

No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

And if you want to split the whole df

X, y = df[list_of_x_cols], df[y_col]

edited Jan 5, 2021 at 13:37

answered Jun 6, 2020 at 14:47

Nosey


double-beep · Accepted Answer · 2019-03-31 15:18:14Z

There are many ways to create train/test and even validation samples.

Case 1: classic way train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

Case 2: very small datasets (<500 rows): use cross-validation to get predictions for all your rows. At the end, you will have one prediction for each row of your available training set.

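from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
# X and y are assumed to be numpy arrays; for pandas objects use X.iloc[train_index]
kf = KFold(n_splits=10, shuffle=True, random_state=0)  # random_state only takes effect with shuffle=True
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)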
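If the goal is just one out-of-fold prediction per row, scikit-learn's `cross_val_predict` appears to cover the same loop in a single call. The sketch below uses synthetic data and a seeded generator, both of which are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data: 60 rows, 2 features (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
reg = RandomForestRegressor(n_estimators=50, random_state=0)

# Each row is predicted by the model trained on the folds that exclude it;
# the result is returned in the original row order.
y_hat = cross_val_predict(reg, X, y, cv=kf)
```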

Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)
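To see what `stratify=y` does, here is a sketch on deliberately imbalanced toy labels (90 zeros and 10 ones; the data and `random_state=0` are assumptions). Both parts should keep roughly the original 10% share of the minority class:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 rows of class 0, 10 rows of class 1.
X = pd.DataFrame({"f": np.arange(100)})
y = pd.Series([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=0
)

# Share of the minority class in each part.
print(y_train.mean(), y_test.mean())
```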

Case 3b: Unbalanced datasets for classification purpose. Following the case 2, here is the equivalent solution:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import StratifiedKFold
# X and y are assumed to be numpy arrays; for pandas objects use X.iloc[train_index]
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # random_state only takes effect with shuffle=True
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = reg.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 4: you need to create train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).

from sklearn.model_selection import train_test_split
# hold out 40% of the rows, leaving 60% for training
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
# split the held-out 40% in half: 20% test, 20% validation
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)
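One way to sanity-check a two-stage split that targets 60% train, 20% test and 20% validation is to print the resulting sizes. The sketch below uses 1000 synthetic rows; the data, the `_rest` variable names and `random_state=0` are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"f": np.arange(1000)})
y = pd.Series(np.arange(1000) % 2)

# First cut: 60% train, 40% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Second cut: halve the held-out part into test and validation,
# stratifying on the labels of that held-out part.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_test), len(X_val))
```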

user1775015 · Accepted Answer · 2018-03-18 14:29:10Z

16

You can use the code below to create test and train samples:

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

Test size can vary depending on the percentage of data you want to put in your test and train dataset.

answered Mar 18, 2018 at 14:29

user1775015


Abhi · Accepted Answer · 2016-12-09 22:18:03Z

8

There are many valid answers. Adding one more to the bunch.

#gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
#gets the left out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]

answered Dec 9, 2016 at 22:18

Abhi


Apogentus · Accepted Answer · 2014-12-10 23:11:12Z

7

You may also consider stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.

answered Dec 10, 2014 at 23:11

Apogentus


2 Comments

vincentmajor Over a year ago

This is the preferable strategy for supervised learning tasks.

Markus W Over a year ago

When trying to use this I am getting an error. ValueError: assignment destination is read-only in the line "np.random.shuffle(value_inds)"

Pratik Deoolwadikar · Accepted Answer · 2020-01-26 11:54:43Z

6

You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

answered Jan 26, 2020 at 11:54

Pratik Deoolwadikar


Liran Orevi · Accepted Answer · 2017-08-17 08:23:27Z

5

Just select a range of rows from df like this:

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]

edited Aug 17, 2017 at 8:23

Liran Orevi


answered May 11, 2017 at 2:49

Makio


3 Comments

Emil L Over a year ago

This would only work if the data in the dataframe is already randomly ordered. If the dataset is derived from multiple sources and has been appended to the same dataframe then it's quite possible to get a very skewed dataset for training/testing using the above.

Makio Over a year ago

You can shuffle dataframe before split it stackoverflow.com/questions/29576430/shuffle-dataframe-rows

Emil L Over a year ago

Absolutely! If you add that df in your code snippet is (or should be) shuffled, it will improve the answer.
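The shuffle-then-slice idea from these comments can be sketched as follows. `sample(frac=1)` returns all rows in random order; the toy frame and `random_state=0` are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Shuffle all rows first so the slice below is a random subset.
shuffled = df.sample(frac=1, random_state=0)

# First fifth of the shuffled rows for testing, the rest for training.
split_point = int(len(shuffled) * 1 / 5)
test_data = shuffled.iloc[:split_point]
train_data = shuffled.iloc[split_point:]
```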

MikeL · Accepted Answer · 2017-11-15 09:50:21Z

If you need to split your data with respect to the labels column in your data set you can use this:

import pandas as pd

def split_to_train_test(df, label_column, train_frac=0.8):
    train_parts, test_parts = [], []
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)

    # DataFrame.append was removed in pandas 2.0; concatenate the per-label parts instead
    return pd.concat(train_parts), pd.concat(test_parts)

and use it:

train, test = split_to_train_test(data, 'class', 0.7)

you can also pass random_state if you want to control the split randomness or use some global random seed.
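On pandas 1.1 or later, a similar per-label split can probably be written with `DataFrameGroupBy.sample` instead of an explicit loop. This is a different technique from the function above, shown only as a sketch; the toy frame, its column names and `random_state=0` are assumptions:

```python
import pandas as pd

# Toy data: 15 rows of class "a" and 5 rows of class "b".
df = pd.DataFrame({"value": range(20), "class": ["a"] * 15 + ["b"] * 5})

# Sample 80% of the rows within each class separately.
train = df.groupby("class").sample(frac=0.8, random_state=0)

# The remaining rows of every class form the test set.
test = df.drop(train.index)

print(train["class"].value_counts().to_dict())
print(test["class"].value_counts().to_dict())
```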

AHonarmand · Accepted Answer · 2020-02-05 22:12:54Z

4

To split into more than two sets such as train, test, and validation, one can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validation_mask = probs >= 0.85


df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]

This will put approximately 70% of data in training, 15% in test, and 15% in validation.

edited Feb 5, 2020 at 22:12

answered Mar 14, 2018 at 17:43

AHonarmand


1 Comment

stason Over a year ago

You might want to edit your answer to add "approximately", if you run the code you will see that it can be quite off from the exact percentage. e.g. I tried it on 1000 items and got: 700, 141, 159 - so 70%, 14% and 16%.
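If exact set sizes matter, one option is to shuffle the row positions once and slice them at fixed cut points. This is a sketch, not the answer's method; the seeded generator and the 1000-row toy frame are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(1000)})

rng = np.random.default_rng(0)
order = rng.permutation(len(df))          # shuffled row positions

# Integer arithmetic avoids floating-point rounding in the cut points.
n_train = len(df) * 70 // 100
n_test = len(df) * 15 // 100

df_training = df.iloc[order[:n_train]]
df_test = df.iloc[order[n_train:n_train + n_test]]
df_validation = df.iloc[order[n_train + n_test:]]

print(len(df_training), len(df_test), len(df_validation))
```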

elyte5star · Accepted Answer · 2020-06-17 20:05:06Z

4

shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]

answered Jun 17, 2020 at 20:05

elyte5star


4 Comments

Perry Over a year ago

This would be a better answer if you explained how the code you provided answers the question.

shaunakde Over a year ago

While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.

elyte5star Over a year ago

the first line returns a shuffled range(with respect to the size of the dataframe).The second line represents the desired fraction of the test set.The third and forth line incorporates the fraction into the shuffled range.The rest lines should be self explanatory.Regards.

Sheece Gardazi Over a year ago

Adding this explanation to the answer itself will be optimal :)

Pardhu Gopalam · Accepted Answer · 2018-07-10 17:40:45Z

3

import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

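target_attribute = data['column_name']

# drop the target column so it is not also passed in as a feature
features = data.drop(columns=['column_name'])

# test_size=0.2 keeps 80% of the rows for training
X_train, X_test, y_train, y_test = train_test_split(features, target_attribute, test_size=0.2)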

edited Jul 10, 2018 at 17:40

answered Jul 9, 2018 at 9:36

Pardhu Gopalam


1 Comment

Anton Erjomin Over a year ago

You have a short mistake. You should drop target column before, you put it into train_test_split. data = data.drop(columns = ['column_name'], axis = 1)

Anarcho-Chossid · Accepted Answer · 2014-12-25 20:59:11Z

2

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    # draw an exact number of row positions for the test set
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = sorted(set(tot_ix) - set(test_ix))

    # .ix was removed from pandas; these are row positions, so use .iloc
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()

edited Dec 25, 2014 at 20:59

answered Dec 25, 2014 at 20:52

Anarcho-Chossid


biendltb · Accepted Answer · 2019-11-19 06:00:45Z

There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]

otto · Accepted Answer · 2021-06-17 13:24:27Z

2

If you want to split it into train, test and validation sets, you can use this function:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_proportion = val_length / len(temp.index)
    train, val = train_test_split(temp, test_size=new_val_proportion)
    return train, test, val

answered Jun 17, 2021 at 13:24

otto


umair mughal · Accepted Answer · 2022-11-03 05:16:06Z

2

The sample method selects a random part of the data; pass a seed value as random_state to make the selection reproducible.

train = df.sample(frac=0.8, random_state=42)

For the test set, you can drop the rows using the index of the train DF and then reset the index of the new DF.

test = df.drop(train.index).reset_index(drop=True)

edited Nov 3, 2022 at 5:16

answered Nov 2, 2022 at 6:31

umair mughal


4 Comments

Adriaan Over a year ago

Please read How to Answer and edit your answer to contain an explanation as to why this code would actually solve the problem at hand. Always remember that you're not only solving the problem, but are also educating the OP and any future readers of this post.

umair mughal Over a year ago

I think it's self explanatory. OP asked for splitting df into train and test, which these two variables represents. I'll still read the linked doc though. Thanks

Adriaan Over a year ago

The mere fact that the OP asked about this shows they don't have a complete understanding of Pandas, which on its own is enough to merit an explanation as to why this works.

chrslg Over a year ago

But that is a clone of an already existing and highly upvoted answer. Please, when answering to old questions, be sure to bring new information that was not present in previous answers (for example, because of technical changes since), and to explicitly make clear what is new.

Johnny V · Accepted Answer · 2015-07-19 21:29:26Z

1

If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

def split_data(df, train_perc = 0.8):

   df['train'] = np.random.rand(len(df)) < train_perc

   train = df[df.train == 1]

   test = df[df.train == 0]

   split_data ={'train': train, 'test': test}

   return split_data

answered Jul 19, 2015 at 21:29

Johnny V


Hakim · Accepted Answer · 2015-08-04 04:16:06Z

1

I think you also need to get a copy, not a slice, of the dataframe if you want to add columns later.

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)
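A short sketch of why the copy helps: a new column added to the copied training frame stays out of the original `df`. The toy data, the seeded generator and the `x_squared` column name are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
msk = np.random.default_rng(0).random(len(df)) < 0.8

# Independent copies, safe to modify without touching df.
train, test = df[msk].copy(), df[~msk].copy()

# The new column exists only on the training copy.
train["x_squared"] = train["x"] ** 2
```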

answered Aug 4, 2015 at 4:16

Hakim


kiran6 · Accepted Answer · 2015-11-27 08:50:52Z

1

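You can use the df.to_numpy() method (df.as_matrix() in older pandas, removed in 1.0) to create a NumPy array and pass it.

Y = df.pop('target')  # 'target' is a placeholder for your label column; pop() requires a column name
X = df.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.score(x_test, y_test)  # assuming a scikit-learn estimator, which exposes score/predict rather than test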

answered Nov 27, 2015 at 8:50

kiran6


thebeancounter · Accepted Answer · 2018-10-09 09:19:00Z

1

A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.

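def split_df(df, p=[0.8, 0.2]):
    import numpy as np
    # assign each row to part 0..len(p)-1 with the given probabilities
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    # iterate over range(len(p)) so the parts come back in the same order as p
    r = [df[df["rand"] == val] for val in range(len(p))]
    return r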

edited Oct 9, 2018 at 9:19

answered Oct 9, 2018 at 9:08

thebeancounter


Shaina Raza · Accepted Answer · 2020-03-30 20:27:44Z

1

you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe

import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)

edited Mar 30, 2020 at 20:27

answered Mar 30, 2020 at 16:57

Shaina Raza


3 Comments

Luvexina Over a year ago

Code-only answers aren't acceptable on Stack Overflow.

Nosey Over a year ago

Converting to numpy is not needed, and is not actually performed in this code.

schro Over a year ago

btw -- it does return a dataframe now!

Aaditya Ura · Accepted Answer · 2020-12-20 09:06:03Z

1

In my case, I wanted to split a data frame in Train, test and dev with a specific number. Here I am sharing my solution

First, assign a unique id to the dataframe (if one does not already exist)

import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split numbers:

train = 120765
test  = 4134
dev   = 2816

The split function

def df_split(df, n):
    
    first  = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace = True)
    second.reset_index(drop=True, inplace = True)
    return first, second

Now splitting into train, test, dev

train, test = df_split(df, 120765)
test, dev   = df_split(test, 4134)

answered Dec 20, 2020 at 9:06

Aaditya Ura


1 Comment

Allohvk Over a year ago

resetting index is important if you are using datasets and dataloaders or even otherwise it is a good convention. This is the only answer that talks of reindexing.

Nathan G · Accepted Answer · 2023-02-23 12:17:56Z

1

That's what I do:

train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset

answered Feb 23, 2023 at 12:17

Nathan G


Dima · Accepted Answer · 2023-10-30 15:36:06Z

I do this in 2 ways.
Method 1:

from sklearn.model_selection import train_test_split
#Split the dataset into X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Method 2:

from sklearn.model_selection import train_test_split
#Split the dataset into X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

Also for larger dataframes, please check out Intel® Distribution of Modin* instead of pandas (https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html#gs.1dtwen) and Intel® Extension for Scikit-learn* (https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html#gs.1dtvml). These framework optimizations will help to accelerate performance on Intel hardware.

Akash Jain · Accepted Answer · 2016-10-13 16:34:46Z

0

How about this? df is my dataframe

import math

total_size=len(df)

train_size=math.floor(0.66*total_size) # (2/3 part of my dataset)

#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)

answered Oct 13, 2016 at 16:34

Akash Jain


Anshuman Tekriwal · Accepted Answer · 2021-12-27 06:16:40Z

0

I would use K-fold cross validation. It generally gives a more reliable performance estimate than a single train_test_split, because every row is used for testing exactly once. Here's how to apply it with sklearn, from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
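A minimal sketch of K-fold scoring with `cross_val_score`, assuming synthetic regression data and a `LinearRegression` model (both chosen here only for illustration): each of the 5 folds serves once as the test set, and the scores are averaged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a linear signal plus a little noise (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One R^2 score per fold; the mean summarizes out-of-sample performance.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores.mean())
```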

answered Dec 27, 2021 at 6:16

Anshuman Tekriwal


BSalita · Accepted Answer · 2022-11-20 16:01:58Z

Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.

Method 1: Split rows into train, validate, test dataframes.

train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df

Method 2: Split rows when validate must be subset of train (fastai)

train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=0.1) # assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))
