565

I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.

Thanks!

30 Answers

991

Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

7 Comments

This will return numpy arrays and not Pandas Dataframes however
Btw, it does return a Pandas Dataframe now (just tested on Sklearn 0.16.1)
In new versions (0.18, maybe earlier), import as from sklearn.model_selection import train_test_split instead.
In the newest SciKit version you need to call it now as: from sklearn.cross_validation import train_test_split
@horseshoe the cv module is deprecated: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning)
488

I would just use numpy's rand to build a random boolean mask:

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to see this has worked:

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
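
For readers wondering what the mask lines do, here is a minimal sketch of the same idea with a seeded generator so the result is repeatable. The seed value and the column names are assumptions made for illustration; they are not part of the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=["a", "b"])

# One uniform draw in [0, 1) per row; comparing with 0.8 gives a boolean array
msk = rng.random(len(df)) < 0.8

train = df[msk]    # rows where the mask is True (roughly 80%)
test = df[~msk]    # rows where the mask is False (the remainder)

# Every row lands in exactly one of the two sets
print(len(train) + len(test) == len(df))
```

Because each row is assigned independently, the split sizes are only approximately 80/20, which is why the counts above come out as 79 and 21 rather than 80 and 20.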

16 Comments

Sorry, my mistake. As long as msk is of dtype bool, df[msk], df.iloc[msk] and df.loc[msk] always return the same result.
I think you should use rand for the < 0.8 comparison to make sense, because it returns uniformly distributed random numbers between 0 and 1.
Can someone explain purely in python terms what exactly happens in lines in[12], in[13], in[14]? I want to understand the python code itself here
The answer using sklearn from gobrewers14 is the better one. It's less complex and easier to debug. I recommend using the answer below.
@kuatroka np.random.rand(len(df)) is an array of size len(df) with uniformly distributed random float values in the range [0, 1). The < 0.8 applies the comparison element-wise and returns a new boolean array. Thus values < 0.8 become True and values >= 0.8 become False.
448

Pandas random sample will also work

train=df.sample(frac=0.8,random_state=200)
test=df.drop(train.index)

For the same random_state value you will always get the same exact data in the training and test set. This brings in some level of repeatability while also randomly separating training and test data.
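
To illustrate the repeatability claim, here is a small sketch on a toy frame (the data and column names are made up for the example). It only checks that two calls with the same seed select the same rows.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Same seed, same rows selected on every call
train_a = df.sample(frac=0.8, random_state=200)
train_b = df.sample(frac=0.8, random_state=200)
test = df.drop(train_a.index)

print(train_a.index.equals(train_b.index))
print(len(train_a), len(test))
```

One caveat: drop(train.index) works by index label, so if the index contains duplicate labels it removes every row sharing a label. Calling df.reset_index(drop=True) before sampling avoids that.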

10 Comments

what is random_state arg doing?
@RishabhAgrahari without random_state, a different random split is drawn every time, sized according to the frac arg. If you want to control the randomness you can pass your own seed, as in the example.
This seems to work well and a more elegant solution than bringing in sklearn. Is there a reason why this shouldn't be a better accepted answer?
@RajV in its current form test will be randomly selected but rows will be in their original order. The sklearn approach shuffles both train and test.
@peer that limitation is easily remedied if a shuffled test set is desired as pointed out here stackoverflow.com/questions/29576430/shuffle-dataframe-rows. test=df.drop(train.index).sample(frac=1.0)
42

I would use scikit-learn's own train_test_split, and generate the split from the index

from sklearn.model_selection import train_test_split


y = df.pop('output')
X = df

X_train,X_test,y_train,y_test = train_test_split(X.index,y,test_size=0.2)
X.loc[X_train]  # X_train holds index labels, so select with .loc to get the training dataframe

2 Comments

The cross_validation module is now deprecated: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
This gives an error when I do it with a df whose output column is strings. I get TypeError: '<' not supported between instances of 'str' and 'float'. It appears that y needs to be a DataFrame not a Series. Indeed, appending .to_frame() either the definition of y or the argument y in train_test_split works. If you're using stratify = y, you need to make sure that this y is a DataFrame too. If I instead define y = df[["output"]] and X = df.drop("output", axis = 1) then it works too; this is basically the same as appending .to_frame() to the definition of y.
34

No need to convert to numpy. Just use a pandas df to do the split and it will return a pandas df.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

And if you want to split x from y

X_train, X_test, y_train, y_test = train_test_split(df[list_of_x_cols], df[y_col],test_size=0.2)

And if you only want to separate the whole df into features and target, without a train/test split

X, y = df[list_of_x_cols], df[y_col]

Comments

28

There are many ways to create a train/test and even validation samples.

Case 1: classic way train_test_split without any options:

from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

Case 2: very small datasets (<500 rows): use k-fold cross-validation to get a prediction for every row. At the end, you will have one prediction for each row of your available training set.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
# X and y are assumed to be NumPy arrays; for pandas objects index with .iloc
kf = KFold(n_splits=10, shuffle=True, random_state=0)  # random_state requires shuffle=True
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    reg = RandomForestRegressor(n_estimators=50, random_state=0)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    reg.fit(X_train, y_train)
    y_hat = reg.predict(X_test)
    y_hat_all.append(y_hat)

Case 3a: Unbalanced datasets for classification purpose. Following the case 1, here is the equivalent solution:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

Case 3b: Unbalanced datasets for classification purposes. Following case 2, here is the equivalent solution:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
# X and y are assumed to be NumPy arrays; for pandas objects index with .iloc
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # random_state requires shuffle=True
y_hat_all = []
for train_index, test_index in kf.split(X, y):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)  # a classifier, since y holds class labels
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_test)
    y_hat_all.append(y_hat)

Case 4: you need to create train/test/validation sets on big data to tune hyperparameters (60% train, 20% test and 20% val).

from sklearn.model_selection import train_test_split
# hold out 40% first, then split that part in half
X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)
# stratify must match the data being split, so use y_test_val rather than y
X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)
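
As a quick sanity check of the 60/20/20 idea, here is a sketch on toy arrays. The data, the seed and the use of stratify in both calls are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)   # toy labels, imbalanced 80/20

# First split: 60% train, 40% held out
X_train, X_test_val, y_train, y_test_val = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)

# Second split: halve the held-out part into test and validation.
# stratify has to be the labels of the data being split.
X_test, X_val, y_test, y_val = train_test_split(
    X_test_val, y_test_val, test_size=0.5, stratify=y_test_val, random_state=0)

print(len(X_train), len(X_test), len(X_val))
```

Passing the full y as stratify in the second call would fail, because its length no longer matches the held-out subset.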

Comments

16

You can use the code below to create test and train samples:

from sklearn.model_selection import train_test_split
trainingSet, testSet = train_test_split(df, test_size=0.2)

Test size can vary depending on the percentage of data you want to put in your test and train dataset.

Comments

8

There are many valid answers. Adding one more to the bunch, using only pandas:

# gets a random 80% of the entire set
X_train = X.sample(frac=0.8, random_state=1)
# gets the left-out portion of the dataset
X_test = X.loc[~X.index.isin(X_train.index)]

Comments

7

You may also consider a stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
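
As an alternative sketch of the same stratified idea using only pandas, groupby(...).sample(...) (available in pandas 1.1 and later) draws the same fraction from every class. The column name label and the toy data are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "feature": range(100),
    "label": ["a"] * 80 + ["b"] * 20,   # imbalanced classes, 80/20
})

# Draw 70% of the rows from each class separately
train = df.groupby("label").sample(frac=0.7, random_state=0)
test = df.drop(train.index)

print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

Both sets keep the original 4:1 ratio between the two classes, which a plain random split does not guarantee.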

2 Comments

This is the preferable strategy for supervised learning tasks.
When trying to use this I am getting an error. ValueError: assignment destination is read-only in the line "np.random.shuffle(value_inds)"
6

You can use ~ (tilde operator) to exclude the rows sampled using df.sample(), letting pandas alone handle sampling and filtering of indexes, to obtain two sets.

train_df = df.sample(frac=0.8, random_state=100)
test_df = df[~df.index.isin(train_df.index)]

Comments

5

Just select a range of rows from df like this:

row_count = df.shape[0]
split_point = int(row_count*1/5)
test_data, train_data = df[:split_point], df[split_point:]
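
Slicing by position is only a random split if the rows are in random order. A minimal sketch of shuffling first and then slicing, on a toy frame with an arbitrary seed (both are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({"x": range(50)})   # toy frame, ordered on purpose

# frac=1 returns every row, in random order
shuffled = df.sample(frac=1, random_state=0)

row_count = shuffled.shape[0]
split_point = int(row_count * 1 / 5)
test_data, train_data = shuffled[:split_point], shuffled[split_point:]

print(len(test_data), len(train_data))
```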

3 Comments

This would only work if the data in the dataframe is already randomly ordered. If the dataset is derived from multiple sources and has been appended to the same dataframe then it's quite possible to get a very skewed dataset for training/testing using the above.
You can shuffle the dataframe before splitting it: stackoverflow.com/questions/29576430/shuffle-dataframe-rows
Absolutely! If you add that df in your code snippet is (or should be) shuffled, it will improve the answer.
4

If you need to split your data with respect to the labels column in your data set you can use this:

import pandas as pd

def split_to_train_test(df, label_column, train_frac=0.8):
    train_parts, test_parts = [], []
    labels = df[label_column].unique()
    for lbl in labels:
        lbl_df = df[df[label_column] == lbl]
        lbl_train_df = lbl_df.sample(frac=train_frac)
        lbl_test_df = lbl_df.drop(lbl_train_df.index)
        print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d' % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
        train_parts.append(lbl_train_df)
        test_parts.append(lbl_test_df)

    # DataFrame.append was removed in pandas 2.0, so concatenate once at the end
    return pd.concat(train_parts), pd.concat(test_parts)

and use it:

train, test = split_to_train_test(data, 'class', 0.7)

You can also pass random_state to sample if you want to control the split randomness, or use a global random seed.

Comments

4

To split into more than two sets, such as train, test, and validation, one can do:

probs = np.random.rand(len(df))
training_mask = probs < 0.7
test_mask = (probs>=0.7) & (probs < 0.85)
validation_mask = probs >= 0.85


df_training = df[training_mask]
df_test = df[test_mask]
df_validation = df[validation_mask]

This will put approximately 70% of data in training, 15% in test, and 15% in validation.
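
If exact set sizes matter, one alternative sketch is to shuffle the row positions once and cut them at fixed boundaries. The toy frame, the seed and the integer arithmetic for the boundaries are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(1000)})

rng = np.random.default_rng(42)        # arbitrary seed
perm = rng.permutation(len(df))        # shuffled row positions

n_train = len(df) * 70 // 100          # 70% for training
n_test = len(df) * 15 // 100           # 15% for test, rest for validation

# Cut the shuffled positions at the two boundaries
train_idx, test_idx, val_idx = np.split(perm, [n_train, n_train + n_test])

df_training = df.iloc[train_idx]
df_test = df.iloc[test_idx]
df_validation = df.iloc[val_idx]

print(len(df_training), len(df_test), len(df_validation))
```

Unlike the mask approach, the three sizes here are fixed by the boundaries rather than by chance.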

1 Comment

You might want to edit your answer to add "approximately", if you run the code you will see that it can be quite off from the exact percentage. e.g. I tried it on 1000 items and got: 700, 141, 159 - so 70%, 14% and 16%.
4
shuffle = np.random.permutation(len(df))
test_size = int(len(df) * 0.2)
test_aux = shuffle[:test_size]
train_aux = shuffle[test_size:]
TRAIN_DF =df.iloc[train_aux]
TEST_DF = df.iloc[test_aux]

4 Comments

This would be a better answer if you explained how the code you provided answers the question.
While this code may answer the question, providing additional context regarding how and/or why it solves the problem would improve the answer's long-term value.
The first line returns a shuffled range (with respect to the size of the dataframe). The second line represents the desired fraction of the test set. The third and fourth lines incorporate the fraction into the shuffled range. The remaining lines should be self-explanatory. Regards.
Adding this explanation to the answer itself will be optimal :)
3
import pandas as pd

from sklearn.model_selection import train_test_split

datafile_name = 'path_to_data_file'

data = pd.read_csv(datafile_name)

target_attribute = data['column_name']

X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.2)  # 20% test, 80% train

1 Comment

You have a small mistake. You should drop the target column before you pass the data into train_test_split: data = data.drop(columns = ['column_name'])
2

This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but didn't like that I could not control the size of the data sets exactly (i.e., it would be sometimes 79, sometimes 81, etc.).

def make_sets(data_df, test_portion):
    import random as rnd

    tot_ix = range(len(data_df))
    # sample an exact number of row positions for the test set
    test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
    train_ix = sorted(set(tot_ix) - set(test_ix))

    # .ix was removed from pandas; select by position with .iloc
    test_df = data_df.iloc[test_ix]
    train_df = data_df.iloc[train_ix]

    return train_df, test_df


train_df, test_df = make_sets(data_df, 0.2)
test_df.head()

Comments

2

There are many great answers above so I just wanna add one more example in the case that you want to specify the exact number of samples for the train and test sets by using just the numpy library.

# set the random seed for the reproducibility
np.random.seed(17)

# e.g. number of samples for the training set is 1000
n_train = 1000

# shuffle the indexes
shuffled_indexes = np.arange(len(data_df))
np.random.shuffle(shuffled_indexes)

# use 'n_train' samples for training and the rest for testing
train_ids = shuffled_indexes[:n_train]
test_ids = shuffled_indexes[n_train:]

train_data = data_df.iloc[train_ids]
train_labels = labels_df.iloc[train_ids]

test_data = data_df.iloc[test_ids]
test_labels = labels_df.iloc[test_ids]

Comments

2

if you want to split it to train, test and validation set you can use this function:

from sklearn.model_selection import train_test_split
import pandas as pd

def train_test_val_split(df, test_size=0.15, val_size=0.45):
    temp, test = train_test_split(df, test_size=test_size)
    total_items_count = len(df.index)
    val_length = total_items_count * val_size
    new_val_propotion = val_length / len(temp.index) 
    train, val = train_test_split(temp, test_size=new_val_propotion)
    return train, test, val

Comments

2

The sample method selects a random fraction of the rows; passing a seed value as random_state makes the selection reproducible.

train = df.sample(frac=0.8, random_state=42)

For the test set you can drop the rows using the index of the train DF and then reset the index of the new DF.

test = df.drop(train.index).reset_index(drop=True)

4 Comments

Please read How to Answer and edit your answer to contain an explanation as to why this code would actually solve the problem at hand. Always remember that you're not only solving the problem, but are also educating the OP and any future readers of this post.
I think it's self explanatory. OP asked for splitting df into train and test, which these two variables represents. I'll still read the linked doc though. Thanks
The mere fact that the OP asked about this shows they don't have a complete understanding of Pandas, which on its own is enough to merit an explanation as to why this works.
But that is a clone of an already existing and highly upvoted answer. Please, when answering to old questions, be sure to bring new information that was not present in previous answers (for example, because of technical changes since), and to explicitly make clear what is new.
1

If your wish is to have one dataframe in and two dataframes out (not numpy arrays), this should do the trick:

def split_data(df, train_perc = 0.8):

   df['train'] = np.random.rand(len(df)) < train_perc

   train = df[df.train == 1]

   test = df[df.train == 0]

   split_data ={'train': train, 'test': test}

   return split_data

Comments

1

I think you also need to get a copy, not a slice, of the dataframe if you want to add columns later.

msk = np.random.rand(len(df)) < 0.8
train, test = df[msk].copy(deep = True), df[~msk].copy(deep = True)

Comments

1

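You can use the df.to_numpy() method (df.as_matrix() was removed in pandas 1.0) to create a NumPy array and pass it.

Y = df.pop('target')  # 'target' is a placeholder for your label column; pop() requires a column name
X = df.to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
model.fit(x_train, y_train)
model.score(x_test, y_test)  # assumes model is a scikit-learn estimator, which has score() but no test() method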

Comments

1

A bit more elegant to my taste is to create a random column and then split by it, this way we can get a split that will suit our needs and will be random.

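def split_df(df, p=[0.8, 0.2]):
    import numpy as np
    # assign each row a random group id (0, 1, ...) with probabilities p
    df["rand"] = np.random.choice(len(p), len(df), p=p)
    # iterate over group ids in order so the result matches the order of p
    r = [df[df["rand"] == val] for val in range(len(p))]
    return r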

Comments

1

you need to convert pandas dataframe into numpy array and then convert numpy array back to dataframe

import pandas as pd
df=pd.read_csv('/content/drive/My Drive/snippet.csv', sep='\t')
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
train1=pd.DataFrame(train)
test1=pd.DataFrame(test)
train1.to_csv('/content/drive/My Drive/train.csv',sep="\t",header=None, encoding='utf-8', index = False)
test1.to_csv('/content/drive/My Drive/test.csv',sep="\t",header=None, encoding='utf-8', index = False)

3 Comments

Code-only answers aren't acceptable on Stack Overflow.
Converting to numpy is not needed, and is not actually performed in this code.
btw -- it does return a dataframe now!
1

In my case, I wanted to split a data frame into train, test and dev sets with a specific number of rows each. Here I am sharing my solution.

First, assign a unique id to each row of the dataframe (if one does not already exist)

import uuid
df['id'] = [uuid.uuid4() for i in range(len(df))]

Here are my split numbers:

train = 120765
test  = 4134
dev   = 2816

The split function

def df_split(df, n):
    
    first  = df.sample(n)
    second = df[~df.id.isin(list(first['id']))]
    first.reset_index(drop=True, inplace = True)
    second.reset_index(drop=True, inplace = True)
    return first, second

Now splitting into train, test, dev

train, test = df_split(df, 120765)
test, dev   = df_split(test, 4134)

1 Comment

resetting index is important if you are using datasets and dataloaders or even otherwise it is a good convention. This is the only answer that talks of reindexing.
1

That's what I do:

train_dataset = dataset.sample(frac=0.80, random_state=200)
val_dataset = dataset.drop(train_dataset.index).sample(frac=1.00, random_state=200, ignore_index = True).copy()
train_dataset = train_dataset.sample(frac=1.00, random_state=200, ignore_index = True).copy()
del dataset

Comments

1

I do this in 2 ways.
Method 1:

from sklearn.model_selection import train_test_split
#Split the dataset into X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Method 2:

from sklearn.model_selection import train_test_split
#Split the dataset into X and y
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

Also for larger dataframes, please check out Intel® Distribution of Modin* instead of pandas (https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html#gs.1dtwen) and Intel® Extension for Scikit-learn* (https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html#gs.1dtvml). These framework optimizations will help to accelerate performance on Intel hardware.

Comments

0

How about this? df is my dataframe

import math

total_size=len(df)

train_size=math.floor(0.66*total_size)  # roughly 2/3 of my dataset

#training dataset
train=df.head(train_size)
#test dataset
test=df.tail(len(df) -train_size)

Comments

0

I would use K-fold cross validation. It usually gives a more reliable estimate of model performance than a single train_test_split, because every row is used for testing exactly once. Here's how to apply it with sklearn, from the documentation itself: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
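
A minimal sketch of what that looks like with a dataframe. The toy data, the number of folds and the seed are assumptions for illustration, and no model is fitted here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"x": np.arange(20), "y": np.arange(20) % 2})

kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_pos, test_pos in kf.split(df):
    # split() yields integer positions, so select with iloc rather than loc
    train, test = df.iloc[train_pos], df.iloc[test_pos]
    fold_sizes.append((len(train), len(test)))

print(fold_sizes)
```

Each row appears in the test part of exactly one fold, so the five test sets together cover the whole dataframe.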

Comments

0

Split df into train, validate, test. Given a df of augmented data, select only the dependent and independent columns. Assign 10% of most recent rows (using 'dates' column) to test_df. Randomly assign 10% of remaining rows to validate_df with rest being assigned to train_df. Do not reindex. Check that all rows are uniquely assigned. Use only native python and pandas libs.

Method 1: Split rows into train, validate, test dataframes.

train_df = augmented_df[dependent_and_independent_columns]
test_df = train_df.sort_values('dates').tail(int(len(augmented_df)*0.1)) # select latest 10% of dates for test data
train_df = train_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_df.sample(frac=0.1) # randomly assign 10%
train_df = train_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df

Method 2: Split rows when validate must be subset of train (fastai)

train_validate_test_df = augmented_df[dependent_and_independent_columns]
test_df = train_validate_test_df.loc[augmented_df.sort_values('dates').tail(int(len(augmented_df)*0.1)).index] # select latest 10% of dates for test data
train_validate_df = train_validate_test_df.drop(test_df.index) # drop rows assigned to test_df
validate_df = train_validate_df.sample(frac=0.1) # randomly assign 10% to validate_df
train_df = train_validate_df.drop(validate_df.index) # drop rows assigned to validate_df
assert len(augmented_df) == len(set(train_df.index).union(validate_df.index).union(test_df.index)) # every row must be uniquely assigned to a df
# fastai example usage
dls = fastai.tabular.all.TabularDataLoaders.from_df(
train_validate_df, valid_idx=train_validate_df.index.get_indexer_for(validate_df.index))

Comments
