Split a pandas dataframe into two dataframes based on values in a column

Question

I have a dataframe, let's say:

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5],
                   'val1': [1, 2, 1, 1, 2, 1, 2, 3],
                   'val2': [3, 3, 4, 4, 4, 3, 4, 4]})

I want to split it into two dataframes (train and test) using the values in the id column. The split should be such that the first dataframe contains all rows for 80% of the (unique) ids and the second dataframe contains all rows for the remaining 20% of the ids. The ids should be split randomly.

My own attempt:

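import random
import pandas as pd

def train_test_split(df, test_size=0.2, prng_seed=None):
    # Seeded generator so the split is reproducible
    prng = random.Random(prng_seed)
    # Shuffle the unique ids, not the rows
    id_list = df['id'].unique().tolist()
    prng.shuffle(id_list)
    # Number of ids that go to the test set
    test_abs_size = int(len(id_list) * test_size)
    # Slice from the front: id_list[-0:] would return every id when test_abs_size is 0
    test_id = id_list[:test_abs_size]
    train_id = id_list[test_abs_size:]
    # Keep every row of an id in the same set
    train_data = df[df['id'].isin(train_id)]
    test_data = df[df['id'].isin(test_id)]
    return train_data, test_data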
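
For comparison, the same id-level split can probably be sketched with pandas alone, by sampling the unique ids and then selecting rows. This is only an illustration on the sample frame above; the names test_ids and in_test and the seed random_state=0 are assumptions, not part of the original attempt.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5],
                   'val1': [1, 2, 1, 1, 2, 1, 2, 3],
                   'val2': [3, 3, 4, 4, 4, 3, 4, 4]})

# Draw 20% of the unique ids; random_state is an arbitrary seed for reproducibility
test_ids = df['id'].drop_duplicates().sample(frac=0.2, random_state=0)

# Rows whose id was drawn go to test, all other rows go to train
in_test = df['id'].isin(test_ids)
train, test = df[~in_test], df[in_test]

print(sorted(train['id'].unique()), sorted(test['id'].unique()))
```

Because the sampling is done on ids rather than rows, the row counts of the two frames will generally not be an exact 80/20 split.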

It sounds like you want to apply stratification such that the distribution of customer ids is preserved: train_test_split(df, test_size=0.2, stratify=df.id)
– Jan Trienes, Commented May 21, 2017 at 18:08

Patrick Hingston · Accepted Answer · 2017-05-21 18:42:11Z

The following code splits the rows of the dataset into 80% train and 20% test sets:

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)

Per @JanTrienes' comment, if you want to preserve the distribution of ids, you can use the stratify parameter. The following code does that:

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 4,
                          1, 1, 2, 2, 2, 3, 4, 4],
                   'val1': [1, 2, 1, 1, 2, 1, 2, 3,
                            1, 2, 1, 1, 2, 1, 2, 3],
                   'val2': [3, 3, 4, 4, 4, 3, 4, 4,
                            3, 3, 4, 4, 4, 3, 4, 4]})

# stratify keeps the proportion of each id roughly equal in train and test
train, test = train_test_split(df, test_size=0.2, stratify=df.id)

Here is an example of what the output would be:

train:
    id  val1  val2
0    1     1     3
7    4     3     4
15   4     3     4
13   3     1     3
14   4     2     4
11   2     1     4
9    1     2     3
8    1     1     3
12   2     2     4
4    2     2     4
2    2     1     4
5    3     1     3
test:
    id  val1  val2
6    4     2     4
10   2     1     4
1    1     2     3
3    2     1     4
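
Note that stratify keeps the proportion of each id similar in both sets, so the same id can appear in train and in test. If the goal is instead to keep every row of an id in a single set, scikit-learn's GroupShuffleSplit may be a closer fit. What follows is a minimal sketch on the question's sample frame, not a tested recommendation; random_state=0 is an arbitrary seed and the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5],
                   'val1': [1, 2, 1, 1, 2, 1, 2, 3],
                   'val2': [3, 3, 4, 4, 4, 3, 4, 4]})

# test_size applies to the groups (unique ids), not to the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# split() yields positional row indices; groups tells it which rows belong together
train_idx, test_idx = next(splitter.split(df, groups=df['id']))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No id should be shared between the two sets
print(set(train['id']) & set(test['id']))
```

With this approach the 80/20 ratio holds for the ids, so the row counts depend on how many rows each sampled id has.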

edited May 21, 2017 at 18:42

answered May 21, 2017 at 15:52


3 Comments

user128751 Over a year ago

In my problem setting, I am interested in unique ids. You can think of them as customer ids: I want to train on the data from 80% of the customers and test on the data from the remaining 20% of the customers.

Patrick Hingston Over a year ago

@user128751 is it acceptable to remove duplicate IDs prior to splitting into training and testing sets?

Patrick Hingston Over a year ago

@user128751 does my most recent edit answer your question? If not, could you add an example of the desired output you are looking for?
