I have a dataframe, let's say:

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5], 'val1': [1, 2, 1, 1, 2, 1, 2, 3], 'val2': [3, 3, 4, 4, 4, 3, 4, 4]})

I want to split it into two dataframes (train and test) using the values in the id column. The first dataframe should contain 80% of the unique ids and the second the remaining 20%. The ids should be split randomly, and every row belonging to an id should end up in the same dataframe.

My own attempt:

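import random
import pandas as pd

def train_test_split(df, test_size=0.2, prng_seed=None):
    # Seeded generator so the shuffle is reproducible.
    prng = random.Random(prng_seed)
    id_list = df['id'].unique().tolist()
    prng.shuffle(id_list)
    # Number of unique ids assigned to the test set (rounded down).
    test_abs_size = int(len(id_list) * test_size)
    # Use an explicit split index: slicing with -test_abs_size misbehaves
    # when test_abs_size is 0 (id_list[-0:] is the whole list).
    split_at = len(id_list) - test_abs_size
    train_id = id_list[:split_at]
    test_id = id_list[split_at:]
    train_data = df[df['id'].isin(train_id)]
    test_data = df[df['id'].isin(test_id)]
    return train_data, test_data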
  • But the ids occur multiple times? Commented May 21, 2017 at 14:13
  • It sounds like you want to apply stratification such that the distribution of customer ids is preserved: train_test_split(df, test_size = 0.2, stratify=df.id) Commented May 21, 2017 at 18:08

1 Answer

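The following code splits the rows of the dataset into 80-20 train-test sets. Note that it splits by row, not by id, so the same id can appear in both sets:

import pandas as pd
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)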

Per @JanTrienes's comment, if you want to preserve the distribution of ids, you can use stratify, which keeps each id's share of the rows roughly the same in both sets. The following code does that:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 4,
                          1, 1, 2, 2, 2, 3, 4, 4],
                 'val1': [1, 2, 1, 1, 2, 1, 2, 3,
                          1, 2, 1, 1, 2, 1, 2, 3],
                 'val2': [3, 3, 4, 4, 4, 3, 4, 4,
                          3, 3, 4, 4, 4, 3, 4, 4]})

train, test = train_test_split(df, test_size = 0.2, stratify=df.id)

Here is an example of what the output would be:

train:
    id  val1  val2
0    1     1     3
7    4     3     4
15   4     3     4
13   3     1     3
14   4     2     4
11   2     1     4
9    1     2     3
8    1     1     3
12   2     2     4
4    2     2     4
2    2     1     4
5    3     1     3
test:
    id  val1  val2
6    4     2     4
10   2     1     4
1    1     2     3
3    2     1     4
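
Note that the stratified split above still places rows for the same id in both sets (id 4, for example, appears in both train and test). If each id should instead land in exactly one set, as the asker's comment below describes, one possible approach is scikit-learn's GroupShuffleSplit with the id column as the groups. This is only a minimal sketch, assuming a scikit-learn version that provides sklearn.model_selection; the random_state value and the shared_ids check are illustrative choices, not part of the original answer.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Data from the question: 5 unique ids, some repeated across rows.
df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5],
                   'val1': [1, 2, 1, 1, 2, 1, 2, 3],
                   'val2': [3, 3, 4, 4, 4, 3, 4, 4]})

# test_size here is the fraction of groups (unique ids), not of rows.
# random_state=0 is an arbitrary seed chosen for reproducibility.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# split() yields positional row indices; groups tells it which rows belong together.
train_idx, test_idx = next(splitter.split(df, groups=df['id']))
train = df.iloc[train_idx]
test = df.iloc[test_idx]

# Sanity check: no id should appear in both sets.
shared_ids = set(train['id']) & set(test['id'])
print(train)
print(test)
print("ids in both sets:", shared_ids)
```

Because whole ids are moved together, the row counts of the two sets will generally not be exactly 80/20; only the id counts follow the requested ratio.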

3 Comments

In my problem setting, I am interested in unique ids; you can think of them as customer ids. I want to train on the data from 80% of the customers and test on the data from the remaining 20%.
@user128751 is it acceptable to remove duplicate IDs prior to splitting into training and testing sets?
@user128751 does my most recent edit answer your question? If not, could you add an example of the desired output you are looking for?
