I have a dataframe, let's say:
df = pd.DataFrame({'id': [1, 1, 2, 2, 2, 3, 4, 5], 'val1': [1, 2, 1, 1, 2, 1, 2, 3], 'val2': [3, 3, 4, 4, 4, 3, 4, 4]})
I want to split it into two dataframes (train and test) based on the values in the id column. The first dataframe should contain 80% of the unique ids and the second the remaining 20%, with the ids assigned to each side at random.
My own attempt:
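import random
import pandas as pd

def train_test_split(df, test_size=0.2, prng_seed=None):
    # Seeded generator so the split is reproducible
    prng = random.Random(prng_seed)
    # Shuffle the unique ids, not the rows
    id_list = df['id'].unique().tolist()
    prng.shuffle(id_list)
    # Number of ids that go to the test set (rounded down)
    test_abs_size = int(len(id_list) * test_size)
    # Split at a positive index: id_list[-0:] would return the whole list when test_abs_size is 0
    split_at = len(id_list) - test_abs_size
    train_id = id_list[:split_at]
    test_id = id_list[split_at:]
    # Keep every row whose id falls on the corresponding side
    train_data = df[df['id'].isin(train_id)]
    test_data = df[df['id'].isin(test_id)]
    return train_data, test_data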
ids occur multiple times?
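Because an id can span several rows, one thing worth checking is that all rows of a given id land on the same side. Below is a minimal sketch of such a check, assuming NumPy is available. `split_by_id` is a hypothetical helper name (not a pandas function), and it is a compact variant of the attempt above rather than a definitive implementation.

```python
import numpy as np
import pandas as pd


def split_by_id(df, id_col="id", test_size=0.2, seed=None):
    """Hypothetical helper: assign whole ids to either train or test."""
    rng = np.random.default_rng(seed)
    ids = df[id_col].unique()                 # each id counted once
    shuffled = rng.permutation(ids)           # random order of the ids
    n_test = int(len(shuffled) * test_size)   # rounded down
    test_ids = shuffled[:n_test]              # empty when n_test == 0
    in_test = df[id_col].isin(test_ids)       # row mask, True for test rows
    return df[~in_test], df[in_test]


df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 4, 5],
    "val1": [1, 2, 1, 1, 2, 1, 2, 3],
    "val2": [3, 3, 4, 4, 4, 3, 4, 4],
})

train, test = split_by_id(df, seed=0)

# No id appears on both sides, and no row is lost.
assert set(train["id"]).isdisjoint(test["id"])
assert len(train) + len(test) == len(df)
print(train["id"].nunique(), test["id"].nunique())  # 4 1
```

With 5 unique ids and test_size=0.2, one id goes to test and four to train. The row counts on each side still vary with the seed, since ids have different numbers of rows.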