I would like to split a list into 3 sublists (train, validation, test) using pre-defined ratios. The items should be assigned to the sublists randomly and without repetition. (My original list contains the names of the images in a folder, which I want to process after the split.) I found a method that works, but it seems complicated. Is there a simpler way to do this? My method is:
- list the files in the folder,
- define the necessary size of sublists,
- randomly fill in the first sublist,
- remove the used items from the original list,
- randomly fill in the second sublist from the remaining list,
- remove the used items to get the third sublist.
This is my code:
import random
import os
# list files in folder
files = os.listdir("C:/.../my_folder")
# define the size of the sets: ~30% validation, ~20% test, ~50% training (remaining goes to training set)
validation_count = int(0.3 * len(files))
test_count = int(0.2 * len(files))
training_count = len(files) - validation_count - test_count
# randomly choose ~20% of the files for the test set
test_set = random.sample(files, k = test_count)
# remove already chosen files from original list
files_wo_test_set = [f for f in files if f not in test_set]
# randomly choose the validation set (~30% of all files) from the remaining files
validation_set = random.sample(files_wo_test_set, k = validation_count)
# the remaining files go into the training set
training_set = [f for f in files_wo_test_set if f not in validation_set]
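
For comparison, here is one simpler variant I could imagine, as a rough sketch only: shuffle a copy of the list once and then cut it into three slices, so no items have to be removed in between. The function name split_three_ways, its parameters, and the example file names are made up for illustration; the real list would come from os.listdir as above.

```python
import random


def split_three_ways(items, validation_ratio=0.3, test_ratio=0.2, seed=None):
    """Shuffle a copy of `items` once, then slice it into three disjoint parts.

    Hypothetical helper for illustration; returns (training, validation, test).
    """
    rng = random.Random(seed)   # a seed makes the split reproducible
    shuffled = list(items)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)

    validation_count = int(validation_ratio * len(shuffled))
    test_count = int(test_ratio * len(shuffled))

    # Consecutive slices of a shuffled list cannot overlap,
    # so every item lands in exactly one sublist.
    test_set = shuffled[:test_count]
    validation_set = shuffled[test_count:test_count + validation_count]
    training_set = shuffled[test_count + validation_count:]  # the remainder
    return training_set, validation_set, test_set


# Made-up file names standing in for the result of os.listdir(...)
example_files = [f"img_{i:03d}.png" for i in range(10)]
training_set, validation_set, test_set = split_three_ways(example_files, seed=0)
print(len(training_set), len(validation_set), len(test_set))  # 5 3 2
```

As in my code, the counts are rounded down with int(), so any leftover files end up in the training set.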