3

I would like to split a list into 3 sublists (train, validation, test) using pre-defined ratios. The items should be assigned to the sublists randomly and without repetition. (My original list contains the names of the images in a folder, which I want to process after the split.) I found a method that works, but it seems complicated. Is there a simpler way to do this? My method is:

  • list the files in the folder,
  • define the necessary size of sublists,
  • randomly fill in the first sublist,
  • remove the used items from the original list,
  • randomly fill in the second sublist from the remaining list,
  • remove the used items to get the third sublist.

This is my code:

import random
import os 

# list files in folder
files = os.listdir("C:/.../my_folder")

# define the size of the sets: ~30% validation, ~20% test, ~50% training (remaining goes to training set)
validation_count = int(0.3 * len(files))
test_count = int(0.2 * len(files))
training_count = len(files) - validation_count - test_count

# randomly choose ~20% of all files for the test set
test_set = random.sample(files, k=test_count)

# remove the already chosen files from the original list
files_wo_test_set = [f for f in files if f not in test_set]

# randomly choose the validation set (~30% of all files) from the remaining files
validation_set = random.sample(files_wo_test_set, k=validation_count)

# the remaining files go into the training set
training_set = [f for f in files_wo_test_set if f not in validation_set]

5
  • this seems clean enough to me Commented Dec 11, 2020 at 16:03
  • So, what is the problem? Commented Dec 11, 2020 at 16:03
  • @mece1390 the op wants a cleaner way of doing it Commented Dec 11, 2020 at 16:04
  • What is the meaning of cleaner when he/she says I found a working method but it was complicated? What element makes it cleaner? What element makes it complicated? Commented Dec 11, 2020 at 16:10
  • Hello, thanks for the comments. I already got 2 answers which are more simple and elegant in my opinion, this was my goal. Thanks for the answers! Commented Dec 12, 2020 at 10:41

3 Answers 3

4

I think the answer is self-explanatory, so I am not adding any explanation.

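import random

random.shuffle(files)  # shuffles the list in place
k = test_count
set1 = files[:k]                          # test set
set2 = files[k:k + validation_count]      # validation set
set3 = files[k + validation_count:]       # training set (the rest)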
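
The same shuffle-and-slice idea can be wrapped in a small helper. This is only a sketch: the function name `split_three`, its parameters, and the generated file names are made up for illustration, and the ratios are assumed to be the ones from the question (30% validation, 20% test, the rest training).

```python
import random


def split_three(items, val_ratio=0.3, test_ratio=0.2, seed=None):
    """Shuffle a copy of `items` and slice it into (train, val, test)."""
    shuffled = list(items)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_val = int(val_ratio * len(shuffled))
    n_test = int(test_ratio * len(shuffled))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]  # whatever is left goes to training
    return train, val, test


names = [f"img_{i:03d}.png" for i in range(10)]  # hypothetical file names
train, val, test = split_three(names, seed=0)
print(len(train), len(val), len(test))
```

Because every item sits at exactly one position in the shuffled copy, the three slices cannot overlap, so no "remove the used items" step is needed.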

1

I'd recommend looking into the scikit-learn library, which provides the train_test_split function to do this for you. However, to answer your question using just the random module:

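import os
import random

# First shuffle the list randomly (in place)
files = os.listdir("C:/.../my_folder")
random.shuffle(files)

# Then just slice
ratio = len(files) // 5               # 20% of the files
val_count = int(0.3 * len(files))     # 30% of the files
test_set = files[:ratio]
val_set = files[ratio:ratio + val_count]
train_set = files[ratio + val_count:]  # remaining ~50%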

0

I hope this can help someone. scikit-learn has a function that does this easily (note that a single call splits the data into two parts, not three):

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split

>>> X = np.arange(15).reshape((5, 3))
>>> X
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

>>> X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

>>> X_train
array([[ 6,  7,  8],
       [ 0,  1,  2],
       [ 9, 10, 11]])

>>> X_test
array([[ 3,  4,  5],
       [12, 13, 14]])
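
To get the three subsets asked for in the question, one option is to call train_test_split twice: first to set aside the test set, then to split the remainder into training and validation sets. The following is a rough sketch, not a definitive recipe; the generated file names are placeholders, and the counts are computed the same way as in the question.

```python
from sklearn.model_selection import train_test_split

files = [f"img_{i:03d}.png" for i in range(10)]  # hypothetical file names

test_count = int(0.2 * len(files))        # ~20% of all files
validation_count = int(0.3 * len(files))  # ~30% of all files

# An integer test_size is read as an absolute number of items, not a fraction.
rest, test_set = train_test_split(files, test_size=test_count, random_state=42)

# Split what is left into training and validation sets.
training_set, validation_set = train_test_split(
    rest, test_size=validation_count, random_state=42
)

print(len(training_set), len(validation_set), len(test_set))
```

Passing integer counts avoids having to rescale the validation fraction for the second call, since that call only sees the files left over after the test set was removed.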
