1

I have a cat and dog image dataset, split into two folders (cat and dog), each containing roughly 10000 images. I don't need all 10000 images; I only need 2000 in each folder. How can I automate this in Python?

I know that to delete a file X I can use os.remove(X), and similarly os.rmdir(dir_) to delete a folder.

But I'm wondering how I could delete n randomly chosen files in each folder efficiently.

So far I have tried:

import os
import numpy as np

dogs_dir=os.listdir('dogs')
cats_dir=os.listdir('cats')

selected_dogs = np.random.choice(dogs_dir,8000)
selected_cats = np.random.choice(cats_dir,8000)

for file_ in selected_dogs:
    os.remove('dogs/'+file_)

for file_ in selected_cats:
    os.remove('cats/'+file_)    

The above code does the job for me, but I'm wondering whether there is a more efficient way, so that I can reduce the complexity of my code.

Any help would be appreciated.

I'm using Ubuntu 17.10. For now a Linux-based solution is sufficient, but if it is also compatible with Windows, that would be even better.

8
  • Please clarify: a) OS: Linux, Windows, or do you need a universal solution? b) Are you aware of shutil? Commented Feb 19, 2019 at 9:57
  • Your code seems reasonable. The only things I'd say are: 1) np.random.choice samples with replacement by default; pass replace=False to avoid picking the same file twice. 2) If you want, you can avoid using NumPy for this task by just using random.sample. Commented Feb 19, 2019 at 9:59
  • @AlexYu - I've updated the question. Commented Feb 19, 2019 at 10:00
  • You can also consider moving 8000 files to another directory and then deleting that entire directory. Commented Feb 19, 2019 at 10:03
  • @jdehesa - I'll keep this in mind, thanks for the advice. Commented Feb 19, 2019 at 10:04
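
The suggestion in the comments to move the surplus files into another directory and then delete that directory could look roughly like the sketch below. The function name discard_random_files and the use of a temporary staging directory next to the image folder are assumptions made for illustration, not something from the question.

```python
import os
import random
import shutil
import tempfile

def discard_random_files(directory, n_discard):
    """Move n_discard randomly chosen files out of `directory`, then delete them together."""
    # Only regular files are candidates; subdirectories are left alone.
    names = [name for name in os.listdir(directory)
             if os.path.isfile(os.path.join(directory, name))]
    # Hypothetical staging directory, created beside `directory` so that
    # the moves normally stay on the same filesystem.
    staging = tempfile.mkdtemp(dir=os.path.dirname(os.path.abspath(directory)))
    try:
        # random.sample picks without replacement, so no file is chosen twice.
        for name in random.sample(names, n_discard):
            shutil.move(os.path.join(directory, name), os.path.join(staging, name))
    finally:
        # Remove the staging directory and everything moved into it.
        shutil.rmtree(staging)

# Example call (not executed here): discard_random_files('dogs', 8000)
```

Until shutil.rmtree runs, the moved files still exist in the staging directory, which leaves a window to inspect the selection before it is gone for good.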

2 Answers

3

Your code seems okay to me.

A few adjustments I would make:

  1. Build paths with os.path.join so the code is cross-platform. When you write os.remove('dogs/'+file_), the / separator is hard-coded, whereas os.path.join picks the right separator for the platform. It would be better to use os.remove(os.path.join('dogs', file_)).

  2. You're using a lot of space holding the lists of filenames to delete (two lists of 10000 strings). If it doesn't matter which images you keep, you could save a little space (about 20%) by slicing:

    dogs_delete=os.listdir('dogs')[2000:]  # Everything after the first 2000 entries (os.listdir order is arbitrary)
    for file_ in dogs_delete:
        os.remove(os.path.join('dogs', file_))
    

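    If the images to delete should be chosen at random, it is better to sample indices rather than filenames (less space):

    import os
    import random

    dogs_dir=os.listdir('dogs')
    # random.sample needs a sequence, so sample from range(...) rather than from an int
    for num in random.sample(range(len(dogs_dir)), 8000):
        os.remove(os.path.join('dogs', dogs_dir[num]))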
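
Since the question says each folder holds roughly 10000 images, deleting exactly 8000 may not leave exactly 2000. A minimal sketch that instead fixes the number of files to keep, combining os.path.join with random.sample, might look like this. The function name keep_random_files is hypothetical, and the sketch assumes the folder contains only image files.

```python
import os
import random

def keep_random_files(directory, n_keep):
    """Delete randomly chosen files until at most n_keep entries remain.

    Returns the number of files deleted.
    """
    names = os.listdir(directory)
    # Never ask for a negative sample size if the folder is already small enough.
    n_delete = max(len(names) - n_keep, 0)
    # random.sample draws without replacement, so each file is removed once.
    for name in random.sample(names, n_delete):
        os.remove(os.path.join(directory, name))
    return n_delete

# Example calls (not executed here):
# keep_random_files('dogs', 2000)
# keep_random_files('cats', 2000)
```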
    

2

Use random.sample() and the pathlib module:

from pathlib import Path
import random

def delete_images(directory, number_of_images, extension='jpg'):
    # glob() yields paths lazily; random.sample needs a sequence, so build a list first
    images = list(Path(directory).glob(f'*.{extension}'))
    for image in random.sample(images, number_of_images):
        image.unlink()

delete_images('dogs', 8000)
delete_images('cats', 8000)

Path('cats/').glob('*.jpg') returns a generator of Path objects that represent files in the cats directory whose filenames end with .jpg. Wrapping it in list() gives random.sample the sequence it requires.

random.sample(<something>, 8000) takes a random sample of 8000 distinct items from a sequence, without replacement.

Path().unlink() deletes a file.
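
Because the deletion cannot be undone, it can help to look at the selection before unlinking anything. The sketch below separates picking from deleting; the function name pick_images and the seed parameter are assumptions added for illustration. Sorting the paths first makes a given seed reproduce the same selection as long as the folder contents are unchanged.

```python
from pathlib import Path
import random

def pick_images(directory, number_of_images, extension='jpg', seed=None):
    """Return a random selection of image paths without deleting anything."""
    # Sort so the population order does not depend on the filesystem.
    images = sorted(Path(directory).glob(f'*.{extension}'))
    # Cap the request so a small folder does not raise ValueError.
    count = min(number_of_images, len(images))
    # A dedicated Random instance keeps the seed local to this call.
    return random.Random(seed).sample(images, count)

# Example use (not executed here):
# to_delete = pick_images('dogs', 8000, seed=0)
# print(len(to_delete), to_delete[:5])   # inspect before committing
# for image in to_delete:
#     image.unlink()
```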

1 Comment

@MohamedThasinah pathlib was added in Python 3.4 which was 5 years ago. I think it's not used much because it's not available on Python 2.
