
I am training a neural network on the Google Colab CPU (I cannot use a GPU due to another issue: FileNotFoundError: No such file, an error that occurs only on the GPU, not on the CPU) with the fit_generator method.

model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 128,
                    validation_steps=num_val_samples // 128,
                    epochs=10,
                    use_multiprocessing=True,
                    workers=6)

Training for the first epoch seems to run fine, but the second epoch never begins. The notebook does not crash and the iteration does not stop; the second epoch simply does not start.

Is there something wrong with my code?

1 Answer


Hey,

The epoch appears very slow because the model is computing the validation loss (and related metrics) at the end of the epoch. This is a common situation: Keras shows training progress but not validation progress, unless you build a custom callback for that.
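A minimal sketch of such a callback (the class and attribute names here are illustrative, not from the asker's code; in a real script it would subclass tf.keras.callbacks.Callback, whose hooks have the same signatures as the method below):

```python
class ValLossLogger:
    """Sketch of a callback that surfaces validation results per epoch.

    In real use, subclass tf.keras.callbacks.Callback and pass an instance
    via fit_generator(..., callbacks=[ValLossLogger()]); Keras then calls
    on_epoch_end(epoch, logs) with logs containing 'val_loss'.
    """

    def __init__(self):
        self.val_history = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        val_loss = logs.get("val_loss")
        self.val_history.append(val_loss)
        # Print so you can see that validation actually finished
        print(f"epoch {epoch}: val_loss={val_loss}")
```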

The issue with your fit_generator call is how steps_per_epoch and validation_steps are used. Unless your training and validation data have the same size (number of images), they cannot have the same number of steps (I mean they "can", but you know what I mean).
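For illustration, with hypothetical sample counts (the real num_train_samples and num_val_samples come from the asker's dataset), the two step values come out different:

```python
# Hypothetical counts, purely to illustrate the step arithmetic
num_train_samples = 70000
num_val_samples = 10000
batch_size = 128

# Same formulas as in the question: floor division by the batch size
steps_per_epoch = num_train_samples // batch_size
validation_steps = num_val_samples // batch_size

print(steps_per_epoch, validation_steps)  # 546 78
```

So as long as the two sample counts differ, the two step counts differ as well.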

I really recommend you use a GPU for data like this, since training is taking far too long on the CPU. Try debugging the GPU error; it is worth it.


Comments

Yes, you are right... thank you for your answer! But steps_per_epoch and validation_steps evaluate to the same numbers, and I am wondering why. steps_per_epoch is defined as num_train_samples // 128 and validation_steps as num_val_samples // 128, so there should be a different number of steps, because num_train_samples and num_val_samples are different; I checked this. Or am I misunderstanding something?
steps_per_epoch is len(train_data) / batch_size, or equivalently len(train_generator). Similarly, for validation data it is len(validation_data) / batch_size, or len(validation_generator). If you use the wrong values, your model will either underfit (too few steps) or overfit (too many steps).
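As a sketch of the len(generator) route (BatchGenerator is a hypothetical stand-in; a real one would subclass keras.utils.Sequence and also implement __getitem__), the generator's __len__ reports one step per batch, counting the final partial batch:

```python
import math

class BatchGenerator:
    """Sketch of a batch generator; a real one subclasses keras.utils.Sequence."""

    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size

    def __len__(self):
        # One step per batch, rounding up so the last partial batch is included
        return math.ceil(self.n_samples / self.batch_size)

# Hypothetical counts for illustration
train_gen = BatchGenerator(70000, 128)
print(len(train_gen))  # 547
```

Note the ceil here versus floor division (//) in the question: len(generator) includes the trailing partial batch, while num_samples // batch_size drops it.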
Your calculations are correct. You can try using a larger batch_size and more epochs so that validation does not have to wait as long. A Google Colab GPU takes about 2-5 minutes to train 1005 epoch steps, but it may vary with your input.
Steps are used when you are training on batches of data. Batches are used because of limited computational resources: you simply cannot hold all 100k images in RAM, because it will blow up. To prevent this, batches are used. We use len(val_data) / batch_size for validation_steps because we want to validate all the validation images. If you use val_steps=1, then only one batch of size batch_size will be validated. Your validation is then not very valid, because it has not seen the other ~70k images of your data; validating a single batch gives a noisy, biased estimate of performance.
It is recommended to run a validation epoch after each training epoch, so that we can monitor whether the model is overfitting or underfitting; there is no need to run more than one validation epoch per training epoch. I think you mean running validation after the complete training process. If you do that, there is no difference between validation and test data. Validation data is unseen data used to measure the model's performance as its weights change. Of course we "can" run validation after the entire training and measure performance, but then, if there is over/underfitting, you would have to train again.