
I am training a neural network on the Google Colab CPU (I cannot use a GPU due to another issue: FileNotFoundError: No such file, an error that occurs only on the GPU, not on the CPU) with the fit_generator method.

model.fit_generator(generator=training_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=num_train_samples // 128,
                    validation_steps=num_val_samples // 128,
                    epochs=10,
                    use_multiprocessing=True,
                    workers=6)

Training for the first epoch seems to run fine, but the second epoch never begins. The notebook does not crash and the iteration does not stop; the second epoch simply does not start.

Is there something wrong with my code?

1 Answer


Hey,

The epoch appears very slow because the model is computing the validation loss (and related metrics) at the end of the epoch. This is a common situation: Keras shows training progress but not validation progress, unless you build a custom callback for that.
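A minimal sketch of such a callback (the class and attribute names here are illustrative, not from the asker's code; in a real script it would subclass tf.keras.callbacks.Callback, whose hooks have the same signatures as the method below):

```python
class ValLossLogger:
    """Sketch of a callback that surfaces validation results per epoch.

    In real use, subclass tf.keras.callbacks.Callback and pass an instance
    via fit_generator(..., callbacks=[ValLossLogger()]); Keras then calls
    on_epoch_end(epoch, logs) with logs containing 'val_loss'.
    """

    def __init__(self):
        self.val_history = []

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        val_loss = logs.get("val_loss")
        self.val_history.append(val_loss)
        # Print so you can see that validation actually finished
        print(f"epoch {epoch}: val_loss={val_loss}")
```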

The issue with your fit_generator call is how steps_per_epoch and validation_steps are used. Unless your training and validation data have the same size (number of images), they cannot have the same number of steps (I mean they "can", but you know what I mean).
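For illustration, with hypothetical sample counts (the real num_train_samples and num_val_samples come from the asker's dataset), the two step values come out different:

```python
# Hypothetical counts, purely to illustrate the step arithmetic
num_train_samples = 70000
num_val_samples = 10000
batch_size = 128

# Same formulas as in the question: floor division by the batch size
steps_per_epoch = num_train_samples // batch_size
validation_steps = num_val_samples // batch_size

print(steps_per_epoch, validation_steps)  # 546 78
```

So as long as the two sample counts differ, the two step counts differ as well.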

I really recommend you use a GPU for data like this, since training is taking far too long on the CPU. Try debugging the GPU error; it is worth it.


Comments

Yes, you are right... thank you for your answer! But steps_per_epoch and validation_steps evaluate to the same numbers, and I am wondering why. steps_per_epoch is defined as num_train_samples // 128 and validation_steps as num_val_samples // 128, so there should be a different number of steps, because num_train_samples and num_val_samples are different; I checked this. Or am I misunderstanding something?
steps_per_epoch is len(train_data) / batch_size, or equivalently len(train_generator). Similarly, for validation data it is len(validation_data) / batch_size, or len(validation_generator). If you use the wrong values, your model will either underfit (too few steps) or overfit (too many steps).
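As a sketch of the len(generator) route (BatchGenerator is a hypothetical stand-in; a real one would subclass keras.utils.Sequence and also implement __getitem__), the generator's __len__ reports one step per batch, counting the final partial batch:

```python
import math

class BatchGenerator:
    """Sketch of a batch generator; a real one subclasses keras.utils.Sequence."""

    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size

    def __len__(self):
        # One step per batch, rounding up so the last partial batch is included
        return math.ceil(self.n_samples / self.batch_size)

# Hypothetical counts for illustration
train_gen = BatchGenerator(70000, 128)
print(len(train_gen))  # 547
```

Note the ceil here versus floor division (//) in the question: len(generator) includes the trailing partial batch, while num_samples // batch_size drops it.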
Your calculations are correct. You can try using a larger batch_size and more epochs so that validation does not have to wait as long. A Google Colab GPU takes about 2-5 minutes to train 1005 epoch steps, but it may vary with your input.
Steps are used when you are training on batches of data. Batches are used because of limited computational resources: you simply cannot hold all 100k images in RAM, because it will blow up. To prevent this, batches are used. We use len(val_data) / batch_size for validation_steps because we want to validate all the validation images. If you use val_steps=1, then only one batch of size batch_size will be validated. Your validation is then not very valid, because it has not seen the other ~70k images of your data; validating a single batch gives a noisy, biased estimate of performance.
It is recommended to run a validation epoch after each training epoch, so that we can monitor whether the model is overfitting or underfitting; there is no need to run more than one validation epoch per training epoch. I think you mean running validation after the complete training process. If you do that, there is no difference between validation and test data. Validation data is unseen data used to measure the model's performance as its weights change. Of course we "can" run validation after the entire training and measure performance, but then, if there is over/underfitting, you would have to train again.