
I developed a model in Keras and have trained it quite a few times. At one point I forcefully stopped training, and since then I have been getting the following error:

Traceback (most recent call last):
  File "inception_resnet.py", line 246, in <module>
    callbacks=[checkpoint, saveEpochNumber])   ##
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 2042, in fit_generator
    class_weight=class_weight)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/engine/training.py", line 1762, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 2270, in __call__
    session = get_session()
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py", line 163, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1486, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 621, in __init__
    self._session = tf_session.TF_NewDeprecatedSession(opts, status)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/eh0/E27890/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

So the error is actually

tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Most probably the GPU memory is still occupied by the killed process; I can't even create a simple TensorFlow session.

I have seen an answer here, but when I execute the following command in the terminal,

export CUDA_VISIBLE_DEVICES=''

training starts, but without GPU acceleration.
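For context, this is why that export disables the GPU: any process started with CUDA_VISIBLE_DEVICES set to an empty string sees no CUDA devices at all, so TensorFlow silently falls back to the CPU. A minimal Python sketch of the same effect (the TensorFlow import is left commented out, since the point here is only the environment variable):

```python
import os

# Hiding all GPUs must happen BEFORE tensorflow is first imported; afterwards
# TF sees no CUDA devices and silently falls back to the (much slower) CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import tensorflow as tf   # would now report zero GPU devices
print(repr(os.environ["CUDA_VISIBLE_DEVICES"]))  # → ''
```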

Also, I am training the model on a server to which I have no root access, so I can't restart it or clear the GPU memory as root. What is the solution now?

3 Answers


I found the solution in a comment of this question.

nvidia-smi -q

This gives a list of all the processes (and their PIDs) occupying GPU memory. I killed them one by one using

kill -9 PID

Now everything is running smoothly again.
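As a sketch of how to pull the PIDs out of that output automatically (parsing a captured sample here, since the exact `nvidia-smi -q` layout varies by driver version):

```shell
# Extract PIDs from the "Processes" section of `nvidia-smi -q`.
# A captured sample is used below; pipe real output through the same filter.
sample='Processes
    Process ID                  : 12345
        Name                    : python
    Process ID                  : 67890
        Name                    : python'

pids=$(printf '%s\n' "$sample" | awk '/Process ID/ {print $4}')
printf '%s\n' "$pids"
# then: kill -9 $pids   (only for stale processes you own)
```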


4 Comments

I got the same error while my GPU usage was all zero. How do you find the processes taking GPU memory in the output of nvidia-smi -q?
Find the "Processes" section in the result, where you will find the process IDs. @K.Wanter
Thank you! It turns out my problem was caused by a driver version that was insufficient for the CUDA runtime version; updating the driver solved the problem.
The "Processes" section shows blank on my system, and nvidia-smi shows 0% GPU utilization. Is there any other way around this without restarting the server?

I am using Anaconda 4.5.12 with Python 3.5 and NVIDIA driver 390.116, and I faced the same issue. In my case it was caused by an incompatible cudatoolkit version:

conda install tensorflow-gpu

installed cudatoolkit 9.3.0 with cuDNN 7.3.x. However, after going through the answers here and referring to my other virtual environment, where I use PyTorch with the GPU without any problem, I inferred that cudatoolkit 9.0.0 would be compatible with my driver version.

conda install cudatoolkit==9.0.0

This installed cudatoolkit 9.0.0 and cuDNN 7.3.0 from the cuda 9.0_0 build. After this I was able to create a TensorFlow session with the GPU.
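As a rough illustration of that compatibility reasoning, here is a small lookup using minimum-driver values taken from NVIDIA's CUDA release notes (treat the table as illustrative, not authoritative, and check the notes for your exact toolkit):

```python
# Minimum Linux driver version required by each CUDA toolkit release
# (values from NVIDIA's CUDA compatibility tables; illustrative only).
MIN_DRIVER = {
    "9.0": (384, 81),
    "9.1": (387, 26),
    "9.2": (396, 26),
    "10.0": (410, 48),
}

def max_supported_cuda(driver: str) -> str:
    """Return the newest CUDA release in the table the given driver can run."""
    major, minor = (int(x) for x in driver.split(".")[:2])
    ok = [cuda for cuda, need in MIN_DRIVER.items() if (major, minor) >= need]
    return max(ok, key=lambda v: tuple(int(p) for p in v.split(".")))

print(max_supported_cuda("390.116"))  # → 9.1
```

This agrees with the answer's conclusion: driver 390.116 can run CUDA 9.0 (and 9.1), but not the newer toolkit conda pulled in by default.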

Now, coming to the options for killing jobs:

  • If GPU memory is occupied by other jobs, killing them one by one, as suggested by @Preetam saha arko, will free up the GPU and may allow you to create a tf session with it (provided the compatibility issues are resolved already).
  • To create a session on a specific GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi, and set the CUDA visible device to an available GPU ID (0 in this example):

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = '0'

    tf.Session() will then create a session on the specified GPU.

  • Otherwise, if nothing works with the GPU, kill the previous tf.Session() process after finding its PID from nvidia-smi, and set the CUDA visible device to undefined:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ''

    tf.Session() will then create a session on the CPU.



I had a similar problem while working on a cluster. When I submitted the job script to the Slurm server, it would run fine, but while training the model in a Jupyter notebook I would get the following error:

InternalError: Failed to create session

Reason: I was running multiple Jupyter notebooks on the same GPU (all of them using TensorFlow), so the Slurm server would refuse to create a new TensorFlow session. The problem was solved by stopping all the Jupyter notebooks and then running only one or two at a time.

Below is the error from the Jupyter notebook log:

Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 12786073600
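If the notebooks really must share one GPU, a common TF 1.x-era workaround is to cap each process's memory fraction so several sessions can coexist instead of the first one claiming everything. A sketch, with the TensorFlow calls commented out since they assume a TF 1.x install; the total-memory figure is taken from the log line above:

```python
# The notebook log reports the GPU's total memory in bytes:
TOTAL_BYTES = 12786073600  # from "total memory reported: 12786073600"

# Cap each notebook at a fraction of that, so e.g. three notebooks fit:
FRACTION = 0.3
cap = int(TOTAL_BYTES * FRACTION)
print(cap)  # → 3835822080

# TF 1.x usage (uncomment with tensorflow 1.x installed):
# import tensorflow as tf
# config = tf.ConfigProto()
# config.gpu_options.per_process_gpu_memory_fraction = FRACTION
# sess = tf.Session(config=config)
```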

