
I need to train a very large number of neural nets using TensorFlow with Python. My neural nets (MLPs) range from very small ones (~2 hidden layers with ~30 neurons each) to larger ones (3-4 layers with >500 neurons each).

I am able to run all of them sequentially on my GPU, which is fine, but my CPU is almost idle. Additionally, I found out that my CPU is quicker than the GPU for my very small nets (I assume because of the GPU overhead etc.). That's why I want to use both my CPU and my GPU in parallel to train my nets: the CPU should work through the networks from the smallest to the largest, and my GPU from the largest to the smallest, until they meet somewhere in the middle... I thought this was a good idea :-)

So I simply start my consumer twice, in different processes: one with device = CPU, the other with device = GPU. Both start and consume the first 2 nets as expected. But then the GPU consumer throws an exception saying that its tensor is accessed/violated by another process on the CPU(!), which I find weird, because it is supposed to run on the GPU...
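For reference, this is roughly what each consumer process does (a simplified sketch, not my actual code; the function name and config fields are just placeholders). I launch the same function in two separate processes, one with device_str = "/cpu:0" and one with device_str = "/gpu:0":

    import tensorflow as tf

    def consumer(net_configs, device_str):
        # device_str is "/cpu:0" in one process and "/gpu:0" in the other
        for cfg in net_configs:
            with tf.device(device_str):
                x = tf.placeholder(tf.float32, [None, cfg["n_in"]])
                y = tf.placeholder(tf.float32, [None, cfg["n_out"]])
                h = x
                for size in cfg["hidden"]:          # e.g. [30, 30] or [500, 500, 500]
                    h = tf.layers.dense(h, size, activation=tf.nn.relu)
                out = tf.layers.dense(h, cfg["n_out"])
                loss = tf.reduce_mean(tf.squared_difference(out, y))
                train_op = tf.train.AdamOptimizer().minimize(loss)

            with tf.Session() as sess:
                sess.run(tf.global_variables_initializer())
                for batch_x, batch_y in cfg["batches"]:
                    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})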

Can anybody help me fully segregate my two processes?

  • It won't work as you expect if you use both CPU and GPU for one net: they need to communicate, and that communication is much slower than the computation on the GPU. That is also why running a small net is slower on the GPU, since there is relatively more data transfer/communication. Commented Oct 25, 2017 at 9:51
  • Further, "3-4 Layers with >500 neurons each" is just a baby net. Commented Oct 25, 2017 at 9:55
  • @Sraw: Thanks for answering. But to clarify: I don't want to train ONE SINGLE net on CPU and GPU together. I want the CPU to train net1 and the GPU to train net2 independently, so there is no need for communication between the two processes. In fact, I want to completely segregate them. I just want to use the compute power of my idling CPU. Commented Oct 25, 2017 at 10:09
  • @Sraw: You're probably right about the baby nets, but my dataset isn't challenging enough for larger networks... Commented Oct 25, 2017 at 10:10

1 Answer


Do any of your networks share operators? E.g., do they use variables with the same name inside the same variable_scope, opened with variable_scope(..., reuse=True)?

If so, multiple nets will try to reuse the same underlying Tensor structures.
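For illustration, here is a minimal (hypothetical) example of how reuse=True makes two nets point at the same underlying variable, and how giving every net its own scope avoids that; the scope names and shapes are made up:

    import tensorflow as tf

    with tf.variable_scope("mlp"):
        w1 = tf.get_variable("w", shape=[30, 30])

    with tf.variable_scope("mlp", reuse=True):
        w2 = tf.get_variable("w", shape=[30, 30])   # returns the existing "mlp/w"

    print(w1 is w2)   # True: both nets would share this tensor

    # One way to avoid it: a unique scope (or a separate tf.Graph) per net
    with tf.variable_scope("net_cpu"):
        w_cpu = tf.get_variable("w", shape=[30, 30])
    with tf.variable_scope("net_gpu"):
        w_gpu = tf.get_variable("w", shape=[30, 30])  # distinct variable "net_gpu/w"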

Also check whether tf.ConfigProto.allow_soft_placement is set to True or False in your tf.Session. If it is True, there is no guarantee that the device placement is actually executed the way you intended in your code.
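As a rough sketch of how to check that, you can turn soft placement off and log where each op ends up; with allow_soft_placement=False, TensorFlow raises an error instead of silently moving an op to another device (the ops below are just examples):

    import tensorflow as tf

    config = tf.ConfigProto(
        allow_soft_placement=False,   # fail loudly instead of silently re-placing ops
        log_device_placement=True,    # print the device every op actually runs on
    )

    with tf.device("/gpu:0"):
        a = tf.random_normal([500, 500])
        b = tf.matmul(a, a)

    with tf.Session(config=config) as sess:
        sess.run(b)   # raises an error if /gpu:0 cannot execute these ops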
