Memory/computation on cuda / cpu

I'm running out of CUDA memory during training, and some aspects of it seem a bit odd. I'll explain the current setup and my thoughts on how to fix it, though I have a feeling it will not work.

Running on CUDA, I have:
3 networks: policy, q1 and q2. The q networks in particular are quite large and fairly complex.
A tensor for an entropy parameter (a size-1 tensor).

When entering the training loop, I sample a batch of size batch_size from a replay buffer. The batch is passed through the target nets on the CPU, with gradients disabled, to calculate the q_target values.

I then move the relevant parts of the batch to CUDA and run them through my q1 network. I also move the q_target results to CUDA (and yes, these are fully detached, just tensors of values).

I do mse_loss between the q1 outputs and the q_targets.

I then call backward() on the loss, then optim.step(), and I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 530.00 MiB (GPU 0; 4.00 GiB total capacity; 2.32 GiB already allocated; 518.99 MiB free; 9.51 MiB cached)
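
For reference, the update described above can be sketched roughly as follows. This is a simplified illustration and not the poster's actual code: the layer sizes, the discount `gamma`, the Adam optimizer, the random stand-in batch and the single-target update (no q2, no entropy term) are all assumptions.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Fall back to the CPU when no GPU is present so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical sizes and hyperparameters.
state_dim, action_dim, batch_size, gamma = 8, 2, 5, 0.99

q1 = nn.Linear(state_dim + action_dim, 1).to(device)  # stand-in for the large q network
q1_target = copy.deepcopy(q1).to("cpu")               # target net lives on the CPU
optim = torch.optim.Adam(q1.parameters(), lr=3e-4)    # optimizer choice is an assumption

# Random tensors standing in for a replay-buffer sample (kept on the CPU).
state = torch.randn(batch_size, state_dim)
action = torch.randn(batch_size, action_dim)
reward = torch.randn(batch_size, 1)
next_state = torch.randn(batch_size, state_dim)
next_action = torch.randn(batch_size, action_dim)

# Target values: computed on the CPU with no graph attached.
with torch.no_grad():
    q_target = reward + gamma * q1_target(torch.cat([next_state, next_action], dim=1))

# Prediction: inputs and targets are moved to the training device explicitly.
q_pred = q1(torch.cat([state, action], dim=1).to(device))
loss = F.mse_loss(q_pred, q_target.to(device))

optim.zero_grad()
loss.backward()  # gradients for q1 are allocated on `device`
optim.step()     # Adam creates its per-parameter state on the first step
```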

What I find curious is that these values stay roughly the same even if I bring the batch size down from 100 to 5, and the error still occurs if I remove the q2 network.

This suggests to me that the data tensors and networks sitting on the GPU don't affect the memory usage much, and that the size of the computation graph / optimisation is what's causing the problem?
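
A rough back-of-the-envelope check can help separate the two effects. Memory that does not change with batch size typically comes from the parameters themselves, their gradients (allocated during backward()) and any per-parameter optimizer state. The sketch below assumes float32 parameters and an Adam-style optimizer with two extra buffers per parameter; neither is stated in the post, and the helper name `training_memory_mib` is made up for illustration.

```python
def training_memory_mib(num_params, bytes_per_param=4, optimizer_buffers=2):
    """Estimate batch-size-independent training memory for one network, in MiB.

    Counts one copy of the weights, one gradient per weight, and
    `optimizer_buffers` extra tensors per weight (2 corresponds to Adam's two
    moment estimates; plain SGD without momentum would be 0).
    Activations are NOT included, since those scale with batch size.
    """
    copies = 1 + 1 + optimizer_buffers
    return num_params * bytes_per_param * copies / 2**20


# Illustration only: the number of float32 values that fit in a 530 MiB
# tensor, i.e. the size of the allocation reported in the error message.
values_in_530_mib = 530 * 2**20 // 4

weights_only = training_memory_mib(values_in_530_mib, optimizer_buffers=-1)
with_sgd = training_memory_mib(values_in_530_mib, optimizer_buffers=0)
with_adam = training_memory_mib(values_in_530_mib, optimizer_buffers=2)
print(weights_only, with_sgd, with_adam)
```

If the estimate for the real networks lands near the card's 4 GiB, shrinking the batch would not be expected to help, which is consistent with the behaviour described.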

If this is the case and I simply cannot run a network of this size on CUDA, I do have vastly more space on the CPU. But since I use the q network to evaluate the policy network (with the reparameterization trick), I assume gradients / computation graphs do not work across devices? And if I moved the q network to CUDA, or created a copy of it there, to run this, it would presumably fail as well, since the computation graph would now be even larger?

Any thoughts on how to resolve this would be greatly appreciated.


You can actually move stuff between cpu and cuda and the gradients will still be computed properly.
Note that the backward ops are performed on the same device where the forward op was done.
So moving the result of your GPU computation to the CPU won't reduce the GPU memory usage.

So, to clarify: suppose my loss function for the policy network were just loss = q1(state, policy_output), where policy_output was calculated on the GPU but q1 was on the CPU. I would move policy_output to the CPU (I think this would be required for it to function?), and a call to loss.backward() would then correctly backprop through the policy network?

My thanks for your help

Yes, you will have to move policy_output to the CPU yourself (PyTorch never moves data between the CPU and GPU automatically). But then yes, the backprop will work just fine. Note that in your particular case the q1 backprop will be computed on the CPU and the policy network backprop will be computed on the GPU.
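
A minimal sketch of that cross-device setup is shown below. The tiny `nn.Linear` stand-ins, the dimensions, and the `-q_value.mean()` loss are assumptions made for illustration, not the networks from the question. The script falls back to the CPU when no GPU is available, in which case both halves simply run on the same device.

```python
import torch
import torch.nn as nn

# Policy goes on the GPU when one is available; q1 always stays on the CPU.
policy_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

state_dim, action_dim = 8, 2  # hypothetical sizes

policy = nn.Linear(state_dim, action_dim).to(policy_device)
q1 = nn.Linear(state_dim + action_dim, 1)  # CPU by default

state = torch.randn(5, state_dim)  # batch held on the CPU
policy_output = policy(state.to(policy_device))

# The move to the CPU must be explicit. The copy is recorded in the autograd
# graph, so gradients flow back across the device boundary.
q_value = q1(torch.cat([state, policy_output.to("cpu")], dim=1))
loss = -q_value.mean()  # assumed objective: raise the Q-value of the policy's action
loss.backward()

# Each network's gradients live on the device where its forward pass ran.
policy_grad_devices = {p.grad.device.type for p in policy.parameters()}
q1_grad_devices = {p.grad.device.type for p in q1.parameters()}
print(policy_grad_devices, q1_grad_devices)
```

Since the backward pass for q1 runs on the CPU here, its gradients and any optimizer state for it take up host memory rather than GPU memory, at the cost of slower computation for that network.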