Sampling using a network vs memory usage

Hi,

I am trying to sample from a network (during training) in order to compute the loss function. However, I am getting RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58 when the number of sampling iterations is too large, which I don’t understand.

Here is the pseudo-code (sorry, the actual code is very large):

# sampling_network is pre-trained and is not updated during this training
for param in sampling_network.parameters():
    param.requires_grad = False

optimizer.zero_grad()

sampling_times = N
total_loss = 0.

sampling_inputs = main_network(inputs)
# do sampling
for sampling_time in range(sampling_times):
    prediction = sampling_network(sampling_inputs)
    loss = compute_loss(prediction)
    total_loss += loss

# fixed typo
total_loss.backward()
optimizer.step()

In my case, training is fine when N <= 5, but throws the “out of memory” error when N > 5.

What I don’t understand is that the memory usage should be independent of the setting of N, since the same sampling_network is simply called multiple times. Am I missing something?

Thanks,

What are you doing with total_loss?
Currently you are storing the computation graph in it.
If you just need it for printing, you should use:

total_loss += loss.item()

Or do you need it somewhere for a backward pass?
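To illustrate the difference (a minimal sketch with a hypothetical linear model standing in for your networks):

```python
import torch

model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)

total_tensor = 0.
total_scalar = 0.
for _ in range(3):
    loss = model(x).pow(2).mean()
    total_tensor = total_tensor + loss  # tensor: all three graphs stay alive
    total_scalar += loss.item()         # Python float: each graph can be freed

print(total_tensor.grad_fn is not None)  # True -> graph retained for backward
print(type(total_scalar))                # float -> no graph attached
```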

Sorry, typo, it should be

total_loss.backward()

Also fixed in main thread.

OK, that makes sense.
The memory usage won’t stay the same, since for each pass a new computation graph is created and stored.
You could call .backward() in the for loop and optimizer.step() outside of it.
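A minimal sketch of that pattern, using small hypothetical linear layers in place of your main_network / sampling_network and a mean-squared stand-in for compute_loss. Note that since sampling_inputs is shared across iterations, retain_graph=True is needed so the main_network part of the graph survives the repeated backward calls:

```python
import torch

# Hypothetical stand-ins for the networks in the original post
main_network = torch.nn.Linear(4, 4)
sampling_network = torch.nn.Linear(4, 1)
for param in sampling_network.parameters():
    param.requires_grad = False  # frozen, as in the original code

optimizer = torch.optim.SGD(main_network.parameters(), lr=0.1)
inputs = torch.randn(8, 4)
N = 10

optimizer.zero_grad()
sampling_inputs = main_network(inputs)
total_loss = 0.
for _ in range(N):
    prediction = sampling_network(sampling_inputs)
    loss = prediction.pow(2).mean()  # stand-in for compute_loss
    # backward frees this pass's sampling graph; retain_graph=True keeps
    # the shared main_network graph alive for the next backward call
    loss.backward(retain_graph=True)
    total_loss += loss.item()  # keep only a Python float for logging
optimizer.step()
```

Gradients accumulate in .grad across the backward calls, so the single optimizer.step() at the end sees the sum of the per-sample gradients.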

@ptrblck, thanks. If I understand correctly, calling .backward() within the loop and step() outside of it means the gradients are computed at every sampling step, and the trainable parameters are updated at the end of the sampling process. This should have exactly the same effect on learning, but be more memory efficient. Am I right?

Yes, you will save some memory but need more compute, since a backward pass is performed at every step.
Besides that, it should be identical.
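A quick sketch that checks the equivalence on a toy linear model (hypothetical setup, with two identical copies of the model):

```python
import torch

torch.manual_seed(0)
net_a = torch.nn.Linear(3, 1)
net_b = torch.nn.Linear(3, 1)
net_b.load_state_dict(net_a.state_dict())  # identical weights
x = torch.randn(5, 3)

# Approach 1: sum the losses, single backward at the end
total = sum(net_a(x).pow(2).mean() for _ in range(3))
total.backward()

# Approach 2: backward inside the loop, gradients accumulate in .grad
for _ in range(3):
    net_b(x).pow(2).mean().backward()

print(torch.allclose(net_a.weight.grad, net_b.weight.grad))  # True
```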

Cool, thanks. I understand it now.