Hi, I'm running into an out-of-memory error when I use a closure in my new optimizer, which needs the value of the loss f(x) to update the parameters x at each iteration. The scheme is:
g(x) = f'(x) / f(x)
x = x - a * g(x)
So I define a closure function before calling optimizer.step:
def closure():
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    return loss, outputs

loss, outputs = optimizer.step(closure)
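For reference, inside step the optimizer consumes the closure roughly like this. This is a simplified sketch of my update scheme above; the name RatioOptimizer is made up and the real code has more bookkeeping, but the closure handling is the same idea:

import torch

class RatioOptimizer(torch.optim.Optimizer):
    # Hypothetical sketch of my optimizer: x = x - a * f'(x) / f(x)
    def __init__(self, params, lr=0.1):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self, closure):
        # Re-enable grad so the closure can run forward/backward.
        with torch.enable_grad():
            loss, outputs = closure()
        loss_val = float(loss)  # scalar value of f(x)
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                # x = x - a * f'(x) / f(x)
                p.add_(p.grad, alpha=-group["lr"] / loss_val)
        return loss, outputs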
It works fine when I apply this optimizer to train a CNN on MNIST. However, when I use it to train ResNet-34 on CIFAR-10, even on an HPC cluster, the program is killed after a few iterations with an out-of-memory error.
I think the memory of the compute node (128 GB) should be large enough, and everything works fine when I switch to torch.optim.SGD with all other settings unchanged. The corresponding code for SGD is:
optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
So the only difference I can see is the use of the closure function in the new optimizer.
I have two questions:
- Did I use the closure function correctly? So far the new optimizer works fine on smaller datasets like MNIST, but since closures are rarely used in optimizers and there aren't many examples, I'm not sure I'm using it correctly.
- Is the out-of-memory error caused by the use of the closure function? It's not clear to me how the closure and optimizer.step interact in PyTorch, so I have no idea where this out-of-memory error comes from. For comparison, the standard closure pattern I found in the docs is below.
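The only built-in optimizer I know of that requires a closure is torch.optim.LBFGS, and in its documented pattern the closure returns just the loss:

# Closure pattern from the torch.optim.LBFGS docs;
# note the closure returns only the loss, not (loss, outputs).
optimizer = torch.optim.LBFGS(net.parameters(), lr=0.1)

def closure():
    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    return loss

loss = optimizer.step(closure)  # LBFGS.step returns the closure's loss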
Any help is highly appreciated! Thanks!