LBFGS gives a memory error even though the number of epochs is bigger than history_size

Hello, I tried to use the L-BFGS optimizer as
self.optimizer = optim.LBFGS(self.net.parameters(), max_iter=1)

As far as I know, it only stores history for the last history_size iterations. When I run

for epoch in range(30000):
    self.optimizer.step(closure)
    self.train_loss.append(closure().cpu())
    if (epoch % (n_print // self.iter) == 0) & (epoch > 0):
        print('[Epoch: %s], test_error: [%.4f, %.4f], %.3f seconds went by'
              % (self.iter * epoch, mean_loss, var_loss, time.time() - st_time))

it gives the following training record:

[Epoch: 1000], test_error: [0.0022, 0.0294], 172.912 seconds went by
[Epoch: 2000], test_error: [0.0009, 0.0135], 290.782 seconds went by
[Epoch: 3000], test_error: [0.0006, 0.0099], 407.872 seconds went by
[Epoch: 4000], test_error: [0.0006, 0.0082], 524.397 seconds went by
[Epoch: 5000], test_error: [0.0007, 0.0063], 640.674 seconds went by
[Epoch: 6000], test_error: [0.0007, 0.0056], 756.475 seconds went by
[Epoch: 7000], test_error: [0.0008, 0.0050], 840.142 seconds went by
[Epoch: 8000], test_error: [0.0008, 0.0044], 958.420 seconds went by
[Epoch: 9000], test_error: [0.0008, 0.0038], 1074.280 seconds went by
[Epoch: 10000], test_error: [0.0008, 0.0032], 1191.190 seconds went by
[Epoch: 11000], test_error: [0.0009, 0.0026], 1309.013 seconds went by
[Epoch: 12000], test_error: [0.0009, 0.0025], 1427.236 seconds went by
[Epoch: 13000], test_error: [0.0009, 0.0021], 1544.839 seconds went by
[Epoch: 14000], test_error: [0.0009, 0.0020], 1661.872 seconds went by
[Epoch: 15000], test_error: [0.0009, 0.0017], 1690.628 seconds went by
[Epoch: 16000], test_error: [0.0009, 0.0015], 1808.329 seconds went by
[Epoch: 17000], test_error: [0.0009, 0.0013], 1874.316 seconds went by
[Epoch: 18000], test_error: [0.0010, 0.0014], 1991.727 seconds went by
[Epoch: 19000], test_error: [0.0010, 0.0011], 2109.931 seconds went by
[Epoch: 20000], test_error: [0.0010, 0.0012], 2192.570 seconds went by
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-4-ad2c47ee49d3> in <module>
----> 1 model.train_model(lr=1e0, max_iter = 30000, n_print=1000)

<ipython-input-2-e72f5af10348> in train_model(self, lr, max_iter, n_print)
    150 
    151         for epoch in range(max_iter//self.iter):
--> 152             self.optimizer.step(closure)
    153 
    154             mean_ResNet = []

/usr/local/lib/python3.6/dist-packages/torch/optim/lbfgs.py in step(self, closure)
    430                     # the reason we do this: in a stochastic setting,
    431                     # no use to re-evaluate that function here
--> 432                     loss = float(closure())
    433                     flat_grad = self._gather_flat_grad()
    434                     opt_cond = flat_grad.abs().max() <= tolerance_grad

<ipython-input-2-e72f5af10348> in closure()
    133             #pde_loss
    134             u_x = grad(u.sum(), X, create_graph=True)[0]
--> 135             tau_x = grad(tau.sum(), X, create_graph=True)[0]
    136 
    137             loss_constitutive = ((k * u_x + tau) ** 2).mean()

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
    155     return Variable._execution_engine.run_backward(
    156         outputs, grad_outputs, retain_graph, create_graph,
--> 157         inputs, allow_unused)
    158 
    159 

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.90 GiB total capacity; 15.23 GiB already allocated; 23.62 MiB free; 92.33 MiB cached)

I think the memory error should have happened before epoch=100 or never happened at all. Am I missing something?
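As a quick check on this, one diagnostic I could add inside the loop (assuming the model lives on the GPU) is to print the allocator statistics every so often to see whether memory actually grows:

if epoch % 1000 == 0:
    # torch.cuda reports how much memory the caching allocator currently holds
    print('allocated: %.1f MiB, max allocated: %.1f MiB'
          % (torch.cuda.memory_allocated() / 2**20,
             torch.cuda.max_memory_allocated() / 2**20))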

On the other hand, I tried

self.optimizer = optim.LBFGS(self.net.parameters(), max_iter=500)
for epoch in range(600):
    self.optimizer.step(closure)
    self.train_loss.append(closure().cpu())
    print('[Epoch: %s], test_error: [%.4f, %.4f], %.3f seconds went by'
          % (self.iter * epoch, mean_loss, var_loss, time.time() - st_time))

In other words, I changed (max_iter, max_epoch) from (1,30000) to (500,600) to compare.
In both cases I set

np.random.seed(0)
torch.manual_seed(0)

to fix the random seed, but the two settings give different test errors. Why does this happen? Thanks for any help!

-------------------------------------------------- added --------------------------------------------------
For question 1, I found that the train_loss.append part is what raises the error, so

for epoch in range(30000):
    self.optimizer.step(closure)
    # self.train_loss.append(closure().cpu())

does not cause the memory error. How can I record my training loss without it occupying memory?

loss = self.optimizer.step(closure).data.cpu()
self.train_loss.append(loss)

resolved the problem. I'm waiting for an answer to the second question.

Hi,

The memory error in your first question happens because when you do self.train_loss.append(closure().cpu()) you save both the value of the loss and all the autograd metadata needed to compute gradients. You can also fix this with detach: self.train_loss.append(closure().cpu().detach()).
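A minimal sketch of what that looks like in your loop (LBFGS.step returns the loss from the first closure evaluation, so you can reuse it instead of calling closure() a second time):

loss = self.optimizer.step(closure)
# detach() drops the autograd graph, so only the value is kept around
self.train_loss.append(loss.detach().cpu())
# or keep just the Python float
# self.train_loss.append(loss.item())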

For the seed, you can check the notes on reproducibility, but the short answer is that it is very hard to get bit-for-bit reproducible results because of how floating-point arithmetic works. In your case, since the two settings perform the operations in different orders, it is impossible to get exactly the same result.
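For reference, the usual seeding setup looks roughly like this (the cudnn flags only matter for cudnn-backed ops), but even with all of it, switching (max_iter, max_epoch) changes the order in which the floating-point operations happen, so the two runs will still drift apart:

import numpy as np
import torch

np.random.seed(0)
torch.manual_seed(0)
# ask cudnn for deterministic kernels and disable autotuning
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False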

So the different results are caused by randomness? Got it, thank you so much!

Not randomness, but the fact that for floating-point numbers (a + b) + c != a + (b + c).
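You can see the same effect with plain Python floats:

>>> (0.1 + 0.2) + 0.3
0.6000000000000001
>>> 0.1 + (0.2 + 0.3)
0.6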
