Hello, I tried to use the L-BFGS optimizer as
self.optimizer = optim.LBFGS(self.net.parameters(),max_iter=1)
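For reference, the closure-based call pattern I'm following is roughly this minimal sketch (the toy net, X, y here are placeholders, not my real model):

```python
import torch
import torch.optim as optim

# Placeholder toy model and data
net = torch.nn.Linear(2, 1)
X = torch.randn(16, 2)
y = torch.randn(16, 1)

optimizer = optim.LBFGS(net.parameters(), max_iter=1)

def closure():
    # L-BFGS may call this several times per step
    optimizer.zero_grad()
    loss = ((net(X) - y) ** 2).mean()
    loss.backward()
    return loss

loss = optimizer.step(closure)  # step() returns the loss from its first closure call
```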
As far as I know, it only keeps the last history_size updates of curvature information. When I run
for epoch in range(30000):
    self.optimizer.step(closure)
    self.train_loss.append(closure().cpu())
    if (epoch % (n_print // self.iter) == 0) and (epoch > 0):
        print('[Epoch: %s], test_error: [%.4f, %.4f], %.3f seconds went by'
              % (self.iter * epoch, mean_loss, var_loss, time.time() - st_time))
it gives the following training record:
[Epoch: 1000], test_error: [0.0022, 0.0294], 172.912 seconds went by
[Epoch: 2000], test_error: [0.0009, 0.0135], 290.782 seconds went by
[Epoch: 3000], test_error: [0.0006, 0.0099], 407.872 seconds went by
[Epoch: 4000], test_error: [0.0006, 0.0082], 524.397 seconds went by
[Epoch: 5000], test_error: [0.0007, 0.0063], 640.674 seconds went by
[Epoch: 6000], test_error: [0.0007, 0.0056], 756.475 seconds went by
[Epoch: 7000], test_error: [0.0008, 0.0050], 840.142 seconds went by
[Epoch: 8000], test_error: [0.0008, 0.0044], 958.420 seconds went by
[Epoch: 9000], test_error: [0.0008, 0.0038], 1074.280 seconds went by
[Epoch: 10000], test_error: [0.0008, 0.0032], 1191.190 seconds went by
[Epoch: 11000], test_error: [0.0009, 0.0026], 1309.013 seconds went by
[Epoch: 12000], test_error: [0.0009, 0.0025], 1427.236 seconds went by
[Epoch: 13000], test_error: [0.0009, 0.0021], 1544.839 seconds went by
[Epoch: 14000], test_error: [0.0009, 0.0020], 1661.872 seconds went by
[Epoch: 15000], test_error: [0.0009, 0.0017], 1690.628 seconds went by
[Epoch: 16000], test_error: [0.0009, 0.0015], 1808.329 seconds went by
[Epoch: 17000], test_error: [0.0009, 0.0013], 1874.316 seconds went by
[Epoch: 18000], test_error: [0.0010, 0.0014], 1991.727 seconds went by
[Epoch: 19000], test_error: [0.0010, 0.0011], 2109.931 seconds went by
[Epoch: 20000], test_error: [0.0010, 0.0012], 2192.570 seconds went by
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-4-ad2c47ee49d3> in <module>
----> 1 model.train_model(lr=1e0, max_iter = 30000, n_print=1000)
<ipython-input-2-e72f5af10348> in train_model(self, lr, max_iter, n_print)
150
151 for epoch in range(max_iter//self.iter):
--> 152 self.optimizer.step(closure)
153
154 mean_ResNet = []
/usr/local/lib/python3.6/dist-packages/torch/optim/lbfgs.py in step(self, closure)
430 # the reason we do this: in a stochastic setting,
431 # no use to re-evaluate that function here
--> 432 loss = float(closure())
433 flat_grad = self._gather_flat_grad()
434 opt_cond = flat_grad.abs().max() <= tolerance_grad
<ipython-input-2-e72f5af10348> in closure()
133 #pde_loss
134 u_x = grad(u.sum(), X, create_graph=True)[0]
--> 135 tau_x = grad(tau.sum(), X, create_graph=True)[0]
136
137 loss_constitutive = ((k * u_x + tau) ** 2).mean()
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in grad(outputs, inputs, grad_outputs, retain_graph, create_graph, only_inputs, allow_unused)
155 return Variable._execution_engine.run_backward(
156 outputs, grad_outputs, retain_graph, create_graph,
--> 157 inputs, allow_unused)
158
159
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 15.90 GiB total capacity; 15.23 GiB already allocated; 23.62 MiB free; 92.33 MiB cached)
I think a memory error should have happened before epoch 100, or never; am I missing something?
On the other hand, I tried
self.optimizer = optim.LBFGS(self.net.parameters(), max_iter=500)
for epoch in range(600):
    self.optimizer.step(closure)
    self.train_loss.append(closure().cpu())
    print('[Epoch: %s], test_error: [%.4f, %.4f], %.3f seconds went by'
          % (self.iter * epoch, mean_loss, var_loss, time.time() - st_time))
In other words, I changed (max_iter, max_epoch) from (1, 30000) to (500, 600) to compare.
In both runs I set
np.random.seed(0)
torch.manual_seed(0)
to fix the random seed, but the two settings give different test errors. Why does this happen? Thank you for any help!
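For completeness, here is roughly how I fix the seeds (a sketch; I also added the CUDA variant and the cuDNN determinism flags, which I understand matter on GPU):

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU (and CUDA in recent versions)
    torch.cuda.manual_seed_all(seed)  # explicit CUDA seeding; no-op without a GPU
    # Trade some speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Sanity check: the same seed reproduces the same random draws
set_seed(0)
a = torch.rand(3)
set_seed(0)
b = torch.rand(3)
```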
-------------------- Edit --------------------
For question 1, I found that it is the train_loss.append part that raises the error, so

for epoch in range(30000):
    self.optimizer.step(closure)
    # self.train_loss.append(closure().cpu())

does not cause the memory error.
How can I record my training loss without this memory buildup?
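One idea I had (a sketch with a placeholder toy model, assuming step() returns the loss from its first closure evaluation): store a detached Python float via .item() instead of the tensor, so the autograd graph is not kept alive across epochs:

```python
import torch

# Placeholder toy model and data
net = torch.nn.Linear(2, 1)
X = torch.randn(8, 2)
y = torch.randn(8, 1)
optimizer = torch.optim.LBFGS(net.parameters(), max_iter=1)
train_loss = []

def closure():
    optimizer.zero_grad()
    loss = ((net(X) - y) ** 2).mean()
    loss.backward()
    return loss

for epoch in range(5):
    loss = optimizer.step(closure)  # reuse the loss step() already computed
    train_loss.append(loss.item())  # .item() stores a plain float, no graph retained
```

This also avoids the extra closure() call per epoch that my original loop was making just for logging.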