In the PyHessian package, there is a function for computing the Hessian-vector product (PyHessian/pyhessian/hessian.py at b3be908c9d6070ac91530d39231685dd0a5506f7 · amirgholami/PyHessian · GitHub).
What I don’t understand is why my GPU memory keeps going up inside the `for inputs in self.data:` loop.
def dataloader_hv_product(self, v):
    device = self.device
    num_data = 0  # count the number of datum points in the dataloader
    THv = [torch.zeros(p.size()).to(device) for p in self.params
           ]  # accumulate result
    for inputs in self.data:
        self.model.zero_grad()
        tmp_num_data = len(inputs)  #.size(0)
        outputs = self.model(inputs.to(device), training=True, compute_force=True)
        #print_cudamem("After forward pass")
        loss = self.criterion(pred=outputs, ref=inputs.to(device))
        loss.backward(create_graph=True)
        params, gradsH = get_params_grad(self.model)
        Hv = torch.autograd.grad(gradsH,
                                 params,
                                 grad_outputs=v,
                                 only_inputs=True,
                                 retain_graph=False)
        THv = [
            THv1 + Hv1 * float(tmp_num_data) + 0.
            for THv1, Hv1 in zip(THv, Hv)
        ]
        num_data += float(tmp_num_data)
        # 🚀 Delete unused tensors and clear memory
        print_cudamem("Before clean up")
        self.model.zero_grad(set_to_none=True)
        for p in params:
            p.grad = None
        for g in gradsH:
            g.grad = None
        for vi in v:
            vi.grad = None
        for tensor in [gradsH, Hv, params, loss, outputs, inputs]:
            if isinstance(tensor, list):
                for t in tensor:
                    del t
            del tensor
        gc.collect()
        torch.cuda.empty_cache()
        print_cudamem("After batch cleanup")
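For comparison, here is a minimal sketch of the same Hessian-vector product computed without `loss.backward(create_graph=True)`. The PyTorch documentation warns that calling `backward()` with `create_graph=True` creates a reference cycle between each parameter and its gradient, which can leak memory, and recommends `torch.autograd.grad` instead. Whether that is the cause of the growth here is not established; the toy `torch.nn.Linear` model, the MSE loss, and the random data below are stand-ins chosen for illustration and are not part of PyHessian or of my actual model.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for self.model, self.criterion and one batch.
model = torch.nn.Linear(3, 1)
x = torch.randn(8, 3)
y = torch.randn(8, 1)

params = [p for p in model.parameters() if p.requires_grad]
v = [torch.randn_like(p) for p in params]  # direction vector, one entry per parameter

loss = torch.nn.functional.mse_loss(model(x), y)

# First-order gradients with a graph attached, but without writing to p.grad,
# so no parameter <-> gradient reference cycle is created.
grads = torch.autograd.grad(loss, params, create_graph=True)

# Hessian-vector product: differentiate <grads, v> with respect to the parameters.
Hv = torch.autograd.grad(grads, params, grad_outputs=v)

# Detach before accumulating so the running sum holds no autograd history.
Hv = [h.detach() for h in Hv]
```

Because nothing is stored in `p.grad`, the graph can be released as soon as `loss`, `grads` and `Hv` go out of scope or are rebound in the next batch.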
There is also an outer loop, which prints `Running {i} th iteration`, that calls `dataloader_hv_product`. The output is:
Before clean up: Allocated: 1.844 GB, Max allocated: 4.380 GB, Cached: 4.857 GB
After batch cleanup: Allocated: 1.843 GB, Max allocated: 4.380 GB, Cached: 2.336 GB
Before clean up: Allocated: 2.659 GB, Max allocated: 5.269 GB, Cached: 5.664 GB
After batch cleanup: Allocated: 2.658 GB, Max allocated: 5.269 GB, Cached: 3.387 GB
Running 1 th iteration
Before clean up: Allocated: 3.441 GB, Max allocated: 5.974 GB, Cached: 6.451 GB
After batch cleanup: Allocated: 3.440 GB, Max allocated: 5.974 GB, Cached: 4.461 GB
Before clean up: Allocated: 4.252 GB, Max allocated: 6.864 GB, Cached: 7.351 GB
After batch cleanup: Allocated: 4.251 GB, Max allocated: 6.864 GB, Cached: 5.400 GB
Running 2 th iteration
Before clean up: Allocated: 5.032 GB, Max allocated: 7.567 GB, Cached: 8.043 GB
After batch cleanup: Allocated: 5.032 GB, Max allocated: 7.567 GB, Cached: 6.203 GB
Before clean up: Allocated: 5.844 GB, Max allocated: 8.457 GB, Cached: 8.944 GB
After batch cleanup: Allocated: 5.844 GB, Max allocated: 8.457 GB, Cached: 7.051 GB
Running 3 th iteration
Before clean up: Allocated: 6.624 GB, Max allocated: 9.160 GB, Cached: 9.649 GB
After batch cleanup: Allocated: 6.623 GB, Max allocated: 9.160 GB, Cached: 7.829 GB
Before clean up: Allocated: 7.436 GB, Max allocated: 10.049 GB, Cached: 10.540 GB
After batch cleanup: Allocated: 7.436 GB, Max allocated: 10.049 GB, Cached: 8.703 GB
Running 4 th iteration
Before clean up: Allocated: 8.217 GB, Max allocated: 10.748 GB, Cached: 11.232 GB
After batch cleanup: Allocated: 8.216 GB, Max allocated: 10.748 GB, Cached: 9.357 GB
Before clean up: Allocated: 9.031 GB, Max allocated: 11.647 GB, Cached: 12.138 GB
After batch cleanup: Allocated: 9.030 GB, Max allocated: 11.647 GB, Cached: 10.347 GB
I added some manual gc and cache cleanup, but it doesn’t help much. It would be helpful if someone could point out a clear issue in my code.
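One thing worth noting about the manual cleanup: in Python, `del t` inside a `for t in ...` loop only unbinds the loop variable, so it cannot free anything while the list (or any other name) still refers to the objects. The sketch below shows this in plain Python; `Blob` is a hypothetical stand-in for a tensor and is not part of PyTorch or PyHessian.

```python
import gc
import weakref


class Blob:
    """Hypothetical stand-in for a large tensor."""


blobs = [Blob(), Blob()]
refs = [weakref.ref(b) for b in blobs]  # weak references do not keep objects alive

# Same pattern as the cleanup loop: `del t` removes only the name `t`.
for t in blobs:
    del t
gc.collect()
alive_after_del_loop = [r() is not None for r in refs]

# Dropping the last real reference (the list itself) is what releases the objects.
del blobs
gc.collect()
alive_after_del_list = [r() is not None for r in refs]

print(alive_after_del_loop)  # [True, True]
print(alive_after_del_list)  # [False, False]
```

So the `del` calls in the cleanup section are effectively no-ops, and whatever is holding the memory must be referenced from somewhere else.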