CUDA memory issue in Hessian vector product

In the PyHessian package, there is a function for computing the Hessian-vector product over a dataloader (PyHessian/pyhessian/hessian.py at b3be908c9d6070ac91530d39231685dd0a5506f7 · amirgholami/PyHessian · GitHub). My adapted version of it is below.

What I don't understand is why GPU memory keeps going up across iterations of the for inputs in self.data: loop.

    def dataloader_hv_product(self, v):
        device = self.device
        num_data = 0  # count the number of datum points in the dataloader

        THv = [torch.zeros(p.size()).to(device) for p in self.params]  # accumulate result
        for inputs in self.data:
            self.model.zero_grad()
            tmp_num_data = len(inputs)  # batch size
            outputs = self.model(inputs.to(device), training=True, compute_force=True)
            #print_cudamem("After forward pass")
            loss = self.criterion(pred=outputs, ref=inputs.to(device))
            loss.backward(create_graph=True)
            params, gradsH = get_params_grad(self.model)
            Hv = torch.autograd.grad(gradsH,
                                     params,
                                     grad_outputs=v,
                                     only_inputs=True,
                                     retain_graph=False)
            THv = [
                THv1 + Hv1 * float(tmp_num_data) + 0.
                for THv1, Hv1 in zip(THv, Hv)
            ]
            num_data += float(tmp_num_data)

            # 🚀 Delete unused tensors and clear memory
            print_cudamem("Before clean up")
            self.model.zero_grad(set_to_none=True)
            for p in params:
                p.grad = None
            for g in gradsH:
                g.grad = None
            for vi in v:
                vi.grad = None
            for tensor in [gradsH, Hv, params, loss, outputs, inputs]:
                if isinstance(tensor, list):
                    for t in tensor:
                        del t
                del tensor
            gc.collect()
            torch.cuda.empty_cache()
            print_cudamem("After batch cleanup")

There is also an outer loop calling this dataloader_hv_product, which prints Running {i} th iteration on each pass; a rough sketch of it is below.
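
The outer loop is essentially the power iteration from PyHessian's eigenvalues(); roughly it looks like this (a sketch, not my exact code: max_iter is a placeholder, and dataloader_hv_product is assumed to divide THv by num_data and return the eigenvalue estimate and THv at the end, as in the original PyHessian code, which I left out of the snippet above):

    from pyhessian.utils import normalization

    v = [torch.randn(p.size()).to(self.device) for p in self.params]
    v = normalization(v)  # rescale the list of tensors to unit norm

    for i in range(max_iter):  # max_iter is a placeholder
        eigenvalue, THv = self.dataloader_hv_product(v)
        v = normalization(THv)  # next direction for the power iteration
        print("Running {} th iteration".format(i + 1))
        # (convergence check on the eigenvalue omitted)

The memory readings across batches and outer iterations look like this: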

Before clean up: Allocated: 1.844 GB, Max allocated: 4.380 GB, Cached: 4.857 GB
After batch cleanup: Allocated: 1.843 GB, Max allocated: 4.380 GB, Cached: 2.336 GB
Before clean up: Allocated: 2.659 GB, Max allocated: 5.269 GB, Cached: 5.664 GB
After batch cleanup: Allocated: 2.658 GB, Max allocated: 5.269 GB, Cached: 3.387 GB
Running 1 th iteration
Before clean up: Allocated: 3.441 GB, Max allocated: 5.974 GB, Cached: 6.451 GB
After batch cleanup: Allocated: 3.440 GB, Max allocated: 5.974 GB, Cached: 4.461 GB
Before clean up: Allocated: 4.252 GB, Max allocated: 6.864 GB, Cached: 7.351 GB
After batch cleanup: Allocated: 4.251 GB, Max allocated: 6.864 GB, Cached: 5.400 GB
Running 2 th iteration
Before clean up: Allocated: 5.032 GB, Max allocated: 7.567 GB, Cached: 8.043 GB
After batch cleanup: Allocated: 5.032 GB, Max allocated: 7.567 GB, Cached: 6.203 GB
Before clean up: Allocated: 5.844 GB, Max allocated: 8.457 GB, Cached: 8.944 GB
After batch cleanup: Allocated: 5.844 GB, Max allocated: 8.457 GB, Cached: 7.051 GB
Running 3 th iteration
Before clean up: Allocated: 6.624 GB, Max allocated: 9.160 GB, Cached: 9.649 GB
After batch cleanup: Allocated: 6.623 GB, Max allocated: 9.160 GB, Cached: 7.829 GB
Before clean up: Allocated: 7.436 GB, Max allocated: 10.049 GB, Cached: 10.540 GB
After batch cleanup: Allocated: 7.436 GB, Max allocated: 10.049 GB, Cached: 8.703 GB
Running 4 th iteration
Before clean up: Allocated: 8.217 GB, Max allocated: 10.748 GB, Cached: 11.232 GB
After batch cleanup: Allocated: 8.216 GB, Max allocated: 10.748 GB, Cached: 9.357 GB
Before clean up: Allocated: 9.031 GB, Max allocated: 11.647 GB, Cached: 12.138 GB
After batch cleanup: Allocated: 9.030 GB, Max allocated: 11.647 GB, Cached: 10.347 GB

I added some manual gc and cache cleanup, but it doesn't help much. It would be helpful if someone could point out what is clearly wrong in my code.