Out of memory during validation even when using torch.no_grad()

My training code runs fine with around 8 GB of GPU memory, but when it enters validation it throws an out-of-memory error on a 16 GB GPU. I am already using model.eval() and torch.no_grad(), but I still get the same error. Here is the test code I use during validation, for reference.

def test(self):
    self.netG1.eval()
    self.netG2.eval()
    with torch.no_grad():
        self.SR = self.netG1(self.var_L)
        self.LR_est = self.netG2(self.SR)
        self.DR = self.netG2(self.var_H)
        self.HR_est = self.netG1(self.DR)
    self.netG1.train()
    self.netG2.train()

def get_current_visuals(self, need_HR=True):
    out_dict = OrderedDict()
    #out_dict['LR'] = self.var_L.detach()[0].float().cpu()
    out_dict['SR'] = self.SR.detach()[0].float().cpu()
    #out_dict['LR_est'] = self.LR_est.detach()[0].float().cpu()
    if need_HR:
        out_dict['HR'] = self.var_H.detach()[0].float().cpu()
        #out_dict['DR'] = self.DR.detach()[0].float().cpu()
        #out_dict['HR_Est'] = self.HR_est.detach()[0].float().cpu()
    return out_dict

And my validation code is as below.
if current_step % opt['train']['val_freq'] == 0:
    avg_psnr = 0.0
    idx = 0
    for val_data in val_loader:
        idx += 1
        img_name = os.path.splitext(os.path.basename(val_data['LR_path'][0]))[0]
        img_dir = os.path.join(opt['path']['val_images'], img_name)
        util.mkdir(img_dir)

        model.feed_data(val_data)
        model.test()

        visuals = model.get_current_visuals()
        sr_img = util.tensor2img(visuals['SR'])  # uint8
        gt_img = util.tensor2img(visuals['HR'])  # uint8

        ...
I would appreciate any solution. Thanks in advance.

Which step is causing the OOM issue, and is the memory usage at 8 GB when you enter the test method?
Did you increase the batch size of self.var_L (and other inputs)?
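
To narrow it down, you could log the allocated memory around each forward pass. Here is a minimal sketch based on your posted test method, assuming the same attribute names (the log_mem helper and its messages are just illustrative):

def test(self):
    self.netG1.eval()
    self.netG2.eval()

    def log_mem(tag):
        # bytes currently held by tensors on the default CUDA device
        print(f'{tag}: {torch.cuda.memory_allocated() / 1024**2:.1f} MB allocated')

    with torch.no_grad():
        log_mem('before netG1(var_L)')
        self.SR = self.netG1(self.var_L)
        log_mem('after netG1(var_L)')
        self.LR_est = self.netG2(self.SR)
        log_mem('after netG2(SR)')
        self.DR = self.netG2(self.var_H)
        log_mem('after netG2(var_H)')
        self.HR_est = self.netG1(self.DR)
        log_mem('after netG1(DR)')
    self.netG1.train()
    self.netG2.train()

The value printed right before the failing call should tell you how much memory the training step is still holding when validation starts.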

Thank you for your quick response.
When the training step is executed, it consumes less than 8 GB. But when it goes into validation, the error is thrown at the model.test() step, and in the test code in particular at the self.LR_est = self.netG2(self.SR) line.

For now I have stopped using both models in validation and it works with only one model, but with two models it gives me the error.

Inside the test method, it consumes more and more memory with each iteration until all 16 GB are used, and then it throws the error.

For testing I am only using a batch size of 1.

This points towards a tensor being stored that is still attached to the computation graph.
I cannot find anything obviously wrong in the posted code snippet, but could you check whether you are storing some outputs with a valid .grad_fn in a list/dict etc.?
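
As a minimal, self-contained illustration of what to look for (none of this is your code; the names are made up):

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Linear(1024, 1024).to(device)
history = []

for _ in range(10):
    x = torch.randn(64, 1024, device=device)
    out = model(x)
    # BAD: 'out' carries a grad_fn, so appending it keeps its whole graph alive
    history.append(out)
    # GOOD: detach (and optionally move to the CPU) before storing
    # history.append(out.detach().cpu())

# A non-None grad_fn means the stored tensor still references the graph
print(history[0].grad_fn)

The same thing can happen with a running loss, a logging dict, or tensors cached as attributes outside of the no_grad() block.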

def get_current_visuals(self, need_HR=True):
    out_dict = OrderedDict()
    #out_dict['LR'] = self.var_L.detach()[0].float().cpu()
    out_dict['SR'] = self.SR.detach()[0].float().cpu()
    #out_dict['LR_est'] = self.LR_est.detach()[0].float().cpu()
    if need_HR:
        out_dict['HR'] = self.var_H.detach()[0].float().cpu()
        #out_dict['DR'] = self.DR.detach()[0].float().cpu()
        #out_dict['HR_Est'] = self.HR_est.detach()[0].float().cpu()
    return out_dict

This is my code for storing the data from validation. I am already using detach here as well.

Please check this thread: Increase the CUDA memory twice then stop increasing
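
For reference, a common source of creeping allocation during evaluation is accumulating CUDA tensors (e.g., a running loss or metric) instead of plain Python numbers. A generic sketch of the pattern and fix, not taken from your code:

running_loss = 0.0
for lr_img, hr_img in val_loader:          # hypothetical loader yielding input/target pairs
    with torch.no_grad():
        sr_img = model(lr_img)
    loss = criterion(sr_img, hr_img)
    # running_loss += loss        # keeps a CUDA tensor (and, outside no_grad, its graph) alive
    running_loss += loss.item()   # a Python float lets the tensor be freed each iteration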