RAM keeps increasing during inference [SOLVED]

Hi all,

I’m encountering a problem where my RAM keeps increasing during inference of multiple models (the GPU memory is released, though).
I’ve trained 6 binary classification models, and now I’m trying to run inference with all 6 models one after the other. For some reason my RAM keeps increasing, as if I had a memory leak somewhere in my code, but I just don’t know where.

Each of the 6 inference models is an I3D, and I’m passing the output of its last layer into a model that produces 6 outputs (I’m ensembling the 6 inference models).

I called .eval() on all my inference models and made sure requires_grad = False is set on their parameters too.

I tried both torch.backends.cudnn.benchmark = False and torch.backends.cudnn.benchmark = True; either way, the RAM kept exploding.

This is the code i’m using for inference:

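import torch
from collections import OrderedDict
from torch.autograd import Variable
# I3D, Unit3Dpy and convert_to_cuda come from the poster's own code (not shown)

def load_inference_model(kinetics_weights, checkpoint_weights, use_half, use_dataparallel, convert_to_cuda_before_loading_weights=False):
    model = I3D(num_classes=400, modality='rgb', use_spatial=True)

    if kinetics_weights is not None:
        # assumed: the body of this branch was cut off in the post
        model.load_state_dict(torch.load(kinetics_weights))

    # replace the 400-class Kinetics head with a 2-class (binary) head
    model.conv3d_0c_1x1 = Unit3Dpy(
        in_channels=1024,  # assumed: the rest of this call was cut off in the post
        out_channels=2,
        kernel_size=(1, 1, 1))

    if convert_to_cuda_before_loading_weights:
        state_dict = torch.load(checkpoint_weights)["state_dict"]
        # create a new OrderedDict without the `module.` prefix added by DataParallel
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            name = k[7:]  # remove `module.`
            new_state_dict[name] = v
        state_dict = new_state_dict
    else:
        state_dict = torch.load(checkpoint_weights)["state_dict"]
    model.load_state_dict(state_dict)
    model = convert_to_cuda(model, use_half, use_dataparallel)

    model.eval()
    # model.parameters() works whether or not the model is wrapped in DataParallel
    for p in model.parameters():
        p.requires_grad = False

    return model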

def do_single_inference(inputs, model, mean, std, use_half):
    normalized_input = (inputs - mean) / std

    if use_half:
        inputs_var = Variable(normalized_input.cuda().half(), volatile=True)
    else:
        inputs_var = Variable(normalized_input.cuda(), volatile=True)

    score, softmax = model(inputs_var)

    # return a CPU float tensor so it can be written into the CPU `scores` tensor
    return score.data.float().cpu()

def do_inference(inputs, models, means, stds, use_half):
    # 2 scores per binary model: 6 models x 2 = 12 columns, one row per sample
    scores = torch.zeros([inputs.size(0), 2 * len(models)])

    # inference with the 6 models, one after the other
    for k, (model, mean, std) in enumerate(zip(models, means, stds)):
        scores[:, 2 * k:2 * k + 2] = do_single_inference(inputs, model, mean, std, use_half)

    return scores

def train_micro_batches(epoch, model, ensemble_models, models_mean, models_std, steps_per_epoch, num_micro_batches, data_loader, use_half):
    # `criterion` is assumed to be defined at module level; the backward pass
    # and optimizer step are not shown in the post
    stateful_metrics = ["Loss", "Acc"]
    progress_bar = ProgressBar(steps_per_epoch, stateful_metrics=stateful_metrics)
    running_loss = 0.0
    running_corrects = 0
    total = 0
    loss_avg = 0.0
    acc_avg = 0.0
    data_loader_iter = iter(data_loader)

    for i in range(steps_per_epoch):
        batch_loss_value = 0

        for j in range(num_micro_batches):
            inputs, targets = next(data_loader_iter)
            scores = do_inference(inputs, ensemble_models, models_mean, models_std, use_half)

            if use_half:
                scores, targets = Variable(scores.cuda().half()), Variable(targets.cuda())
            else:
                scores, targets = Variable(scores.cuda()), Variable(targets.cuda())

            score = model(scores)
            loss = criterion(score, targets)

            if use_half:
                batch_loss_value += loss.data.cpu()[0]
            else:
                batch_loss_value += loss.data.cpu().numpy()[0]

            _, predicted = torch.max(score.data, 1)
            total += targets.size(0)
            running_corrects += predicted.eq(targets.data).cpu().sum()

        running_loss += batch_loss_value / num_micro_batches

        acc_avg = running_corrects / total
        loss_avg = running_loss / (i + 1)

        vals = [("Loss", "{:0.4f}".format(loss_avg)), ("Acc", "{:0.4f}".format(acc_avg))]

        progress_bar.update(i + 1, vals)
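On PyTorch 0.4 and later, volatile no longer has any effect, so the stage 1 part of this code (do_single_inference plus do_inference) would be written with torch.no_grad() instead. Below is a minimal sketch of that, not the poster's code: ensemble_scores is a hypothetical name, small nn.Linear layers stand in for the I3D models, and each model is assumed to return a single tensor (the I3D in the post returns (score, softmax), so you would take the first element).

```python
import torch
import torch.nn as nn


@torch.no_grad()  # nothing inside this function records an autograd graph
def ensemble_scores(inputs, models, means, stds):
    """Run each frozen model on its own normalized copy of `inputs` and
    concatenate the outputs into a single (batch, 2 * len(models)) CPU tensor."""
    outputs = []
    for model, mean, std in zip(models, means, stds):
        model.eval()  # inference behavior for dropout and batch norm
        device = next(model.parameters()).device
        x = ((inputs - mean) / std).to(device)
        out = model(x)
        # .float().cpu() gives a plain CPU tensor (and converts half outputs)
        outputs.append(out.float().cpu())
    return torch.cat(outputs, dim=1)


# Hypothetical stand-ins for the six binary I3D models (each returns one tensor)
models = [nn.Linear(8, 2) for _ in range(6)]
means = [torch.zeros(8) for _ in range(6)]
stds = [torch.ones(8) for _ in range(6)]
scores = ensemble_scores(torch.randn(4, 8), models, means, stds)
print(scores.shape)          # torch.Size([4, 12])
print(scores.requires_grad)  # False
```

Sizing the output from inputs.size(0) rather than a hard-coded 4 also keeps a smaller last batch from breaking the column assignment.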

I haven’t found the memory issue yet, but for now you could try to split the two stages of your training.
Basically, you would run the inference on your stage 1 models, save the scores, and keep an eye on the memory usage.
Then you would load these scores in another script and train your stage 2 model.
Since you don’t need to backpropagate through stage 1, this shouldn’t be a problem.

Would this be an option? You could narrow the memory leak to one of the two stages.
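The split suggested above could look roughly like this: stage 1 runs once under torch.no_grad(), keeps only CPU tensors, and writes them to disk with torch.save; stage 2 reloads them with torch.load. This is a sketch under assumptions: precompute_scores and the file name are hypothetical, and small nn.Linear models plus an in-memory list of batches stand in for the I3D models and the DataLoader.

```python
import os
import tempfile

import torch
import torch.nn as nn


@torch.no_grad()
def precompute_scores(models, loader):
    """Stage 1: run the frozen models once and keep only CPU score tensors."""
    all_scores, all_targets = [], []
    for inputs, targets in loader:
        # Module.eval() returns the module itself, so it can be called directly
        scores = torch.cat([m.eval()(inputs) for m in models], dim=1)
        all_scores.append(scores.cpu())
        all_targets.append(targets)
    return torch.cat(all_scores), torch.cat(all_targets)


models = [nn.Linear(8, 2) for _ in range(6)]  # stand-ins for the I3D models
loader = [(torch.randn(4, 8), torch.randint(0, 6, (4,))) for _ in range(5)]
scores, targets = precompute_scores(models, loader)

path = os.path.join(tempfile.mkdtemp(), "stage1_scores.pt")  # hypothetical file name
torch.save({"scores": scores, "targets": targets}, path)

# Stage 2 (normally a separate script): load the saved scores and train on them
saved = torch.load(path)
print(saved["scores"].shape)  # torch.Size([20, 12])
```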

I tried your suggestion, and it looks like my memory leak is in stage 1, but I have no idea where.

OK, so we are hunting it down. 😉

I’ve created a sample code snippet and tried to simulate your use case:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

class SmallModel(nn.Module):
    def __init__(self):
        super(SmallModel, self).__init__()
        self.act = nn.ReLU()
        self.fc1 = nn.Linear(1000, 100)
        self.fc2 = nn.Linear(100, 2)
    def forward(self, x):
        x = self.act(self.fc1(x))
        x = F.log_softmax(self.fc2(x), dim=1)
        return x

class NetEnsemble(nn.Module):
    def __init__(self, models):
        super(NetEnsemble, self).__init__()
        self.ensemble_models = nn.ModuleList(models)
        self.fc = nn.Linear(12, 6)
        self.mean = torch.randn(1000).cuda()
        self.std = torch.randn(1000).cuda()
    def forward(self, x):
        log_softmaxes = []
        with torch.no_grad():
            x_normalized = (x - Variable(self.mean)) / Variable(self.std)
            for model in self.ensemble_models:
                log_softmax = model(x_normalized)
                log_softmaxes.append(log_softmax)
        out = torch.cat(log_softmaxes, 1)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return F.log_softmax(out, dim=1)

y = Variable(torch.LongTensor(10).random_(6).cuda())
x = Variable(torch.randn(10, 1000).cuda())

models = [SmallModel() for i in range(6)]
ens_model = NetEnsemble(models).cuda()  # move the stage 1 models and fc to the GPU

criterion = nn.NLLLoss()
optimizer = optim.Adam(ens_model.fc.parameters())  # only the stage 2 layer is trained

mem_alloc = torch.cuda.memory_allocated()
mem_cache = torch.cuda.memory_cached()

epochs = 10000
for epoch in range(epochs):
    optimizer.zero_grad()
    output = ens_model(x)

    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

    print('Stage2 grad: {}'.format(ens_model.fc.weight.grad))
    print('Stage1 grad: {}'.format(ens_model.ensemble_models[0].fc1.weight.grad))
    print('Epoch {} Loss {}'.format(epoch, loss.item()))
    print('New allocated memory {}\tnew cached memory {}'.format(
        torch.cuda.memory_allocated() - mem_alloc,
        torch.cuda.memory_cached() - mem_cache))
    mem_alloc = torch.cuda.memory_allocated()
    mem_cache = torch.cuda.memory_cached()

Could you check if your stage 1 models somehow get gradients?
The volatile flag should take care of this, so I doubt it’s the issue, but we have to track down the leak somehow.
I’ve built PyTorch from source, which is why I’m using with torch.no_grad() instead of volatile.

I’ll try running your code, replacing with torch.no_grad() with volatile=True, since the PyTorch I’m using is not compiled from source.

How can I check if my stage 1 models get gradients?

You could adapt this line to your model structure:

print('Stage1 grad: {}'.format(ens_model.ensemble_models[0].fc1.weight.grad))

It should return None if no gradients were calculated.
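To check every parameter at once rather than a single weight, a small helper can list which parameters received a gradient after backward(). A sketch only: params_with_grad is a hypothetical helper name, and the two nn.Linear layers stand in for a frozen stage 1 model and the trainable stage 2 layer.

```python
import torch
import torch.nn as nn


def params_with_grad(model):
    """Return the names of parameters whose .grad has been populated."""
    return [name for name, p in model.named_parameters() if p.grad is not None]


frozen = nn.Linear(8, 2)  # stand-in for a stage 1 model
for p in frozen.parameters():
    p.requires_grad = False
head = nn.Linear(2, 3)  # stand-in for the trainable stage 2 layer

x = torch.randn(4, 8)
head(frozen(x)).sum().backward()

print(params_with_grad(frozen))  # []
print(params_with_grad(head))    # ['weight', 'bias']
```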

Cool, I’ll try that.
I tried running your code and came across a problem:
AttributeError: module 'torch.cuda' has no attribute 'memory_allocated' - I guess it’s because my PyTorch is not compiled from source?

Yeah, this method was probably added in a later version.
You could just skip it for the moment and have a look at the memory usage in nvidia-smi.
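nvidia-smi only reports GPU memory, while the leak in this thread is in CPU RAM. One way to watch that from inside the script is to log the process’s resident set size (RSS) on every iteration. A minimal sketch: psutil is a third-party package that is assumed to be installed; without it, the fallback reads /proc/self/status, which only exists on Linux, so on Windows you would need psutil (or Task Manager).

```python
import os


def current_rss_mb():
    """Resident set size of this process in MB (the RAM the OS charges to it).

    Uses psutil if it is installed; otherwise reads /proc, which is Linux-only."""
    try:
        import psutil
        return psutil.Process(os.getpid()).memory_info().rss / 2**20
    except ImportError:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # VmRSS is reported in kB
    raise RuntimeError("could not determine RSS on this platform")


for step in range(3):
    # ... run one inference step here ...
    print("step {}: RSS {:.1f} MB".format(step, current_rss_mb()))
```

A steadily growing number across steps, with nothing being accumulated on purpose, is the signature of the leak described in this thread.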

Traceback (most recent call last):
  File "test.py", line 69, in <module>
    loss.backward()
  File "C:\Users\yana\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\torch\autograd\variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "C:\Users\yana\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: element 0 of variables tuple is volatile

I’m getting this error now.

Maybe I should add:
out.volatile = False
out.requires_grad = True

I ran the test with out.volatile = False and out.requires_grad = True added, and it worked fine.

The results: the GPU RAM is not increasing, and neither is the CPU RAM.

I also added the stage 1 print: the grad is None, and my CPU RAM keeps increasing.
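The RuntimeError above appears because in PyTorch 0.3 volatile propagates: any operation with a volatile input produces a volatile output, even after it passes through layers whose weights require gradients. torch.no_grad() in 0.4 and later only affects operations inside the with block, so the trainable layer after it builds a graph again and backward() works without setting requires_grad by hand. A minimal sketch, assuming PyTorch 0.4+; FrozenFeaturesHead is a hypothetical name and the nn.Linear backbones stand in for the stage 1 models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenFeaturesHead(nn.Module):
    def __init__(self, backbones, in_features, num_classes):
        super().__init__()
        self.backbones = nn.ModuleList(backbones)
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x):
        with torch.no_grad():  # no graph is stored for the frozen stage 1 part
            feats = torch.cat([b(x) for b in self.backbones], dim=1)
        # feats does not require grad, but self.fc's weights do,
        # so the output is part of a graph and backward() works
        return F.log_softmax(self.fc(feats), dim=1)


model = FrozenFeaturesHead([nn.Linear(8, 2) for _ in range(6)], 12, 6)
out = model(torch.randn(4, 8))
loss = F.nll_loss(out, torch.randint(0, 6, (4,)))
loss.backward()
print(out.requires_grad)                 # True
print(model.backbones[0].weight.grad)    # None
print(model.fc.weight.grad is not None)  # True
```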

out should require gradients, since it’s the trainable part of your stage 2 model.
Is the GPU RAM fine now? And is the CPU RAM still increasing or not?

Yup, the GPU RAM is fine, but the CPU RAM is still increasing.

I’m not seeing any obvious mistakes in your code.
Could you try to build from master? You can find the instructions here. It should be quite easy. Let me know if you encounter any problems while installing.

OK, so I managed to find a workaround for the increasing RAM: the solution was your earlier suggestion in my other post to do the standardization in the Dataset and return multiple inputs. I think it happened because I normalized my inputs inside the wrapper/ensemble model; maybe some variables were still alive and caused the memory leak.
Thank you for all your help!
It means a lot!
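Moving the per-model normalization out of the forward pass and into the Dataset, as described above, could be sketched like this: __getitem__ returns one normalized copy of the sample per stage 1 model, and the default collate function turns that list into a list of batched tensors. MultiNormDataset and the random tensors are hypothetical placeholders, not the poster’s actual code.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class MultiNormDataset(Dataset):
    """Returns one normalized copy of each sample per stage 1 model."""

    def __init__(self, data, targets, means, stds):
        self.data, self.targets = data, targets
        self.means, self.stds = means, stds

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        inputs = [(x - m) / s for m, s in zip(self.means, self.stds)]
        return inputs, self.targets[idx]


data = torch.randn(10, 8)
targets = torch.randint(0, 6, (10,))
means = [torch.full((8,), float(k)) for k in range(6)]  # hypothetical per-model stats
stds = [torch.ones(8) for _ in range(6)]
loader = DataLoader(MultiNormDataset(data, targets, means, stds), batch_size=4)
inputs, y = next(iter(loader))
print(len(inputs), inputs[0].shape)  # 6 torch.Size([4, 8])
```

Each model k then receives inputs[k] directly, so the forward pass no longer creates normalization intermediates.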

You’re welcome, I’m glad you’ve found the cause for the memory increase.
It’s interesting that the normalization seems to be the bad guy here.
I’ll have another look at it and try to find out why this happened.

I’m seeing the same issue in inference mode. I know that we need to do the standardization in the Dataset before feeding the inputs into the model.

So in my training code, I have:

# dataloader.py
self.transform = transforms.Compose([transforms.RandomResizedCrop(224),
                                     transforms.ToTensor(),  # Normalize expects a tensor, not a PIL image
                                     transforms.Normalize(mean=mean, std=std)])
# train.py
logits = self.model(imgs)   # ResNet101

That works fine. But we trained our model on 224x224 px crops, with some data augmentation to force the CNN to recognize image features/patterns in different regions of the images. At inference time, however, the image size can vary (e.g. 224x400 px or 400x224 px), since the model uses Global Average Pooling, so it’s a bad idea to apply CenterCrop or Resize to 224x224 px here.

So in my inference code, I have:

# dataloader.py
self.transform = transforms.Compose([transforms.Resize(224),  # resize the smallest edge to 224 px
                                     transforms.ToTensor(),   # Normalize expects a tensor, not a PIL image
                                     transforms.Normalize(mean=mean, std=std)])
# train.py
logits = self.model(imgs)   # ResNet101

The memory leak happens. Do you guys have any idea?
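One thing worth ruling out, offered as an assumption rather than a confirmed diagnosis: with torch.backends.cudnn.benchmark = True, cuDNN runs its algorithm search again for every new input shape, so images of varying size keep triggering new benchmarking. The sketch below runs variable-size inference with benchmark disabled, eval mode, torch.no_grad(), and a batch size of one (images of different sizes cannot be stacked into one batch). The tiny nn.Sequential model is a hypothetical stand-in for ResNet101; nn.AdaptiveAvgPool2d(1) plays the role of the global average pooling that lets any input size through.

```python
import torch
import torch.nn as nn

# Avoid a fresh cuDNN algorithm search for every new image size (no effect on CPU)
torch.backends.cudnn.benchmark = False

model = nn.Sequential(  # hypothetical tiny stand-in for ResNet101
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # global average pooling: any H x W works
    nn.Flatten(),
    nn.Linear(8, 5),
).eval()

sizes = [(224, 400), (400, 224), (224, 224)]
shapes = []
with torch.no_grad():
    for h, w in sizes:
        img = torch.randn(1, 3, h, w)  # batch of one, since sizes differ per image
        logits = model(img)
        shapes.append(tuple(logits.shape))
        print((h, w), tuple(logits.shape))  # (1, 5) each time
```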

I also had a RAM increase problem during inference. I solved it by setting pin_memory=False in the DataLoader. pin_memory=True didn’t cause any problems during training.
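For reference, pin_memory is a DataLoader argument: with pin_memory=True, each batch is copied into page-locked (pinned) host memory, which speeds up host-to-GPU copies, and pinned memory cannot be paged out by the OS. A minimal sketch of the setting described above, with a hypothetical random TensorDataset standing in for the real data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 20 samples with 8 features and binary labels
dataset = TensorDataset(torch.randn(20, 8), torch.randint(0, 2, (20,)))

# pin_memory=False keeps batches in ordinary (pageable) host memory instead of
# copying them into page-locked buffers
loader = DataLoader(dataset, batch_size=4, shuffle=False, pin_memory=False)

device = "cuda" if torch.cuda.is_available() else "cpu"
n_batches = 0
for inputs, targets in loader:
    inputs, targets = inputs.to(device), targets.to(device)
    n_batches += 1
print(n_batches)  # 5
```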