Memory allocation peaks on GPU when starting to run a model


I'm seeing a problem when running my model in PyTorch.
During the first forward iteration, the allocated GPU memory rises to a high value for a few moments and then settles at a fixed, much lower value.

My code is:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

# PriorBox, L2Norm and Detect come from the SSD repository.


class SSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers.  Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: for more details.

    Args:
        phase: (string) Can be "test" or "train"
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, base, extras, head, num_classes, cfg, det_transform=None, run_sequential=False):
        super(SSD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        # TODO: implement __call__ in PriorBox
        self.priorbox = PriorBox(cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        self.size = 512

        # SSD network
        self.vgg = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm = L2Norm(512, 20)
        self.extras = nn.ModuleList(extras)

        self.loc = nn.ModuleList(head[0])
        self.conf = nn.ModuleList(head[1])
        self.decode_variance = cfg['variance']

        self.run_sequential = run_sequential

        if phase == 'test':
            self.softmax = nn.Softmax()
            self.det_transform = det_transform
            if det_transform is not None:
                self.detect = Detect(num_classes, 0, 200, 0.01, 0.45, cfg, False)
            else:
                self.detect = Detect(num_classes, 0, 200, 0.01, 0.45, cfg)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.

        Args:
            x: input image or batch of images. Shape: [batch,3,300,300].

        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]

            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        sources = list()
        loc = list()
        conf = list()

        # apply vgg up to conv4_3 relu
        for k in range(23):
            x = self.vgg[k](x)

        # cache the L2-normalized conv4_3 output as the first source layer
        s = self.L2Norm(x)
        sources.append(s)

        # apply vgg up to fc7
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        sources.append(x)

        # apply extra layers and cache source layer outputs
        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                sources.append(x)

        # apply multibox head to source layers
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        # flatten each head output and concatenate along the prior dimension
        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

        if self.phase == "test":
            # self.softmax and self.detect only exist in the test phase
            loc_preds = loc.view(loc.size(0), -1, 4)
            conf_preds = self.softmax(conf.view(-1, self.num_classes))
            # argument truncated in the original post; x.data is assumed
            boxes = self.priors.type(type(x.data))
            output = self.detect(loc_preds, conf_preds, boxes)
        else:
            output = (
                loc.view(loc.size(0), -1, 4),
                conf.view(conf.size(0), -1, self.num_classes),
                self.priors,
            )
        return output

(This is a part of SSD PyTorch implementation, taken from

With a batch size of 1 and a 512x512 input image, the memory in the first iteration peaks at somewhere between 5 and 9 GB, then settles at about 1 GB and stays there in all subsequent iterations.
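A minimal sketch of how these per-iteration numbers could be checked from inside PyTorch, assuming a recent release that provides `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`. The small `nn.Sequential` model is only a hypothetical stand-in for the SSD network above, and the helper name `peak_memory_per_iteration` is made up for illustration. Note that these counters only cover memory handed out by PyTorch's caching allocator, so they will usually read lower than what `nvidia-smi` reports.

```python
import torch
import torch.nn as nn


def peak_memory_per_iteration(model, x, n_iters=3):
    """Run n_iters forward passes and return the peak allocated MiB of each one."""
    peaks = []
    use_cuda = x.is_cuda
    with torch.no_grad():
        for _ in range(n_iters):
            if use_cuda:
                # Start a fresh peak measurement for this iteration
                torch.cuda.reset_peak_memory_stats()
            model(x)
            if use_cuda:
                # Wait for all queued kernels before reading the counters
                torch.cuda.synchronize()
                peaks.append(torch.cuda.max_memory_allocated() / 2**20)
            else:
                # No GPU available: nothing to measure
                peaks.append(0.0)
    return peaks


# Hypothetical stand-in network; replace with the real SSD model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).to(device).eval()
x = torch.randn(1, 3, 512, 512, device=device)

peaks = peak_memory_per_iteration(model, x)
for i, p in enumerate(peaks):
    print(f"iteration {i}: peak allocated {p:.1f} MiB")
```

Comparing the first entry with the later ones would show whether the spike is visible to PyTorch's own allocator or only to the driver.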

I am confident that the memory spike comes from the code above. From my checks, it can originate in several different places in that code. I have also noticed that when I put time.sleep(1) between all operations in the code, the memory never reaches the high value and is stable from the beginning (of course, the iteration then takes forever to run :smile:).

From my understanding, it has something to do with CUDA's asynchronous behavior, but I couldn't figure out why it happens or how I can prevent it (if at all…).
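One way to test the asynchrony hypothesis without arbitrary sleeps might be to force a `torch.cuda.synchronize()` after every layer through forward hooks. This is only a sketch: the helper name `add_sync_hooks` and the tiny model are hypothetical, and it is meant as a diagnostic rather than a fix, since synchronizing after every layer slows the forward pass.

```python
import torch
import torch.nn as nn


def add_sync_hooks(model):
    """Attach a forward hook to every leaf module that waits for the GPU to finish."""

    def sync_hook(module, inputs, output):
        # Block the CPU until all queued CUDA work is done (no-op without a GPU)
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    handles = []
    for m in model.modules():
        if len(list(m.children())) == 0:  # leaf modules only
            handles.append(m.register_forward_hook(sync_hook))
    return handles


# Hypothetical stand-in network; replace with the real SSD model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1),
).to(device).eval()

handles = add_sync_hooks(model)
with torch.no_grad():
    out = model(torch.randn(1, 3, 64, 64, device=device))

# Remove the hooks again once the experiment is over
for h in handles:
    h.remove()
```

If the first-iteration peak disappears with these hooks in the same way it does with the sleeps, that would suggest it depends on how far the CPU runs ahead of the GPU.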

I want to prevent it because, for example, I would like to run two models in parallel on the same GPU. This behavior prevents the second model from allocating enough memory to start, even though the GPU has enough memory for the steady state (after the first iteration).

I would appreciate any help or insights.
Thanks in advance!



Hi, I would really appreciate any help or feedback here… :pray:

Anyone? Please tell me if this looks like normal behaviour to you. Have you encountered this phenomenon before?

I would really appreciate a response.