I am running into a problem when running my model in PyTorch.
During the first forward pass, the allocated GPU memory rises to a high value for a few moments and then settles at a fixed value.
My code is:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from layers import *  # PriorBox, L2Norm, Detect (layers package of ssd.pytorch)


class SSD(nn.Module):
    """Single Shot Multibox Architecture
    The network is composed of a base VGG network followed by the
    added multibox conv layers. Each multibox layer branches into
        1) conv2d for class conf scores
        2) conv2d for localization predictions
        3) associated priorbox layer to produce default bounding
           boxes specific to the layer's feature map size.
    See: https://arxiv.org/pdf/1512.02325.pdf for more details.

    Args:
        phase: (string) Can be "test" or "train"
        base: VGG16 layers for input, size of either 300 or 500
        extras: extra layers that feed to multibox loc and conf layers
        head: "multibox head" consists of loc and conf conv layers
    """

    def __init__(self, phase, base, extras, head, num_classes, cfg,
                 det_transform=None, run_sequential=False):
        super(SSD, self).__init__()
        self.phase = phase
        self.num_classes = num_classes
        # TODO: implement __call__ in PriorBox
        self.priorbox = PriorBox(cfg)
        self.priors = Variable(self.priorbox.forward(), volatile=True)
        self.size = 512

        # SSD network
        self.vgg = nn.ModuleList(base)
        # Layer learns to scale the l2 normalized features from conv4_3
        self.L2Norm = L2Norm(512, 20)
        self.extras = nn.ModuleList(extras)

        self.loc = nn.ModuleList(head)
        self.conf = nn.ModuleList(head)
        self.decode_variance = cfg['variance']
        self.run_sequential = run_sequential

        if phase == 'test':
            self.softmax = nn.Softmax()
            self.det_transform = det_transform
            if det_transform is not None:
                self.detect = Detect(num_classes, 0, 200, 0.01, 0.45, cfg, False)
            else:
                self.detect = Detect(num_classes, 0, 200, 0.01, 0.45, cfg)

    def forward(self, x):
        """Applies network layers and ops on input image(s) x.

        Args:
            x: input image or batch of images. Shape: [batch,3*batch,300,300].

        Return:
            Depending on phase:
            test:
                Variable(tensor) of output class label predictions,
                confidence score, and corresponding location predictions for
                each object detected. Shape: [batch,topk,7]

            train:
                list of concat outputs from:
                    1: confidence layers, Shape: [batch*num_priors,num_classes]
                    2: localization layers, Shape: [batch,num_priors*4]
                    3: priorbox layers, Shape: [2,num_priors*4]
        """
        sources = list()
        loc = list()
        conf = list()

        # apply vgg up to conv4_3 relu
        for k in range(23):
            x = self.vgg[k](x)

        s = self.L2Norm(x)
        sources.append(s)

        # apply vgg up to fc7
        for k in range(23, len(self.vgg)):
            x = self.vgg[k](x)
        sources.append(x)

        # apply extra layers and cache source layer outputs
        for k, v in enumerate(self.extras):
            x = F.relu(v(x), inplace=True)
            if k % 2 == 1:
                sources.append(x)

        # apply multibox head to source layers
        for (x, l, c) in zip(sources, self.loc, self.conf):
            loc.append(l(x).permute(0, 2, 3, 1).contiguous())
            conf.append(c(x).permute(0, 2, 3, 1).contiguous())

        # flatten and concatenate the per-layer predictions
        loc = torch.cat([o.view(o.size(0), -1) for o in loc], 1)
        conf = torch.cat([o.view(o.size(0), -1) for o in conf], 1)

        loc_preds = loc.view(loc.size(0), -1, 4)
        conf_preds = self.softmax(conf.view(-1, self.num_classes))
        boxes = self.priors.type(type(x.data))

        if self.phase == "test":
            output = self.detect(loc_preds, conf_preds, boxes)
        else:
            output = (
                loc.view(loc.size(0), -1, 4),
                conf.view(conf.size(0), -1, self.num_classes),
                self.priors
            )
        return output
(This is a part of SSD PyTorch implementation, taken from https://github.com/amdegroot/ssd.pytorch)
With a batch size of 1 and an input image size of 512x512, the memory in the first iteration peaks at between 5 and 9 GB, then settles at about 1 GB and stays there in subsequent iterations.
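For anyone who wants to reproduce the numbers, here is a minimal sketch of how the peak and steady-state figures could be measured per iteration. It assumes a reasonably recent PyTorch that provides `torch.cuda.reset_peak_memory_stats`; `measure_forward_memory` and `toy_model` are hypothetical names for illustration, and the small convolutional stack only stands in for the SSD model. On a machine without CUDA the sketch reports zeros.

```python
import torch
import torch.nn as nn


def measure_forward_memory(model, x, iterations=3):
    """Return a list of (peak_bytes, allocated_bytes), one pair per forward pass.

    Without CUDA, both numbers are reported as 0.
    """
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = model.to(device).eval()
    x = x.to(device)
    stats = []
    with torch.no_grad():
        for _ in range(iterations):
            if use_cuda:
                # Start a fresh peak measurement for this iteration.
                torch.cuda.reset_peak_memory_stats()
            model(x)
            if use_cuda:
                # Wait for all queued kernels before reading the counters.
                torch.cuda.synchronize()
                peak = torch.cuda.max_memory_allocated()
                current = torch.cuda.memory_allocated()
            else:
                peak, current = 0, 0
            stats.append((peak, current))
    return stats


# Hypothetical stand-in for the SSD model (assumption, not the real network).
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 4, 3, padding=1),
)
stats = measure_forward_memory(toy_model, torch.randn(1, 3, 64, 64))
for i, (peak, current) in enumerate(stats):
    print(f"iteration {i}: peak={peak / 2**20:.1f} MiB, "
          f"allocated={current / 2**20:.1f} MiB")
```

These counters only cover memory managed by PyTorch's caching allocator, so the figures shown by nvidia-smi can be higher.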
I am confident that the memory allocation comes from the code above. From my checks, it can originate in several different places in that code. I have also noticed that when I put time.sleep(1) between all operations in the code, the memory does not reach the high value and is stable from the beginning (of course, the iteration then takes forever to run).
From my understanding, it has something to do with CUDA's asynchronous behavior, but I couldn't figure out why it happens or how I can prevent it (if at all…).
The reason I want to prevent it is, for example, to run two models in parallel on the same GPU. This behavior prevents the second model from allocating enough memory to start, even though there is enough memory for its steady state (after the first iteration).
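A more targeted way to test the asynchronous-execution hypothesis than time.sleep(1) might be an explicit `torch.cuda.synchronize()` after each layer. The sketch below only illustrates that experiment and is not a confirmed fix; `forward_with_sync` and the toy layers are hypothetical stand-ins for the loops in `SSD.forward`.

```python
import torch
import torch.nn as nn


def forward_with_sync(layers, x):
    """Apply layers one by one, blocking until each one has finished on the GPU."""
    for layer in layers:
        x = layer(x)
        if x.is_cuda:
            # Block the host until every kernel queued so far has completed,
            # so the next layer is not launched ahead of time.
            torch.cuda.synchronize()
    return x


# Hypothetical toy layers standing in for self.vgg (assumption).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
layers = nn.ModuleList([
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 4, 3, padding=1),
]).to(device)

with torch.no_grad():
    out = forward_with_sync(layers, torch.randn(1, 3, 64, 64, device=device))
print(tuple(out.shape))
```

If the peak disappears with synchronization alone, that would point at kernel launch ordering rather than at the sleep duration itself.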
I would appreciate any help or insights.
Thanks in advance!