CUDA out of memory when num_workers >= 2

When I train my network, it works fine with num_workers = 0 or num_workers = 1, but it runs into CUDA out of memory when num_workers >= 2.

How can I solve this problem? Or is the only option to change to a better GPU?

If you are loading the data onto the CPU (as would be the usual workflow), the number of workers should not change the GPU memory usage.
Could you post your Dataset and how you are using the DataLoader such that your device is running out of memory?
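
For reference, here is a minimal sketch of that usual workflow (with an illustrative dummy Dataset, assuming a GPU is available): the Dataset returns CPU tensors only, the workers prepare batches in host memory, and GPU memory is allocated solely by the explicit .cuda() call in the loop, so num_workers should not affect torch.cuda.memory_allocated().

# Minimal sketch with an illustrative Dataset: workers only prepare batches in
# host (CPU) memory; GPU memory is allocated by the explicit .cuda() calls below.
import torch
import torch.utils.data as data


class RandomImageDataset(data.Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, index):
        # CPU tensors only; no CUDA calls inside the Dataset
        return torch.randn(3, 300, 300), torch.randint(0, 20, (1,))


loader = data.DataLoader(RandomImageDataset(), batch_size=8, num_workers=2)
for images, labels in loader:
    images = images.cuda()   # GPU memory is allocated here, not in the workers
    labels = labels.cuda()
    print(torch.cuda.memory_allocated() / 1024**2)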

Thanks for your comment. :upside_down_face:
I am referring to https://github.com/amdegroot/ssd.pytorch.

  • Dataset:
import os.path as osp
import xml.etree.ElementTree as ET

import cv2
import numpy as np
import torch
import torch.utils.data as data


class VOCDetection(data.Dataset):
    def __init__(self, root,
                 image_sets=[('2007', 'trainval'), ('2012', 'trainval')],
                 transform=None, target_transform=VOCAnnotationTransform(),
                 dataset_name='VOC0712'):
        self.root = root
        self.image_set = image_sets
        self.transform = transform
        self.target_transform = target_transform
        self.name = dataset_name
        self._annopath = osp.join('%s', 'Annotations', '%s.xml')
        self._imgpath = osp.join('%s', 'JPEGImages', '%s.jpg')
        self.ids = list()
        for (year, name) in image_sets:
            rootpath = osp.join(self.root, 'VOC' + year)
            for line in open(osp.join(rootpath, 'ImageSets', 'Main', name + '.txt')):
                self.ids.append((rootpath, line.strip()))

    def __getitem__(self, index):
        im, gt, h, w = self.pull_item(index)

        return im, gt

    def __len__(self):
        return len(self.ids)

    def pull_item(self, index):
        img_id = self.ids[index]

        target = ET.parse(self._annopath % img_id).getroot()
        img = cv2.imread(self._imgpath % img_id)
        height, width, channels = img.shape

        if self.target_transform is not None:
            target = self.target_transform(target, width, height)

        if self.transform is not None:
            target = np.array(target)
            img, boxes, labels = self.transform(img, target[:, :4], target[:, 4])
            # to rgb
            img = img[:, :, (2, 1, 0)]
            # img = img.transpose(2, 0, 1)
            target = np.hstack((boxes, np.expand_dims(labels, axis=1)))
        return torch.from_numpy(img).permute(2, 0, 1), target, height, width
        # return torch.from_numpy(img), target, height, width

    def pull_image(self, index):
        img_id = self.ids[index]
        return cv2.imread(self._imgpath % img_id, cv2.IMREAD_COLOR)

    def pull_anno(self, index):
        img_id = self.ids[index]
        anno = ET.parse(self._annopath % img_id).getroot()
        gt = self.target_transform(anno, 1, 1)
        return img_id[1], gt

    def pull_tensor(self, index):
        return torch.Tensor(self.pull_image(index)).unsqueeze_(0)
  • Using the DataLoader:
    data_loader = data.DataLoader(dataset, args.batch_size,
                                  num_workers=args.num_workers,
                                  shuffle=True, collate_fn=detection_collate,
                                  pin_memory=True)
    # create batch iterator
    batch_iterator = iter(data_loader)
    for iteration in range(args.start_iter, cfg['max_iter']):
        if args.visdom and iteration != 0 and (iteration % epoch_size == 0):
            update_vis_plot(epoch, loc_loss, conf_loss, epoch_plot, None,
                            'append', epoch_size)
            # reset epoch loss counters
            loc_loss = 0
            conf_loss = 0
            epoch += 1

        if iteration in cfg['lr_steps']:
            step_index += 1
            adjust_learning_rate(optimizer, args.gamma, step_index)

        # load train data
        images, targets = next(batch_iterator)

        if args.cuda:
            images = Variable(images.cuda())
            targets = [Variable(ann.cuda(), volatile=True) for ann in targets]
        else:
            images = Variable(images)
            targets = [Variable(ann, volatile=True) for ann in targets]
        # forward
        t0 = time.time()
        out = net(images)
        # backprop
        optimizer.zero_grad()
        loss_l, loss_c = criterion(out, targets)
        loss = loss_l + loss_c
        loss.backward()
        optimizer.step()
        t1 = time.time()
        loc_loss += loss_l.data[0]
        conf_loss += loss_c.data[0]

        if iteration % 10 == 0:
            print('timer: %.4f sec.' % (t1 - t0))
            print('iter ' + repr(iteration) + ' || Loss: %.4f ||' % (loss.data[0]), end=' ')

        if args.visdom:
            update_vis_plot(iteration, loss_l.data[0], loss_c.data[0],
                            iter_plot, epoch_plot, 'append')

        if iteration != 0 and iteration % 5000 == 0:
            print('Saving state, iter:', iteration)
            torch.save(ssd_net.state_dict(), 'weights/ssd300_COCO_' +
                       repr(iteration) + '.pth')
    torch.save(ssd_net.state_dict(),
               args.save_folder + '' + args.dataset + '.pth')

Thanks for the code. I cannot see anything obviously wrong.

Could you create a simple DataLoader loop with num_workers=0 and num_workers>=2 and compare the memory usage via:

loader = DataLoader(dataset, num_workers=...)   # use 0 and >=2 in the next run
print(torch.cuda.memory_allocated()/1024**2)

for images, targets in loader:
    images = images.cuda()
    targets = [t.cuda() for t in targets]  # targets is a list when using detection_collate
    print(torch.cuda.memory_allocated()/1024**2)

and post the results, please?

PS: Variables are deprecated since PyTorch 0.4 and the usage of .data is discouraged, as it might have unwanted side effects. You can use tensors instead of Variables and call loss.detach(), if you want to store the tensor without the computation graph.
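
As a sketch (based on the loop you posted, reusing the same net, criterion, optimizer, and batch_iterator), the training step without Variables could look roughly like this:

# Sketch of the training step without Variables (PyTorch >= 0.4),
# reusing the net, criterion, optimizer, and batch_iterator from your code.
images, targets = next(batch_iterator)
if args.cuda:
    images = images.cuda()
    targets = [ann.cuda() for ann in targets]

out = net(images)
optimizer.zero_grad()
loss_l, loss_c = criterion(out, targets)
loss = loss_l + loss_c
loss.backward()
optimizer.step()

# use .item() (or .detach()) instead of .data[0] to log the losses
loc_loss += loss_l.item()
conf_loss += loss_c.item()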

Thanks for your comment. :upside_down_face: :upside_down_face:

num_workers=0

after loader = DataLoader(dataset, num_workers=...):
torch.cuda.memory_allocated()/1024**2
103.033203125

inside for images, targets in loader:
torch.cuda.memory_allocated()/1024**2
119.5205078125
336.71630859375
336.76123046875
336.79931640625

num_workers=2

after loader = DataLoader(dataset, num_workers=...):
torch.cuda.memory_allocated()/1024**2
103.033203125

inside for images, targets in loader:
torch.cuda.memory_allocated()/1024**2
119.5205078125
336.71630859375
336.76123046875
336.79931640625

I also found that sometimes it actually runs even with --num_workers=2 or --num_workers=4 in debug mode.

PyTorch version=1.5.0
PS: Maybe I should update the code regarding Variables and .data first.

Thanks for the debugging.
It seems that the memory usage is not increased by the number of workers.

Yes, start by updating the code and then let's continue narrowing down where the OOM issue is created.
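
One way to see where the usage grows (a sketch, meant to be called inside the training loop you posted) would be to log the current and peak allocations per iteration:

import torch

def log_gpu_memory(iteration):
    # current and peak GPU allocations in MB; call once per iteration
    print('iter %d | allocated %.1f MB | peak %.1f MB' % (
        iteration,
        torch.cuda.memory_allocated() / 1024**2,
        torch.cuda.max_memory_allocated() / 1024**2))

Calling this every few iterations should show whether the peak usage creeps close to the GPU's capacity before the OOM is raised.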

In my OOM case, changing num_workers for the DataLoader from 10 to 0 doesn't change when the OOM is raised.

Good, since the GPU memory allocation should not change in the common use case where no CUDA tensors are used in the Dataset.
