When I train my network, it works well with num_workers = 0 or num_workers = 1, but it raises CUDA out of memory when num_workers >= 2.
How can I solve this problem?
Or is my only option to switch to a better GPU?
If you are loading the data onto the CPU (as would be the usual workflow), the number of workers should not change the usage of the GPU memory.
Could you post your Dataset and how you are using the DataLoader such that your device is running out of memory?
Thanks for your comment.
I am referring to https://github.com/amdegroot/ssd.pytorch.
Dataset:

    class VOCDetection(data.Dataset):
        def __init__(self, root,
                     image_sets=[('2007', 'trainval'), ('2012', 'trainval')],
                     transform=None, target_transform=VOCAnnotationTransform(),
                     dataset_name='VOC0712'):
            self.root = root
            self.image_set = image_sets
            self.transform = transform
            self.target_transform = target_transform
            self.name = dataset_name
            self._annopath = osp.join('%s', 'Annotations', '%s.xml')
            self._imgpath = osp.join('%s', 'JPEGImages', '%s.jpg')
            self.ids = list()
            for (year, name) in image_sets:
                rootpath = osp.join(self.root, 'VOC' + year)
                for line in open(osp.join(rootpath, 'ImageSets', 'Main', name + '.txt')):
                    self.ids.append((rootpath, line.strip()))

        def __getitem__(self, index):
            im, gt, h, w = self.pull_item(index)
            return im, gt

        def __len__(self):
            return len(self.ids)

        def pull_item(self, index):
            img_id = self.ids[index]
            target = ET.parse(self._annopath % img_id).getroot()
            img = cv2.imread(self._imgpath % img_id)
            height, width, channels = img.shape
            if self.target_transform is not None:
                target = self.target_transform(target, width, height)
            if self.transform is not None:
                target = np.array(target)
                img, boxes, labels = self.transform(img, target[:, :4], target[:, 4])
                # to rgb
                img = img[:, :, (2, 1, 0)]
                # img = img.transpose(2, 0, 1)
                target = np.hstack((boxes, np.expand_dims(labels, axis=1)))
            return torch.from_numpy(img).permute(2, 0, 1), target, height, width
            # return torch.from_numpy(img), target, height, width

        def pull_image(self, index):
            img_id = self.ids[index]
            return cv2.imread(self._imgpath % img_id, cv2.IMREAD_COLOR)

        def pull_anno(self, index):
            img_id = self.ids[index]
            anno = ET.parse(self._annopath % img_id).getroot()
            gt = self.target_transform(anno, 1, 1)
            return img_id[1], gt

        def pull_tensor(self, index):
            return torch.Tensor(self.pull_image(index)).unsqueeze_(0)
DataLoader:

    data_loader = data.DataLoader(dataset, args.batch_size,
                                  num_workers=args.num_workers,
                                  shuffle=True, collate_fn=detection_collate,
                                  pin_memory=True)
    # create batch iterator
    batch_iterator = iter(data_loader)
    for iteration in range(args.start_iter, cfg['max_iter']):
        if args.visdom and iteration != 0 and (iteration % epoch_size == 0):
            update_vis_plot(epoch, loc_loss, conf_loss, epoch_plot, None,
                            'append', epoch_size)
            # reset epoch loss counters
            loc_loss = 0
            conf_loss = 0
            epoch += 1

        if iteration in cfg['lr_steps']:
            step_index += 1
            adjust_learning_rate(optimizer, args.gamma, step_index)

        # load train data
        images, targets = next(batch_iterator)

        if args.cuda:
            images = Variable(images.cuda())
            targets = [Variable(ann.cuda(), volatile=True) for ann in targets]
        else:
            images = Variable(images)
            targets = [Variable(ann, volatile=True) for ann in targets]
        # forward
        t0 = time.time()
        out = net(images)
        # backprop
        optimizer.zero_grad()
        loss_l, loss_c = criterion(out, targets)
        loss = loss_l + loss_c
        loss.backward()
        optimizer.step()
        t1 = time.time()
        loc_loss += loss_l.data[0]
        conf_loss += loss_c.data[0]

        if iteration % 10 == 0:
            print('timer: %.4f sec.' % (t1 - t0))
            print('iter ' + repr(iteration) + ' || Loss: %.4f ||' % (loss.data[0]), end=' ')

        if args.visdom:
            update_vis_plot(iteration, loss_l.data[0], loss_c.data[0],
                            iter_plot, epoch_plot, 'append')

        if iteration != 0 and iteration % 5000 == 0:
            print('Saving state, iter:', iteration)
            torch.save(ssd_net.state_dict(), 'weights/ssd300_COCO_' +
                       repr(iteration) + '.pth')
    torch.save(ssd_net.state_dict(),
               args.save_folder + '' + args.dataset + '.pth')
Thanks for the code. I cannot see anything obviously wrong.
Could you create a simple DataLoader loop with num_workers=0 and num_workers>=2 and compare the memory usage via:

    loader = DataLoader(dataset, num_workers=...)  # use 0 and >=2 in the next run
    print(torch.cuda.memory_allocated()/1024**2)
    for images, targets in loader:
        images = images.cuda()
        targets = targets.cuda()
        print(torch.cuda.memory_allocated()/1024**2)

and post the result please?
PS: Variables are deprecated since PyTorch 0.4 and the usage of .data is discouraged, as it might have unwanted side effects. You can use tensors instead of Variables and call loss.detach(), if you want to store the tensor without the computation graph.
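As a rough illustration of the PS above, here is a minimal sketch of a training step written with plain tensors. The tiny `nn.Linear` model, the `MSELoss` criterion, and the random data are hypothetical stand-ins for the SSD network and its loss, not the code from the repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the SSD network, its loss, and the optimizer.
net = nn.Linear(4, 2)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net.to(device)

# Plain tensors: no Variable wrapper is needed since PyTorch 0.4.
images = torch.randn(8, 4).to(device)
targets = torch.randn(8, 2).to(device)

running_loss = 0.0
for _ in range(3):
    optimizer.zero_grad()
    loss = criterion(net(images), targets)
    loss.backward()
    optimizer.step()
    # .item() gives a Python float, so the running sum holds no graph.
    running_loss += loss.item()

# detach() gives a tensor that is cut off from the computation graph.
stored = loss.detach()
```

Accumulating `loss.item()` instead of the loss tensor itself keeps the running sum from holding on to the graph of every iteration.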
Thanks for your comment.
num_workers=0
After loader = DataLoader(dataset, num_workers=...), torch.cuda.memory_allocated()/1024**2 prints:
    103.033203125
Inside for images, targets in loader:, torch.cuda.memory_allocated()/1024**2 prints:
    119.5205078125
    336.71630859375
    336.76123046875
    336.79931640625
    …
num_workers=2
After loader = DataLoader(dataset, num_workers=...), torch.cuda.memory_allocated()/1024**2 prints:
    103.033203125
Inside for images, targets in loader:, torch.cuda.memory_allocated()/1024**2 prints:
    119.5205078125
    336.71630859375
    336.76123046875
    336.79931640625
    …
I also found that it sometimes actually runs, even with --num_workers=2 or --num_workers=4, in debug mode.
PyTorch version=1.5.0
PS: Maybe I should modify the code, e.g. the usage of Variables and .data.
Thanks for the debugging.
It seems that the memory usage is not increased by the number of workers.
Yes, start with updating the code and let’s then continue to narrow down where the OOM issue is created.
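One possible way to narrow it down, as a minimal sketch: reset the peak-memory counter before each phase of a training step and read it back afterwards. The helper names (`peak_mib`, `reset_peak`, `profile_step`) and the toy model are hypothetical; only the `torch.cuda` calls are part of PyTorch.

```python
import torch
import torch.nn as nn


def peak_mib():
    """Peak allocated GPU memory in MiB since the last reset (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / 1024**2


def reset_peak():
    """Reset the peak-memory counter if a GPU is present."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()


def profile_step(net, criterion, optimizer, images, targets):
    """Run one training step and record the peak memory of each phase."""
    report = {}

    reset_peak()
    loss = criterion(net(images), targets)
    report["forward"] = peak_mib()

    reset_peak()
    optimizer.zero_grad()
    loss.backward()
    report["backward"] = peak_mib()

    reset_peak()
    optimizer.step()
    report["optimizer"] = peak_mib()
    return report


# Toy usage with a hypothetical model; swap in the real net and batch.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
net = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
report = profile_step(net, nn.MSELoss(), optimizer,
                      torch.randn(8, 16, device=device),
                      torch.randn(8, 4, device=device))
print(report)
```

The phase with the largest number is the first place to look, for example by lowering the batch size or checking whether tensors from earlier iterations are still referenced.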
In my OOM case, changing num_workers for the DataLoader from 10 to 0 doesn't change the point at which the OOM is raised.
Good, since the GPU memory allocations should not be changed in the common use case where no CUDATensors are used in the Dataset.
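A minimal sketch of that common use case, assuming a hypothetical `ToyDataset` in place of the real one: `__getitem__` only creates CPU tensors, and the single copy to the GPU happens in the main process, so the number of workers has no say in GPU allocations.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset that only ever creates CPU tensors."""

    def __len__(self):
        return 16

    def __getitem__(self, index):
        # Stay on the CPU here: worker processes should not touch CUDA.
        image = torch.full((3, 8, 8), float(index))
        label = torch.tensor(index % 2)
        return image, label


def run_epoch(num_workers=0):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=num_workers,
                        pin_memory=torch.cuda.is_available())
    shapes = []
    for images, labels in loader:
        # Batches arrive as CPU tensors (pinned if a GPU is present).
        assert images.device.type == "cpu"
        # The host-to-device copy happens here, in the main process only.
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        shapes.append(tuple(images.shape))
    return shapes


if __name__ == "__main__":
    print(run_epoch(num_workers=0))
```

Calling `run_epoch` with a larger `num_workers` should yield the same batches; only the CPU-side loading is parallelized.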