Out of Memory error

Hi, I frequently run out of memory since I am working on images and computer vision tasks like segmentation. I tried apex before, but sometimes it shows really weird behavior that I can't understand the reason for. I also tried DataParallel, which doesn't help too much. The last solution was DistributedDataParallel. For this, the PyTorch documentation was very confusing, and I used some other tutorials, but I still haven't succeeded in using it. Does anyone have other solutions or tricks to get rid of the OOM error? I also deleted the tensors that I didn't need anymore.

Thank you in advance for your help.

Can you tell us the model you are using, image size, and batch size?

What is your batch size? Using a smaller batch size may solve the problem. Also, how many GPUs are you using?
If you share the code, we could help.

I am using the Carvana dataset, which has images of size (1280, 1918, 3), with DeepLabV3 and a ResNet-101 backbone. The batch size is 2, since the images are really large.
Here is the complete code.

import os

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision.models.segmentation import deeplabv3_resnet101

class carnava(Dataset):
    def __init__(self, path, transform):
        self.path = path
        self.PATH2 = os.path.join(self.path, 'train_masks_png')
        img_list = [img for img in os.listdir(os.path.join(self.path, 'train'))]
        mask_list = [mask for mask in os.listdir(os.path.join(self.path, 'train_masks'))]
        # Convert the original masks to PNG once, up front (Image.save returns None,
        # so mask_png_list only exists as a side effect of the conversion).
        mask_png_list = [Image.open(os.path.join(self.path, 'train_masks', ms)).save(
                             os.path.join(self.PATH2, f'{ms[:-4]}.png'))
                         for ms in mask_list]
        self.img_list = img_list
        self.mask_list = mask_list
        self.mask_png_list = mask_png_list
        self.transform = transform
        self.len = len(img_list)

    def __getitem__(self, index):
        img = Image.open(os.path.join(self.path, 'train', self.img_list[index]))
        mask = Image.open(os.path.join(self.PATH2, f'{self.mask_list[index][:-4]}.png'))
        data = {'image': np.array(img), 'mask': np.array(mask)}
        data_tr = self.transform(**data)
        img_tr = data_tr['image']
        mask_tr = data_tr['mask']
        return img_tr, mask_tr

    def __len__(self):
        return self.len
# train_set / val_set: carnava splits (not shown in the original post)
train_loader = DataLoader(train_set, batch_size=2, shuffle=True, num_workers=4)
val_loader = DataLoader(val_set, batch_size=4, shuffle=False, num_workers=4)

def get_model(name='deeplab'):
    model = deeplabv3_resnet101(pretrained=True, progress=True)
    model.classifier[4] = nn.Conv2d(in_channels=256, out_channels=1, kernel_size=1, stride=1)
    return model
def dice_loss(pred, target):
    smooth = 1.
    iflat = pred.contiguous().view(-1)
    tflat = target.contiguous().view(-1)
    intersection = (iflat * tflat).sum()
    A_sum = torch.sum(iflat * iflat)
    B_sum = torch.sum(tflat * tflat)
    return 1 - ((2. * intersection + smooth) / (A_sum + B_sum + smooth) )
def train(epoch, model, optimizer, criterion, device, phase):
    if phase == 'train':
        model.train()
        dataloader = train_loader
    else:
        model.eval()
        dataloader = val_loader
    loss_total = 0.0   # these accumulators were missing in the original snippet
    dis_loss = 0.0
    for i, (img, mask) in enumerate(dataloader):
        img = img.to(device)
        mask = mask.to(device)
        # Only build the autograd graph in the training phase; in eval this also
        # avoids keeping activations around, which saves GPU memory.
        with torch.set_grad_enabled(phase == 'train'):
            target = model(img)['out']
            loss = criterion(target, mask)
        loss_total += loss.item()
        if phase == 'train':
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        target = target.detach()
        # .item() keeps a Python float in the accumulator instead of a CUDA tensor.
        dis_loss += dice_loss(target, mask).item()
    return loss_total / len(dataloader), dis_loss / len(dataloader)
def main(num_epochs):
    device = torch.device('cuda:5')
    model = get_model('deeplab').to(device)
    optimizer = optim.Adam(model.parameters(), lr=0.1)
    criterion = torch.nn.BCEWithLogitsLoss().to(device)
    for epoch in range(num_epochs):
        train_loss, dice_loss_t = train(epoch, model, optimizer, criterion, device, 'train')
        val_loss, dice_loss_v = train(epoch, model, optimizer, criterion, device, 'val')

if __name__ == "__main__":
    main(10)

CUDA out of memory. Tried to allocate 2.00 MiB (GPU 5; 10.76 GiB total capacity; 112.99 MiB already allocated; 5.25 MiB free; 118.00 MiB reserved in total by PyTorch)
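
For reference, the amount of memory PyTorch itself is holding on that device can be checked like this. This is only a generic sketch for inspecting the numbers reported in the error above (the device index is taken from that message):

import torch

device = torch.device('cuda:5')  # the GPU from the error message above
print(torch.cuda.memory_allocated(device) / 1024**2, 'MiB allocated by tensors')
print(torch.cuda.memory_reserved(device) / 1024**2, 'MiB reserved by the caching allocator')
print(torch.cuda.memory_summary(device))  # detailed per-pool breakdown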

OOM when using DDP would lead to de-synchronization across processes in the same group, which in turn would cause a hang or crash. TorchElastic is built to solve this problem: when there is an error, it will destroy all DDP instances on all processes and then re-construct a new gang. Is this sufficient for your use case?
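
Since the original question mentions struggling with the DistributedDataParallel setup itself, here is a minimal single-node, one-process-per-GPU sketch. It is a generic outline rather than anything specific to this thread; the mp.spawn launcher, the address/port values, and the reuse of get_model from the code above are assumptions:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per GPU; NCCL is the usual backend for CUDA tensors.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = get_model('deeplab').to(rank)      # model factory from the code above
    ddp_model = DDP(model, device_ids=[rank])  # gradients are synced automatically

    # ... build DataLoaders with a DistributedSampler and run the usual train loop ...

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)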

So you mean that I can use DistributedDataParallel together with TorchElastic to recover from a crash in case it happens?

Yes, it should. But you might need some code changes in the application to configure TorchElastic and comply with its API. cc TorchElastic author @Kiuk_Chung

Thanks @mrshenli! @887574002, there's not much you'd have to do to make your script “TorchElastic compliant”; take a look at http://pytorch.org/elastic/0.2.0/train_script.html for the details. In order to minimize lost work, you'd have to checkpoint at an interval and ensure that your train script can load from the most recent checkpoint.

Aside from this, you'd have to set up an etcd server, which is usually trivial for most users but depends on your specific runtime environment.
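
A minimal sketch of the checkpoint-at-interval / resume-from-latest pattern mentioned above, using plain torch.save/torch.load. The directory name, file naming scheme, and checkpoint contents are just illustrative assumptions, not the TorchElastic API itself:

import os
import torch

CKPT_DIR = 'checkpoints'  # hypothetical location

def save_checkpoint(epoch, model, optimizer):
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()},
               os.path.join(CKPT_DIR, f'ckpt_{epoch:04d}.pt'))

def load_latest_checkpoint(model, optimizer, device):
    # Returns the epoch to resume from; 0 if no checkpoint exists yet.
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith('.pt')) if os.path.isdir(CKPT_DIR) else []
    if not ckpts:
        return 0
    state = torch.load(os.path.join(CKPT_DIR, ckpts[-1]), map_location=device)
    model.load_state_dict(state['model_state'])
    optimizer.load_state_dict(state['optimizer_state'])
    return state['epoch'] + 1

# In main(): start_epoch = load_latest_checkpoint(model, optimizer, device),
# then call save_checkpoint(epoch, model, optimizer) at the end of every epoch,
# so a restarted worker picks up from the most recent checkpoint.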

One common thing I have faced with out-of-memory errors is having a lot of num_workers (> 2). The more workers the DataLoader uses, the more memory it takes, and I have seen memory usage increase drastically.
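
If you want to test that, the change is only in the loader arguments. This reuses train_set/val_set from the code above, and the exact worker count is just an example:

# Fewer worker processes -> fewer extra batches prefetched and held in memory.
train_loader = DataLoader(train_set, batch_size=2, shuffle=True, num_workers=2)
val_loader = DataLoader(val_set, batch_size=4, shuffle=False, num_workers=2)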