ResNet18 not working on 2 GPUs

I need some help. I'm a bit new to using larger batch sizes on the GPU. My dataset is quite big (about 500K images), so I need a bigger batch size and a larger net, but I have not been able to make this work so far.

train_loader = DataLoader(dset,batch_size=16,shuffle=True,num_workers=4)# pin_memory=True # CUDA only

test_loader = DataLoader(dset_test,batch_size=32,shuffle=False,num_workers=4)# pin_memory=True # CUDA only

use_cuda = torch.cuda.is_available()
model = torchvision.models.resnet18(pretrained=True)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 43)
if use_cuda:
    model.cuda()
    model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count()))
    cudnn.benchmark = True

But nothing seems to work. Every time it throws a CUDA memory error.
I want to know whether my GPUs are simply not big enough, or what my issue is. I need a higher batch size.

Traceback (most recent call last):
  File "hckr_rnk.py", line 171, in <module>
    train(model, device, train_loader, optimizer, 50,test_loader)
  File "hckr_rnk.py", line 105, in train
    output = model(data)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 123, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 133, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
    raise output
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
    output = module(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/models/resnet.py", line 144, in forward
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/models/resnet.py", line 77, in forward
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 66, in forward
    exponential_average_factor, self.eps)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1254, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA error: out of memory

Use a smaller batch size and try SGD as the optimizer.
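For example, something along these lines (only a sketch; dset and model are the objects from the code you posted, and the batch size of 8 is just an illustration):

from torch.utils.data import DataLoader
import torch.optim as optim

# Smaller batches reduce peak activation memory per GPU.
train_loader = DataLoader(dset, batch_size=8, shuffle=True, num_workers=4)
# Plain SGD keeps at most one momentum buffer per parameter,
# unlike e.g. Adam, which stores two extra buffers per parameter.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)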

Thank you for your response.

I used SGD before and it's still the same; it works with a batch size of 10.

But I can't decrease my batch size, since my dataset is very large (almost 500k images) and it has 43 classes. With a batch size of 10 it doesn't learn well.

So I am looking for any way to increase the batch size, or I want to know whether the 20GB of memory across my 2 GPUs is not enough to run even the smallest ResNet.

Some optimizers require memory to work. That's why I recommend SGD if you have memory issues. In addition, I see that you are using 10 times more memory on one GPU. Check that you are not allocating garbage on the GPU.

Lastly, I would say you should be able to use a larger batch size. Check that imbalanced usage, because it's not normal.
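To give a rough idea of what I mean by optimizer memory, here is a small standalone sketch (not from your script) that counts the extra state Adam keeps for a ResNet18 after a single step; plain SGD without momentum keeps none of it:

import torch
import torchvision
import torch.optim as optim

model = torchvision.models.resnet18()
n_params = sum(p.numel() for p in model.parameters())

optimizer = optim.Adam(model.parameters())
loss = model(torch.randn(2, 3, 224, 224)).sum()
loss.backward()
optimizer.step()

# Adam stores exp_avg and exp_avg_sq for every parameter, i.e. roughly
# two extra model-sized copies on top of the gradients.
state_numel = sum(t.numel() for s in optimizer.state.values()
                  for t in s.values() if torch.is_tensor(t))
print(n_params, state_numel)  # state_numel is about 2 * n_params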

Can you run even one iteration? Is the memory not being freed? You can use watch -n 0 nvidia-smi to check the memory in real time. If you can run a few iterations before getting out of memory, it means there is a mistake in the code.
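If it helps, you can also query the allocator from inside the script to compare the two GPUs (a small sketch using only torch.cuda; for example, call it once per iteration):

import torch

def log_gpu_memory(tag=''):
    # Memory PyTorch currently has allocated on each visible GPU, in MiB.
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**2
        peak = torch.cuda.max_memory_allocated(i) / 1024**2
        print('{} GPU {}: {:.0f} MiB allocated, {:.0f} MiB peak'.format(tag, i, allocated, peak))

If one device keeps climbing while the other stays flat, that points to the imbalanced usage mentioned above.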

I don't totally understand what you said, but I think you are correct.
I changed my optimizer to SGD. My GPU memory usage is imbalanced, just like you said, but I don't know how to avoid that.

Second thing: my code doesn't even run one iteration.

Third thing: is the memory not being freed? I think not. Below is my screenshot (top is the NVIDIA memory, below is my error). Please help me and guide me on what to do.

Could you post the whole code?

Thank you. Sure, below is my complete code:

import numpy as np
import pandas as pd
from PIL import Image
import torch
import torchvision
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from sklearn.preprocessing import MultiLabelBinarizer
import torch.backends.cudnn as cudnn
import visdom
vis = visdom.Visdom()
vis.delete_env('hckrerth_cpu') # Clear all old plots for this experiment (resets the environment)
vis = visdom.Visdom(env='hckrerth_cpu')
class KaggleAmazonDataset(Dataset):
    """Dataset wrapping images and target labels for Kaggle - Planet Amazon from Space competition.

    Arguments:
        A CSV file path
        Path to image folder
        Extension of images
        PIL transforms
    """

    def __init__(self, csv_path, img_path, img_ext, transform=None,train=True):
    
        tmp_df = pd.read_csv(csv_path)
        limit=len(tmp_df)
        if train:
            print("Loading Train Data")
            tmp_df=tmp_df.head(int(0.8*limit))
        else:
            print("Loadifn Test Data")
            tmp_df=tmp_df.tail(int(0.2*limit))
            tmp_df=tmp_df.reset_index(drop=True)

#         tmp_df['image_name']= tmp_df['image_name'].apply(lambda x: os.path.isfile(img_path + x)).all() 
#         print(tmp_df.head())
        #\"Some images referenced in the CSV file were not found"
        self.mlb = MultiLabelBinarizer()
        self.img_path = img_path
        self.img_ext = img_ext
        self.transform = transform

        self.X_train = tmp_df['image_name']
        self.y_train = tmp_df['tags']#self.mlb.fit_transform(tmp_df['tags'].str.split()).astype(np.float32)

    def __getitem__(self, index):
        img = Image.open(self.img_path + self.X_train[index])
        img = img.convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        
        label = (self.y_train[index])
        return img, label

    def __len__(self):
        return len(self.X_train.index)

task_test_accuracy=[]
def train(model, device, train_loader, optimizer, epochs,test_loader):
    options = dict(fillarea=True,width=400,height=400,xlabel='Batch_ID(Iterations)',ylabel='Loss',title='Train Loss')
    acc_options = dict(fillarea=True,width=400,height=400,xlabel='Epoch',ylabel='Accuracy',title='Test Acc')
    win = vis.line(X=np.array([0]),Y=np.array([0.7]),win='1',name='1',opts=options)
    model.train()
    total_batchid=0
    for epoch in range(0,epochs):
        running_loss = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = nn.CrossEntropyLoss()(output, target)
            loss.mean().backward()
            optimizer.step()
            running_loss += loss.item()
            if (batch_idx+1) % 10 == 0:
                vis.line(X=np.array([total_batchid+batch_idx]),Y=np.array([loss.item()]),win=win,update='append')
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))
       # if (batch_idx+1) % 1000 == 0:
        total_batchid=total_batchid+batch_idx
        torch.save(model.state_dict(),'saved_models/model_'+str(epoch)+'_'+str(loss)+'.pt')
        accuracy=test(model,device,test_loader)
        task_test_accuracy.append(accuracy)
        vis.bar(X=np.array(task_test_accuracy),win='ACC',opts=acc_options)

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            #print(output,target)
            test_loss += F.cross_entropy(output, target, reduction='sum').item()  # model outputs raw logits; match the training loss
            pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return 100. * correct / len(test_loader.dataset)


#transformations = transforms.Compose([transforms.ToTensor()])
transform = transforms.Compose([
    transforms.Resize(256),
   transforms.RandomSizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
])

dset = KaggleAmazonDataset('training-data/train.csv','training-data/train-images/','jpg',transform,train=True)
dset_test = KaggleAmazonDataset('training-data/train.csv','training-data/train-images/','jpg',transform,train=False)
train_loader = DataLoader(dset,batch_size=16,shuffle=True,num_workers=4)# pin_memory=True # CUDA only
test_loader = DataLoader(dset_test,batch_size=32,shuffle=False,num_workers=4)# pin_memory=True # CUDA only
use_cuda = torch.cuda.is_available()
model = torchvision.models.resnet18(pretrained=True)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 43)
if use_cuda:
    model.cuda()
    model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count()))
    cudnn.benchmark = True
device =torch.device("cuda" if use_cuda else "cpu")
#model = models.SimpleNet(num_classes=43).to(device)
#print(model)
optimizer = optim.SGD(model.parameters(), lr=0.01,momentum=0.9)


# for epoch in range(1, 3):
    #test(model,device,test_loader)
train(model, device, train_loader, optimizer, 50,test_loader)

https://www.kaggle.com/utsav15/image-food [Dataset]

Hello, I am not entirely sure about this, but when you execute:

if use_cuda:
   model.cuda()
   model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count()))
   cudnn.benchmark = True

you are storing the entire model on GPU 0 (the default CUDA device), but then (when using DataParallel) you are also storing a copy of the model on GPU 0 and GPU 1. Maybe that is why your GPU 0 is running out of memory so quickly. You should remove the model.cuda() line. Let me know if this helps.

Diego

Thank you for your answer.
Now it's not throwing the memory error, but it gives TypeError: Broadcast function not implemented for CPU tensors. I think the reason is that the model was never moved to CUDA memory.

Traceback (most recent call last):
  File "hckr_rnk.py", line 173, in <module>
    train(model, device, train_loader, optimizer, 50,test_loader)
  File "hckr_rnk.py", line 105, in train
    output = model(data)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 122, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 127, in replicate
    return replicate(module, device_ids)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/home/jmandivarapu1/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 13, in forward
    raise TypeError('Broadcast function not implemented for CPU tensors')
TypeError: Broadcast function not implemented for CPU tensors

I think you are correct. I overlooked one small part of your code; the DataParallel line should look like this:

   model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count())).cuda()

Thus moving the model to the GPU. You can also move the data and labels to the GPU like so:

data = data.cuda(async=True)
label = label.cuda(async=True)

Keep in mind that the .cuda() operation is not in place; that is why you need to keep track of the value it returns.
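Put together, the relevant pieces would look roughly like this (a sketch based on the code in this thread; newer PyTorch versions spell the flag non_blocking=True instead of async=True):

# (imports, train_loader, etc. as in the full script above)
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 43)
# Wrap the model, then move the wrapped module to the GPUs.
model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count())).cuda()

for data, target in train_loader:
    # .cuda() is not in place, so reassign the returned tensors.
    data = data.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    output = model(data)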


Thank you. Unfortunately it's not working. If I change my batch size to anything greater than 12, it keeps giving a CUDA out of memory error.

import numpy as np
import pandas as pd
from PIL import Image
import torch
import torchvision
from torch.utils.data.dataset import Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from sklearn.preprocessing import MultiLabelBinarizer
import torch.backends.cudnn as cudnn
import visdom
vis = visdom.Visdom()
vis.delete_env('hckrerth_cpu') # Clear all old plots for this experiment (resets the environment)
vis = visdom.Visdom(env='hckrerth_cpu')

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, self.expansion*planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out

class KaggleAmazonDataset(Dataset):
    """Dataset wrapping images and target labels for Kaggle - Planet Amazon from Space competition.

    Arguments:
        A CSV file path
        Path to image folder
        Extension of images
        PIL transforms
    """

    def __init__(self, csv_path, img_path, img_ext, transform=None,train=True):
    
        tmp_df = pd.read_csv(csv_path)
        limit=len(tmp_df)
        if train:
            print("Loading Train Data")
            tmp_df=tmp_df.head(int(0.8*limit))
        else:
            print("Loadifn Test Data")
            tmp_df=tmp_df.tail(int(0.2*limit))
            tmp_df=tmp_df.reset_index(drop=True)

#         tmp_df['image_name']= tmp_df['image_name'].apply(lambda x: os.path.isfile(img_path + x)).all() 
#         print(tmp_df.head())
        #\"Some images referenced in the CSV file were not found"
        self.mlb = MultiLabelBinarizer()
        self.img_path = img_path
        self.img_ext = img_ext
        self.transform = transform

        self.X_train = tmp_df['image_name']
        self.y_train = tmp_df['tags']#self.mlb.fit_transform(tmp_df['tags'].str.split()).astype(np.float32)

    def __getitem__(self, index):
        img = Image.open(self.img_path + self.X_train[index])
        img = img.convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        
        label = (self.y_train[index])
        return img, label

    def __len__(self):
        return len(self.X_train.index)

task_test_accuracy=[]
def train(model, device, train_loader, optimizer, epochs,test_loader):
    options = dict(fillarea=True,width=400,height=400,xlabel='Batch_ID(Iterations)',ylabel='Loss',title='Train Loss')
    acc_options = dict(fillarea=True,width=400,height=400,xlabel='Epoch',ylabel='Accuracy',title='Test Acc')
    win = vis.line(X=np.array([0]),Y=np.array([0.7]),win='1',name='1',opts=options)
    model.train()
    total_batchid=0
    for epoch in range(0,epochs):
        running_loss = 0.0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = nn.CrossEntropyLoss()(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if (batch_idx+1) % 10 == 0:
                vis.line(X=np.array([total_batchid+batch_idx]),Y=np.array([loss.item()]),win=win,update='append')
                print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))
       # if (batch_idx+1) % 1000 == 0:
        total_batchid=total_batchid+batch_idx
        torch.save(model.state_dict(),'saved_models/model_'+str(epoch)+'_'+str(loss)+'.pt')
        accuracy=test(model,device,test_loader)
        task_test_accuracy.append(accuracy)
        vis.bar(X=np.array(task_test_accuracy),win='ACC',opts=acc_options)

def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            #print(output,target)
            test_loss += F.cross_entropy(output, target, reduction='sum').item()  # model outputs raw logits; match the training loss
            pred = output.max(1, keepdim=True)[1] # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return 100. * correct / len(test_loader.dataset)


#transformations = transforms.Compose([transforms.ToTensor()])
transform = transforms.Compose([
    transforms.Resize(256),
   transforms.RandomSizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
])

dset = KaggleAmazonDataset('training-data/train.csv','training-data/train-images/','jpg',transform,train=True)
dset_test = KaggleAmazonDataset('training-data/train.csv','training-data/train-images/','jpg',transform,train=False)
train_loader = DataLoader(dset,batch_size=13,shuffle=True,num_workers=4)# pin_memory=True # CUDA only
test_loader = DataLoader(dset_test,batch_size=32,shuffle=False,num_workers=4)# pin_memory=True # CUDA only
use_cuda = torch.cuda.is_available()
model = torchvision.models.resnet18(pretrained=True)
num_ftrs = model.fc.in_features
model.fc = nn.Linear(num_ftrs, 43)
device =torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device is",device)
if use_cuda:
    #model.to(device)
    print("cahmp")
    model = torch.nn.DataParallel(model, device_ids=range(torch.cuda.device_count())).cuda()
    #cudnn.benchmark = True

#model = models.SimpleNet(num_classes=43).to(device)
#print(model)
optimizer = optim.SGD(model.parameters(), lr=0.01,momentum=0.9)


# for epoch in range(1, 3):
    #test(model,device,test_loader)
train(model, device, train_loader, optimizer, 50,test_loader)

Can you post the nvidia-smi output when the model is not running?

Sadly, after running this code the memory doesn't go back to 0, even though I am not running anything, and it won't let me run any of my other existing code (which was working previously).

When I run only this file, the memory starts from 0 and climbs up to 10257MiB. I don't know what to do.

This is not a PyTorch problem. It's a problem with Python multiprocessing: sometimes Python workers will not die if the script fails. So all you need to do is find these zombie Python processes and kill them.
You can do this like so:

ps -elf | grep python
kill -9  [pid]

with pid being the process id of each of the Python processes that the first command outputs. You can also use killall python, but I recommend the first option.

Thanks, I think what you said is correct. Basically, I killed almost all of the Jupyter notebooks, Python files, and everything else, and it's working now.

Thank you @Diego and @JuanFMontesinos for all your responses.

Glad I could help. :slight_smile: