Memory Leak because of dropout layer?

I'm training an MLP. A [70 6000 6000 4] network clearly overfits the data, so I tried adding dropout layers to prevent overfitting and ran into an out-of-memory (OOM) error.

My line of thought was as follows:

  1. Too many parameters, hence not fitting on a 12 GB Tesla K80 GPU.

  2. Tried multi-GPU training by replacing if cuda: model.cuda() with model = nn.DataParallel(model).cuda() (see the sketch after this list). Even two 12 GB Tesla K80 GPUs run out of memory.

  3. Tried training a very small model (below) on one GPU. It still ran out of memory, even though this network is smaller than a typical MNIST MLP.
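
For reference, the multi-GPU change in step 2 looked roughly like this (a minimal sketch; Net and the cuda flag are defined further down in this post):

model = Net()
if cuda:
    # single GPU: model.cuda()
    # multi GPU: replicate the model across all visible devices
    model = nn.DataParallel(model).cuda()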

Specs:

  1. Windows Server
  2. CUDA 8
  3. Python 3.6.1 (conda install)
  4. PyTorch v0.4.0
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bn1 = nn.BatchNorm1d(70)
        self.fc1 = nn.Linear(70, 100)
        self.d1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(100, 100)
        self.d2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(100, 4)

    def forward(self, x):
        x = x.view(-1, 70)
        x = self.bn1(x)
        x = F.relu(self.fc1(x))
        x = self.d1(x)
        x = F.relu(self.fc2(x))
        x = self.d2(x)
        return F.log_softmax(self.fc3(x))


model = Net()
if cuda:
    model.cuda()

optimizer = optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), weight_decay=0.0)

I checked discussions of a few other reported memory-leak cases:

  1. Added torch.backends.cudnn.enabled = False after all the import statements. It didn't help.
  2. Set volatile=True during validation.

Any fixes?


Is your model with dropout working at all, or are you getting the OOM error all the time now?
Maybe a process is still alive and using all of your GPU memory? Could you check that? I don't know if nvidia-smi works on a Windows machine.
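
As an alternative, you could check the memory usage from inside Python. A minimal sketch using the memory statistics added in 0.4.0 (these only report what PyTorch itself has allocated on the default device):

import torch

# bytes currently occupied by tensors on the current GPU
print(torch.cuda.memory_allocated())
# peak usage since the start of the process
print(torch.cuda.max_memory_allocated())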

The volatile flag is deprecated. Since you are using PyTorch 0.4.0, you should use with torch.no_grad() instead.
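
A minimal sketch of the replacement (assuming a model and validation_loader like the ones posted later in this thread):

# pre-0.4.0: data = Variable(data, volatile=True)
# 0.4.0+: wrap the whole validation loop instead
with torch.no_grad():
    for data, target in validation_loader:
        output = model(data)  # no autograd graph is built here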

Did you observe the memory usage? Is it growing once you add dropout to your model?

I get OOM when I add dropout layers. Without dropout, I could fit the [70 6000 6000 4] model on one GPU. (Any model with a dropout layer runs into OOM.)

I use os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" and os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" to run on specific GPUs. And no, these GPUs use zero memory when no experiments are running; I checked via nvidia-smi.

Yes, the memory keeps growing when I train models with a dropout layer. I've trained bigger models (with more parameters) on a single 12 GB GPU, but a model as small as the one above runs into OOM.

Noted


I see that you are missing the if __name__ == '__main__' protection for the main process. Try again with that and see if it helps.

You mean like this: if cuda: model=nn.DataParallel(model).cuda() ?

Tried. Didn’t help

No, I mean you should protect your code with the if __name__ == '__main__' idiom on Windows, like this:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bn1 = nn.BatchNorm1d(70)
        self.fc1 = nn.Linear(70, 100)
        self.d1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(100, 100)
        self.d2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(100, 4)

    def forward(self, x):
        x = x.view(-1, 70)
        x = self.bn1(x)
        x = F.relu(self.fc1(x))
        x = self.d1(x)
        x = F.relu(self.fc2(x))
        x = self.d2(x)
        return F.log_softmax(self.fc3(x))

if __name__ == '__main__':
    model = Net()
    if cuda:
        model.cuda()

    optimizer = optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), weight_decay=0.0)
    # and more code that is on the outer part without a protection

Memory usage increases at every iteration, in every epoch (checked via nvidia-smi).

This is my main function now. I'm running it from a Jupyter notebook.

I tried it both with and without the data loaders inside main:

if __name__ == '__main__':
    
    train_data = torch.utils.data.TensorDataset(features_t,labels_t)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size, shuffle=True)

    val_data = torch.utils.data.TensorDataset(features_v,labels_v)
    validation_loader = torch.utils.data.DataLoader(val_data, batch_size, shuffle=False)
    
    model = Net()
    if cuda:
        model = nn.DataParallel(model).cuda()

    optimizer = optim.Adam(model.parameters(), lr=0.00001,betas=(0.9, 0.999), weight_decay=0.0)
    
    # %%time  (IPython cell magic; only valid as the first line of a notebook cell)
    epochs = 6000

    losst, acct = [], []
    lossv, accv = [], []
    best_acc = 0.0
    for epoch in range(1, epochs + 1):

        train(epoch, losst, acct)
        a = validate(lossv, accv)
        print("Mean Class Acc" + str(a))

        is_best = a > best_acc
        best_acc = max(a, best_acc)
        save_checkpoint({
            'epoch': epoch + 1,
            'net':model,
            'state_dict': model.state_dict(),
            'best_prec1': best_acc,
            'optimizer' : optimizer.state_dict(),
        }, is_best)

@ash_gamma Could you please also post the train and validate functions? In 0.4.0, you should remember to use with torch.no_grad(): during inference.

def train(epoch, loss_vector_train, accuracy_vector_train ,log_interval=1000):
    model.train()
    train_loss, correct = 0, 0
    for batch_idx, (data, target) in enumerate(train_loader):
        if cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        
        loss = F.nll_loss(output, target, weight=weights.cuda())
        train_loss += loss
        pred = output.data.max(1)[1]
        
        correct += pred.eq(target.data).cpu().sum()
        loss.backward()
        optimizer.step()
        
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
            epoch, batch_idx * len(data), len(train_loader.dataset),
            100. * batch_idx / len(train_loader), loss.data[0]))
            
        
    train_loss /= len(train_loader)
    loss_vector_train.append(train_loss)
    accuracy = 100. * correct / len(train_loader.dataset)
    accuracy_vector_train.append(accuracy)
    print(str(train_loss) + " " + str(accuracy))

def validate(loss_vector, accuracy_vector):
    model.eval()
    val_loss, correct = 0, 0
    confusion_matrix = torchnet.meter.ConfusionMeter(4)
    with torch.no_grad():
        for data, target in validation_loader:
            if cuda:
                data, target = data.cuda(), target.cuda()
            data, target = Variable(data), Variable(target)
            output = model(data)
            confusion_matrix.add(output.data, target.data)
            val_loss += F.nll_loss(output, target, weight=weights.cuda()).data[0]
            pred = output.data.max(1)[1] 
            correct += pred.eq(target.data).cpu().sum()

    val_loss /= len(validation_loader)
    loss_vector.append(val_loss)

    accuracy = 100. * correct / len(validation_loader.dataset)
    accuracy_vector.append(accuracy)
    cm = confusion_matrix.conf
    d = np.diag(cm)
    s = np.sum(cm,1)
    mean_class_acc = np.mean(d/s)
    print('\nValidation set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        val_loss, correct, len(validation_loader.dataset), accuracy))
    
    return mean_class_acc

The use of train_loss += loss may be the cause. Try train_loss += loss.item() instead.
Please refer to the latest MNIST example to update your code.
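
A minimal sketch of the difference, using the names from the posted train():

# leaks: loss is a 0-dim tensor that is still attached to the autograd graph,
# so train_loss ends up keeping a reference to every iteration's graph
train_loss += loss

# fixed: .item() returns a plain Python float, so each graph can be freed
train_loss += loss.item()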


I’ll try these changes.

But just so you know, I'm able to run training scripts without running into OOM when I don't have a dropout layer. Shouldn't train_loss += loss be a problem even without dropout, then?


I don't think the problem is on the dropout side. It is used in so many examples in the PyTorch repo that users would have reported it by now, since it's almost a month after the last release. And before the release, we ran several benchmarks on various networks, including AlexNet, which has dropout in it.

This was the issue! Thank you :)

Interestingly, the same thing happened to me today.

The memory leak was also caused by a bad ‘+=’ between a float and a scalar tensor.
But the problem only appeared when I added a dropout layer.

Same issue for me, and it can be fixed by peterjc123's method. It is really weird :)