Memory Leak because of dropout layer?

I'm training an MLP. A [70 6000 6000 4] network clearly overfits the data, so I tried adding dropout layers to prevent overfitting and ran into an out-of-memory (OOM) error.

My line of thought was as follows:

  1. Too many parameters, hence not fitting on a 12 GB Tesla K80 GPU.

  2. Tried multi-GPU training by replacing if cuda: model.cuda() with model = nn.DataParallel(model).cuda() (see the sketch after this list). Even two 12 GB Tesla K80 GPUs run out of memory.

  3. Tried training a very small model (below) on one GPU. It still ran out of memory, even though this network is smaller than a typical MNIST MLP.
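
For reference, the multi-GPU change in step 2 looked roughly like this (a minimal sketch; Net and the cuda flag are defined further down in this post):

model = Net()
if cuda:
    # single GPU: model.cuda()
    # multi GPU: replicate the model across all visible devices
    model = nn.DataParallel(model).cuda()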

Specs:

  1. Windows Server
  2. CUDA 8
  3. Python 3.6.1 (conda install)
  4. PyTorch v0.4.0
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bn1 = nn.BatchNorm1d(70)
        self.fc1 = nn.Linear(70, 100)
        self.d1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(100, 100)
        self.d2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(100, 4)

    def forward(self, x):
        x = x.view(-1, 70)
        x = self.bn1(x)
        x = F.relu(self.fc1(x))
        x = self.d1(x)
        x = F.relu(self.fc2(x))
        x = self.d2(x)
        return F.log_softmax(self.fc3(x))


model = Net()
if cuda:
    model.cuda()

optimizer = optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), weight_decay=0.0)

I checked discussions of a few other reported memory-leak cases:

  1. Added torch.backends.cudnn.enabled = False after all the import statements. It didn't help.
  2. Set volatile=True during validation.

Any fixes?


Is your model with dropout working at all, or are you getting the OOM error all the time now?
Maybe a process is still alive and using all of your GPU memory? Could you check that? I don't know if nvidia-smi works on a Windows machine.
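
As an alternative, you could check the memory usage from inside Python. A minimal sketch using the memory statistics added in 0.4.0 (these only report what PyTorch itself has allocated on the default device):

import torch

# bytes currently occupied by tensors on the current GPU
print(torch.cuda.memory_allocated())
# peak usage since the start of the process
print(torch.cuda.max_memory_allocated())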

The volatile flag is deprecated. Since you are using PyTorch 0.4.0, you should use with torch.no_grad() instead.
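
A minimal sketch of the replacement (assuming a model and validation_loader like the ones posted later in this thread):

# pre-0.4.0: data = Variable(data, volatile=True)
# 0.4.0+: wrap the whole validation loop instead
with torch.no_grad():
    for data, target in validation_loader:
        output = model(data)  # no autograd graph is built here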

Did you observe the memory usage? Is it growing once you add dropout to your model?

I get OOM when I add dropout layers. Without dropout, I could fit the [70 6000 6000 4] model on one GPU. (Any model with a dropout layer runs into OOM.)

I use os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" and os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" to run on specific GPUs. And no, these GPUs use zero memory when no experiments are running; I checked via nvidia-smi.

Yes, the memory keeps growing when I train models with a dropout layer. I've trained bigger models (with more parameters) on a single 12 GB GPU, but a model as small as the one above runs into OOM.

Noted


I see that you are missing the if __name__ == '__main__' protection for the main process. Try again with that and see if it helps.

You mean like this: if cuda: model=nn.DataParallel(model).cuda() ?

Tried. Didn’t help

No, I mean you should protect your code with the if __name__ == '__main__' idiom on Windows, like this:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.bn1 = nn.BatchNorm1d(70)
        self.fc1 = nn.Linear(70, 100)
        self.d1 = nn.Dropout(0.5)
        self.fc2 = nn.Linear(100, 100)
        self.d2 = nn.Dropout(0.5)
        self.fc3 = nn.Linear(100, 4)

    def forward(self, x):
        x = x.view(-1, 70)
        x = self.bn1(x)
        x = F.relu(self.fc1(x))
        x = self.d1(x)
        x = F.relu(self.fc2(x))
        x = self.d2(x)
        return F.log_softmax(self.fc3(x))

if __name__ == '__main__':
    model = Net()
    if cuda:
        model.cuda()

    optimizer = optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), weight_decay=0.0)
    # and more code that is on the outer part without a protection

Memory usage increases at every iteration, in every epoch (checked via nvidia-smi).

This is my main function now. I'm running it from a Jupyter notebook.

I tried it both with and without the data loaders inside main:

if __name__ == '__main__':
    
    train_data = torch.utils.data.TensorDataset(features_t,labels_t)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size, shuffle=True)

    val_data = torch.utils.data.TensorDataset(features_v,labels_v)
    validation_loader = torch.utils.data.DataLoader(val_data, batch_size, shuffle=False)
    
    model = Net()
    if cuda:
        model = nn.DataParallel(model).cuda()

    optimizer = optim.Adam(model.parameters(), lr=0.00001,betas=(0.9, 0.999), weight_decay=0.0)
    
    # %%time  (IPython cell magic; only valid as the first line of a notebook cell)
    epochs = 6000

    losst, acct = [], []
    lossv, accv = [], []
    best_acc = 0.0
    for epoch in range(1, epochs + 1):

        train(epoch, losst, acct)
        a = validate(lossv, accv)
        print("Mean Class Acc" + str(a))

        is_best = a > best_acc
        best_acc = max(a, best_acc)
        save_checkpoint({
            'epoch': epoch + 1,
            'net':model,
            'state_dict': model.state_dict(),
            'best_prec1': best_acc,
            'optimizer' : optimizer.state_dict(),
        }, is_best)

@ash_gamma Could you please also post the train and validate functions? In 0.4.0, you should remember to use with torch.no_grad(): during inference.

def train(epoch, loss_vector_train, accuracy_vector_train ,log_interval=1000):
    model.train()
    train_loss, correct = 0, 0
    for batch_idx, (data, target) in enumerate(train_loader):
        if cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        
        loss = F.nll_loss(output, target, weight=weights.cuda())
        train_loss += loss
        pred = output.data.max(1)[1]
        
        correct += pred.eq(target.data).cpu().sum()
        loss.backward()
        optimizer.step()
        
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
            epoch, batch_idx * len(data), len(train_loader.dataset),
            100. * batch_idx / len(train_loader), loss.data[0]))
            
        
    train_loss /= len(train_loader)
    loss_vector_train.append(train_loss)
    accuracy = 100. * correct / len(train_loader.dataset)
    accuracy_vector_train.append(accuracy)
    print(str(train_loss) + " " + str(accuracy))

def validate(loss_vector, accuracy_vector):
    model.eval()
    val_loss, correct = 0, 0
    confusion_matrix = torchnet.meter.ConfusionMeter(4)
    with torch.no_grad():
        for data, target in validation_loader:
            if cuda:
                data, target = data.cuda(), target.cuda()
            data, target = Variable(data), Variable(target)
            output = model(data)
            confusion_matrix.add(output.data, target.data)
            val_loss += F.nll_loss(output, target, weight=weights.cuda()).data[0]
            pred = output.data.max(1)[1] 
            correct += pred.eq(target.data).cpu().sum()

    val_loss /= len(validation_loader)
    loss_vector.append(val_loss)

    accuracy = 100. * correct / len(validation_loader.dataset)
    accuracy_vector.append(accuracy)
    cm = confusion_matrix.conf
    d = np.diag(cm)
    s = np.sum(cm,1)
    mean_class_acc = np.mean(d/s)
    print('\nValidation set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        val_loss, correct, len(validation_loader.dataset), accuracy))
    
    return mean_class_acc

The use of train_loss += loss may be the cause. Try train_loss += loss.item() instead.
Please refer to the latest MNIST example to update your code.
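
A minimal sketch of the difference, using the names from the posted train():

# leaks: loss is a 0-dim tensor that is still attached to the autograd graph,
# so train_loss ends up keeping a reference to every iteration's graph
train_loss += loss

# fixed: .item() returns a plain Python float, so each graph can be freed
train_loss += loss.item()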


I’ll try these changes.

But just so you know, I'm able to run training scripts without running into OOM when I don't have a dropout layer. Shouldn't train_loss += loss be a problem even without dropout, then?


I don't think the problem is on the dropout side. It is used in so many examples in the PyTorch repo that users would have reported it by now, since it's almost a month after the last release. And before the release, we ran several benchmarks on various networks, including AlexNet, which has dropout in it.

This was the issue! Thank you :)

Interestingly, the same thing happened to me today.

The memory leak was also caused by a bad ‘+=’ between a float and a scalar tensor.
But the problem only appeared when I added a dropout layer.

Same issue for me, and it can be fixed by peterjc123's method. It is really weird :)