OOM: Multi-label classification

I am training a VGG-16 model on a GTX 1080 (8 GB of memory) for multi-label classification on the MSCOCO dataset. Memory usage for batch sizes of 21, 22, 23, and 24 is 4533 MB, 4597 MB, 4681 MB, and 4759 MB, respectively. However, for a batch size of 25 or larger, CUDA runs out of memory. The GPU has 8 GB of memory, so roughly half of it should still be free. I am confused about why it runs out of memory.
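As a rough sanity check on these numbers, the sketch below extrapolates the readings linearly to larger batch sizes. The readings come from the measurements above; the 8192 MB total is a nominal figure I am assuming for an 8 GB card, and the linear model is only an approximation.

```python
# Memory readings reported above: batch size -> usage in MB
usage_mb = {21: 4533, 22: 4597, 23: 4681, 24: 4759}

sizes = sorted(usage_mb)
# Extra memory used by each additional sample in the batch
deltas = [usage_mb[b] - usage_mb[a] for a, b in zip(sizes, sizes[1:])]
per_sample_mb = sum(deltas) / len(deltas)

def extrapolate(batch_size):
    """Linearly extrapolate usage from the largest measured batch size."""
    last = sizes[-1]
    return usage_mb[last] + per_sample_mb * (batch_size - last)

TOTAL_MB = 8 * 1024  # assumption: nominal capacity of an 8 GB card

for b in (25, 40):
    print(f"batch {b}: ~{extrapolate(b):.0f} MB of {TOTAL_MB} MB")
```

If steady-state usage really grew linearly, both batch sizes would fit, which suggests the failure comes from something these readings do not capture.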

My training code is below:

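import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from tqdm import tqdm

def train(model, dataset, learning_rate=0.001, batch_size=22, epochs=2):
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    criterion = nn.MultiLabelSoftMarginLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    print_every = 20
    for epoch in range(epochs):
        running_loss = 0.0
        # Iterate over the loader directly; next(iter(dataloader)) would rebuild
        # the iterator (and its worker processes) on every step.
        batch_bar = tqdm(dataloader)
        for i, (inputs, labels) in enumerate(batch_bar):
            # move the batch to the GPU
            inputs = inputs.to('cuda:0')
            labels = labels.to('cuda:0')
            # forward + backward + optimize
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # .item() returns a Python float, so the graph is not kept alive
            running_loss += loss.item()
            if (i + 1) % print_every == 0:
                batch_bar.set_description(f'epoch {epoch + 1} loss {running_loss / print_every:.3f}')
                running_loss = 0.0

# Finally train the model
train(end2end, trainDataset, batch_size=40)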

I have a similar but more general problem: How to profile memory in PyTorch

In my experience, nvidia-smi shows only the memory in use at the moment it samples. Some part of the computation may cause memory to spike and then be freed quickly, and because nvidia-smi refreshes slowly, you might never see that spike.
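One way to catch such a short-lived spike is to ask PyTorch for its own peak counter instead of polling nvidia-smi. Below is a minimal sketch, assuming a PyTorch version that provides torch.cuda.reset_peak_memory_stats; peak_memory_mb and demo_step are hypothetical names of mine, and demo_step is only a stand-in for one real training iteration.

```python
import torch
import torch.nn as nn

def peak_memory_mb(step_fn, device="cuda:0"):
    """Run step_fn once and return the peak memory (in MB) that PyTorch
    allocated for tensors on `device`. Returns None without CUDA."""
    if not torch.cuda.is_available():
        return None
    dev = torch.device(device)
    torch.cuda.synchronize(dev)              # finish pending work first
    torch.cuda.reset_peak_memory_stats(dev)  # start a fresh peak measurement
    step_fn()
    torch.cuda.synchronize(dev)              # wait for the step's kernels
    return torch.cuda.max_memory_allocated(dev) / 1024 ** 2

def demo_step():
    # Hypothetical stand-in for one training iteration
    layer = nn.Linear(1024, 1024).to("cuda:0")
    x = torch.randn(64, 1024, device="cuda:0")
    layer(x).sum().backward()

peak = peak_memory_mb(demo_step)
print("peak MB:", peak)
```

Note that this counter covers only tensors managed by PyTorch's allocator. nvidia-smi also includes the CUDA context and memory the allocator has cached, so the two numbers will not match exactly; torch.cuda.max_memory_reserved reports the cached peak.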