Why is my gradient accumulation failing?

For debugging I switched to a simple example based on the PyTorch "Training a Classifier" tutorial.

import torch
import torchvision
import torchvision.transforms as transforms
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()



import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


# accumulate the model outputs and labels across iterations
inp = torch.empty(0, 10)
lab = torch.empty(0, dtype=torch.int64)

for epoch in range(2):  # loop over the dataset multiple times
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        outputs = net(inputs)
        inp = torch.cat((inp, outputs), dim=0)
        lab = torch.cat((lab, labels), dim=0)
        if i % 101 == 100:
            # zero the parameter gradients
            optimizer.zero_grad()
            # backward + optimize on the accumulated outputs
            loss = criterion(inp, lab)
            loss.backward()
            optimizer.step()
            print(loss)


print('Finished Training')

I get this error:

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

Why am I getting this error? I can fix it by adding loss.backward(retain_graph=True).

But then I get a new error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [84, 10]], which is output 0 of AsStridedBackward0, is at version 2521; expected version 2520 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

That's not a fix and will just raise the other error you mentioned.
You are running into:

RuntimeError: Trying to backward through the graph a second time ...

because you are accumulating the computation graphs across iterations: inp and lab are never reset, so the next backward call would have to go through graphs whose saved intermediate values were already freed by the previous backward.
Reset inp and lab after backward was called and it should work.
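
A minimal sketch of that reset applied to your example (same variable names and the same 101-iteration interval, everything else unchanged):

inp = torch.empty(0, 10)
lab = torch.empty(0, dtype=torch.int64)

for epoch in range(2):
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data

        outputs = net(inputs)
        inp = torch.cat((inp, outputs), dim=0)
        lab = torch.cat((lab, labels), dim=0)
        if i % 101 == 100:
            optimizer.zero_grad()
            loss = criterion(inp, lab)
            loss.backward()
            optimizer.step()
            print(loss)
            # start fresh buffers; the old graphs were freed by backward()
            inp = torch.empty(0, 10)
            lab = torch.empty(0, dtype=torch.int64)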

Thanks a lot! This fixes my example code, but unfortunately my original code already had that reset in; I just forgot to copy it into this example.

As an aside, in my sample code the loss is now barely decreasing. Do you know why that might be?

New code:


out = torch.empty(0, 10)
lab = torch.empty(0, dtype=torch.int64)
net = Net()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

accum = 4

for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        outputs = net(inputs)
        out = torch.cat((out, outputs), dim=0)
        lab = torch.cat((lab, labels), dim=0)
        if i % accum == accum - 1:
            optimizer.zero_grad()
            loss = criterion(out, lab)
            loss.backward()
            optimizer.step()
            out = torch.empty(0, 10)
            lab = torch.empty(0, dtype=torch.int64)

            running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000}')
            running_loss = 0.0



print('Finished Training')

My evaluation shows a very low accuracy after doing this. (When I directly increase the batch size in the data loader without using gradient accumulation, everything works just fine.) Do you see any problems?

When using gradient accumulation, you can adjust your learning rate upward. Try an lr of 0.1 or 0.05.

Why do I need to adjust my learning rate for gradient accumulation but not for a larger batch size? I understand that there are fewer optimization steps with a larger batch size, so I am comparing accumulation + small batch against no accumulation + large batch.

You can also increase the learning rate for a larger batch size. Any time each update is computed from a larger, more representative sample of the data, you can typically increase the learning rate proportionally.
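
For example, one way to read "proportionally" (this is just the common linear-scaling heuristic with hypothetical numbers, a starting point to tune rather than a rule):

base_lr = 0.001      # lr that worked for batch_size = 4
base_batch = 4
accum = 4            # effective batch size = base_batch * accum = 16

# linear scaling heuristic: grow the lr with the effective batch size
lr = base_lr * (base_batch * accum) / base_batch

optimizer = optim.SGD(net.parameters(), lr=lr, momentum=0.9)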