RuntimeError: Trying to backward through the graph a second time [...]

Hi, I am trying to build an LSTM model. I am encountering this error:

File "spam_detection_LSTM_01.py", line 131, in <module>
    loss.backward()
  File "/gpfs/software/Anaconda/envs/pytorch-latest/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/gpfs/software/Anaconda/envs/pytorch-latest/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.
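
If I understand the message correctly, it is about reusing a computation graph whose saved buffers have already been freed. A minimal, stand-alone snippet (unrelated to my model, just my understanding of the error) that reproduces it would be:

import torch

x = torch.randn(4, requires_grad=True)
y = (x ** 2).sum()   # the pow node saves x for the backward pass

y.backward()         # first backward frees the saved intermediate values
y.backward()         # raises: "Trying to backward through the graph a second time"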

I have read several threads here with possible fixes and I have tried all of them already (hopefully correctly), but the error still persists. My model class currently looks like this:

class Network(nn.Module):
    def __init__(self, input_size=40, hidden_size=10, num_layers=1, batch_first=True):
        super(Network, self).__init__()
        self.hidden_size = hidden_size
        # our LSTM model
        self.LSTM = LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=batch_first)
        # final fully connected layers
        self.fc1 = Linear(10, 5)
        self.fc2 = Linear(5, 1)
    
    def forward(self, x):
        h0 = torch.zeros(1, x.size(0), self.hidden_size)
        c0 = torch.zeros(1, x.size(0), self.hidden_size)
        output, hidden = self.LSTM(x, (h0.detach(), c0.detach()))
        x = F.relu(self.fc1(output[:,-1,:].view(x.size(0), -1)))
        x = torch.sigmoid(self.fc2(x))
        return x.flatten(), hidden

I have tried many variations (for instance, not including the ".detach()" calls), but I still get the same error. My training loop looks like this:

start = time.time()
# training
for epoch in range(epochs):
    running_loss = 0.0
    # hidden = model.init_hidden(batch_size)
    for i, data in enumerate(train_loader, 0):
        current = time.time()
        print(i, (current-start)/60)
        inputs, labels = data

        outputs, hidden = model(inputs)
        loss = criterion(outputs, labels.float())
        # running_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

print('Finished Training')

Does anyone know what’s wrong with my code? Where am I “trying to backward through the graph a second time”? Thanks in advance!

Your code works fine for me after adding the missing pieces:

class Network(nn.Module):
    def __init__(self, input_size=40, hidden_size=10, num_layers=1, batch_first=True):
        super(Network, self).__init__()
        self.hidden_size = hidden_size
        # our LSTM model
        self.LSTM = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=num_layers, batch_first=batch_first)
        # final fully connected layers
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)
    
    def forward(self, x):
        h0 = torch.zeros(1, x.size(0), self.hidden_size)
        c0 = torch.zeros(1, x.size(0), self.hidden_size)
        output, hidden = self.LSTM(x, (h0.detach(), c0.detach()))
        x = F.relu(self.fc1(output[:,-1,:].view(x.size(0), -1)))
        x = torch.sigmoid(self.fc2(x))
        return x, hidden


model = Network()
N = 160
x = torch.randn(N, 20, 40)
dataset = TensorDataset(x, torch.randint(0, 2, (N, 1)))
train_loader = DataLoader(dataset, batch_size=8)

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(100):
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
    
        outputs, hidden = model(inputs)
        loss = criterion(outputs, labels.float())
    
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("epoch {}, loss: {}".format(epoch, loss.item()))

print('Finished Training')

# epoch 0, loss: 0.708498477935791
# epoch 1, loss: 0.7076647877693176
# epoch 2, loss: 0.7070136070251465
# ...
# epoch 97, loss: 0.0009400603594258428
# epoch 98, loss: 0.0009146835654973984
# epoch 99, loss: 0.000890372262801975
# Finished Training
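
For completeness: with LSTMs this error usually shows up when the hidden state returned by one iteration is fed into the next forward pass without being detached, so the new loss tries to backpropagate through the previous iteration's already-freed graph. A rough sketch of that pattern (hypothetical, not your code) and the fix:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=40, hidden_size=10, batch_first=True)
state = None
for step in range(2):
    x = torch.randn(8, 20, 40)
    out, state = lstm(x, state)   # `state` carries the graph of the previous step
    loss = out.sum()
    loss.backward()               # fails on the second iteration
    # fix: cut the graph between iterations
    # state = tuple(s.detach() for s in state)

Since your forward() creates fresh zero states for every call, this pattern does not apply to your model, which is why it runs without issues here.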

Thank you for the quick reply. The model was actually fine, as you noted. The issue originated from the type of my data: I had tried to write non-torch data directly into torch tensors. Running your code without any error convinced me that neither the model nor the training loop was at fault. Thanks again!
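
In case anyone runs into the same thing: the fix on my side was simply to convert the raw data to tensors properly before building the dataset. Roughly like this (the shapes and names below are just placeholders for my actual features and labels):

import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

features = np.random.rand(160, 20, 40).astype(np.float32)   # stand-in for the real feature array
labels = np.random.randint(0, 2, size=160)                   # stand-in for the spam/ham labels

inputs = torch.from_numpy(features)           # numpy array -> float32 tensor
targets = torch.from_numpy(labels).float()    # integer labels -> float tensor for BCELoss

dataset = TensorDataset(inputs, targets)
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)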
