GPU Training becomes very slow after a few iterations

Hi, I’m training a model consisting of an nn.LSTM feeding into an nn.Linear, which I use for a regression problem. However, training becomes really slow even within the first epoch. I think my problem is that I’m using retain_graph=True and that the graph attached to my hidden state grows at each iteration, since I feed the hidden state back into my model every iteration. If I don’t use retain_graph=True, it throws the error “Trying to backward through the graph a second time, but the buffers have already been freed”. Since I’m new to RNNs I’m a bit confused: is it supposed to be like this, or am I missing something in my approach?

Here’s the relevant code:

def learn(X, y, hidden):
    model.zero_grad()

    output, hidden = model(X, hidden)

    loss = criterion(output, y)

    optimizer.zero_grad()
    loss.backward(retain_graph=True)
    optimizer.step()

    return loss, output, hidden

hidden = None
for epoch in range(epochs):
    loss_avg = 0
    for i, (X, y) in enumerate(dataloader):

        X = X.to(device)
        y = y.to(device)

        loss, output, hidden = learn(X, y, hidden)
        ...

And the definition of my model follows:

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size=1, num_layers=1, batch_size=1):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size
        self.input_size = input_size
        self.batch_size = batch_size

        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)

        self.mlp = nn.Linear(hidden_size, output_size)

    def forward(self, X, H=None):

        out, hidden = self.lstm(X, H)
        out = self.mlp(out[-1])
        return out, hidden

EDIT (Solved):
The problem was that I was unintentionally backpropagating through the hidden state across iterations, so its computation graph kept growing and was saved at every step. I solved this by detaching the hidden state before returning it from my learn function.
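For anyone hitting the same thing, here is roughly what the fixed learn function looks like. This is a minimal sketch using the same names as my code above, and it assumes the hidden state is the usual (h, c) tuple that nn.LSTM returns:

def learn(X, y, hidden):
    model.zero_grad()

    output, hidden = model(X, hidden)

    loss = criterion(output, y)

    optimizer.zero_grad()
    loss.backward()  # retain_graph=True is no longer needed
    optimizer.step()

    # nn.LSTM returns hidden as a (h, c) tuple; detach both tensors so the
    # next iteration starts from a fresh graph instead of growing the old one
    hidden = tuple(h.detach() for h in hidden)

    return loss, output, hidden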


Hello, I appear to be running into the same problem. I’ve tried calling detach() on both the hidden state and the model output that I return, to no effect. What exactly did you find helpful to detach?