Very slow training on GPU for LSTM NLP multiclass classification

Hi,

The training step of my LSTM network takes 15+ minutes just for the first epoch. It seems I made a mistake somewhere.

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import mean_squared_error

def train_model(model, epochs=epochs_default_number, lr=lr_default_value):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y, l in train_dl:
            x = x.long()
            y = y.long()
            y_pred = model(x, l)
            loss = F.cross_entropy(y_pred, y.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        val_loss, val_acc, val_rmse = validation_metrics(model, val_dl)
        if i % 5 == 1:
            print("train loss %.3f, val loss %.3f, val accuracy %.3f, and val rmse %.3f" % (sum_loss/total, val_loss, val_acc, val_rmse))
def validation_metrics(model, valid_dl):
    model.eval()
    correct = 0
    total = 0
    sum_loss = 0.0
    sum_rmse = 0.0
    for x, y, l in valid_dl:
        x = x.long()
        y = y.long()
        y_hat = model(x, l)
        loss = F.cross_entropy(y_hat, y.to(device))
        pred = torch.max(y_hat, 1)[1]
        correct += (pred == y.to(device)).float().sum()
        total += y.shape[0]
        sum_loss += loss.item()*y.shape[0]
        sum_rmse += np.sqrt(mean_squared_error(pred.cpu(), y.unsqueeze(-1)))*y.shape[0]
    return sum_loss/total, correct/total, sum_rmse/total
class LSTM_fl(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, number_of_layers):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=number_of_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, number_of_output_classes)
        self.dropout = nn.Dropout(dropout_value)

    def forward(self, x, l):
        x = self.embeddings(x.to(device))
        x = self.dropout(x.to(device))
        lstm_out, (ht, ct) = self.lstm(x.to(device))
        return self.linear(ht[-1])
model_fixed = LSTM_fl(vocab_size, embedding_dim_value, hidden_dim_value, number_of_layers_value)
model_fixed = model_fixed.to(device)
train_model(model_fixed, epochs=epochs_number, lr=lr_value)

The values used are (I played with lr and batch_size but without success):
lr_value = 0.00001
lr_default_value = 0.001
epochs_number = 30
embedding_dim_value = 50
hidden_dim_value = 50
batch_size_value = 8
number_of_layers_value = 2
dropout_value = 0.2

Does anyone have advice on how to speed it up?

Thanks a lot

Well, the training time per epoch depends on your dataset size. If you have a large dataset, 15 min might be the expected time. Or do you mean that each batch is slow?

You can certainly increase the batch size to 32 or 64; this should give you a performance boost. In this case, you might also want to increase the learning rate a bit as well.
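For example, a minimal sketch of how the loaders could be created with a larger batch (assuming train_ds / val_ds are your existing Dataset objects and collate_batch is whatever collate function you already use; these names are just placeholders):

from torch.utils.data import DataLoader

# sketch only: train_ds, val_ds and collate_batch are placeholders for your own objects
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True, collate_fn=collate_batch)
val_dl = DataLoader(val_ds, batch_size=64, shuffle=False, collate_fn=collate_batch)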

You also use a lot of unnecessary to() calls. Once a tensor is on the device, it stays there. For example, you can do

x = x.long().to(device)
y = y.long().to(device)

and remove all other to() calls, particularly in the forward() method. Admittedly, I have no idea whether this affects performance at all. Probably not.

Otherwise, I can’t see anything wrong with your code, and it seems to train. Other things to consider might be the data, e.g., in case you have very long sequences.


Hi,
thank you for your time and suggestions.
The new version of the code looks like this:


def train_model(model, epochs=epochs_default_number, lr=lr_default_value):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y, l in train_dl:
            x = x.long().to(device)
            y = y.long().to(device)
            y_pred = model(x, l)

            optimizer.zero_grad()
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        val_loss, val_acc, val_rmse = validation_metrics(model, val_dl)
        if i % 5 == 1:
            print("train loss %.3f, val loss %.3f, val accuracy %.3f, and val rmse %.3f" % (sum_loss/total, val_loss, val_acc, val_rmse))
def validation_metrics(model, valid_dl):
    model.eval()
    correct = 0
    total = 0
    sum_loss = 0.0
    sum_rmse = 0.0
    for x, y, l in valid_dl:
        x = x.long().to(device)
        y = y.long().to(device)
        y_hat = model(x, l)
        loss = F.cross_entropy(y_hat, y)
        pred = torch.max(y_hat, 1)[1]
        correct += (pred == y).float().sum()
        total += y.shape[0]
        sum_loss += loss.item()*y.shape[0]
        sum_rmse += np.sqrt(mean_squared_error(pred.detach().cpu(), y.unsqueeze(-1).cpu()))*y.shape[0]
    return sum_loss/total, correct/total, sum_rmse/total
...
class LSTM_fl(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, number_of_layers):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=number_of_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, number_of_output_classes)
        self.dropout = nn.Dropout(dropout_value)

    def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        return self.linear(ht[-1])

The dataset is not a large one: it has 80,000 rows. The text column has a max length of 80 words and a mean of 40. The learning rate and batch size are increased to 0.0001 and 64.
After these changes, the first epoch needed around 30 min and I never saw the second one :-o. I had to stop it after more than an hour.
I would appreciate any further suggestions.

Thanks a lot

Can you check how long the individual batches take? In particular, check whether the time increases from batch to batch.
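A quick way to check is to time each batch, e.g. something like this (just a sketch wrapped around your existing loop):

import time

for i, (x, y, l) in enumerate(train_dl):
    start = time.perf_counter()
    x = x.long().to(device)
    y = y.long().to(device)
    y_pred = model(x, l)
    loss = F.cross_entropy(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # CUDA calls are asynchronous, so wait before reading the timer
    print("batch %d: %.3f s" % (i, time.perf_counter() - start))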

Most LSTM/GRU examples I see – and what I usually do as well – manually reset the hidden state for each batch. For example, have a look at the PyTorch Seq2Seq Tutorial; search for the initHidden() method and when it's called. However, you call self.lstm(x) without explicitly giving the hidden/cell state as input. I assume this means the hidden/cell state is automatically re-initialized. The thing is, if the hidden/cell state is not re-initialized, your computational graph for backpropagation grows and grows, which certainly causes performance issues.


Hi Chris,

Thank you.
I have checked and the time increases from batch to batch.

Regarding resetting the hidden state, there is a post on the PyTorch forum (hidden cell state) which references the docs: nn.LSTM takes your full sequence (rather than chunks), automatically initializes the hidden and cell states to zeros, runs the LSTM over your full sequence (updating the state along the way), and returns the final list of outputs and the final hidden/cell state.
I have tried adding

def forward(self, x, l):
        x = self.embeddings(x)
        x = self.dropout(x)

        # Initialize hidden state with zeros
        h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_().to(device)
        # Initialize cell state
        c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_().to(device)
        lstm_out, (ht, ct) = self.lstm(x, (h0.detach(), c0.detach()))
        lstm_out.size()
        lstm_out[:, -1, :]
        lstm_out = self.linear(lstm_out[:, -1, :])
        lstm_out.size()
        return self.linear(ht[-1])

Nevertheless, the time with every batch still increases.

What could be another reason for the time increasing from batch to batch? I suppose a related problem is that the model has low training and validation accuracy (it just fluctuates around a constant value).

Thank you

Hm, I can't see anything obvious that might cause the performance issues. Just some generic comments:

  • Since you initialize h0 and c0 yourself, you don't need to detach them as well. Maybe that's even bad for the training itself, but here I'm not sure.

  • I cannot see the point of the lines below. Again, I don't think they hurt, but in the end you return the last hidden state after it is pushed through a linear layer; lstm_out is never used (see the sketch after this list).

      lstm_out.size()
      lstm_out[:, -1, :]
      lstm_out = self.linear(lstm_out[:, -1, :])
      lstm_out.size()
    
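Putting both points together, the forward() method could be reduced to something like this (just a sketch reusing the attribute names from your snippet, i.e. assuming self.layer_dim and self.hidden_dim are set in __init__):

def forward(self, x, l):
    x = self.dropout(self.embeddings(x))
    # fresh zero states for every batch; no requires_grad_() or detach() needed
    h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)
    c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)
    lstm_out, (ht, ct) = self.lstm(x, (h0, c0))
    # classify from the last layer's final hidden state
    return self.linear(ht[-1])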

I have an implementation of a multiclass LSTM/GRU classifier. Maybe you can have a look to see what might be different. The code is a bit verbose to make the classifier configurable.


Hi Chris,

Thank you for your time and suggestions.

I followed your advice regarding detach() and the four lines with lstm_out in forward().

I got some improvements in speed. I can finally run at least a small number of epochs to the end. However, the processing time still increases during training, and I would say there is much room for improvement.

For example, the processing time for the first few epochs is about 30 s, 90 s, 250 s, 370 s, and 490 s. Then it stays more or less stable at around 400 s until the end of training. I have noticed better results when increasing batch_size; these results are with batch_size = 1024.

Thank you for sharing your example. I have analyzed it but could not find what it includes that my code might be missing.

I wonder what else I can do to get a better processing time.