CUDA out of memory error when training a simple BiLSTM

Hi all,

I'm new to PyTorch, and I'm trying to train (on a GPU) a simple BiLSTM for a regression task. I have 65 features and the shape of my training set is (1969875, 65). The specific architecture of my model is:

LSTM(
  (lstm2): LSTM(65, 260, num_layers=3, bidirectional=True)
  (linear): Linear(in_features=520, out_features=1, bias=True)
)

I'm using a batch size of 64.
The GPU is an NVIDIA Tesla P100 16GB.
The error I get is:

RuntimeError: CUDA out of memory. Tried to allocate 7.63 GiB (GPU 0; 15.90 GiB total capacity; 12.06 GiB already allocated; 3.16 GiB free; 12.08 GiB reserved in total by PyTorch)
srun: error: gpu018: task 0: Exited with exit code 1

I'm starting to think I'm doing something wrong in my code, since 16 GB should be more than enough for this amount of data and this model.
I'm sharing my code here; maybe someone can tell me what I'm doing wrong or missing:

import numpy as np
import torch
import torch.nn as nn

num_features = 65
HIDDEN_SIZE = num_features * 4
BATCH_SIZE = 64
OUTPUT_DIM = 1
NUM_LAYERS = 2
LEARNING_RATE = 0.0005
NUM_EPOCHS = 500
SEED = 42
# Set seeds for python, numpy and torch
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, batch_size, output_dim, num_layers):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers

        self.lstm2 = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, bidirectional=True)
        self.linear = nn.Linear(self.hidden_dim*2, output_dim)

    def forward(self, input):
        lstm_out2, self.hidden = self.lstm2(input)
        y_pred = self.linear(lstm_out2)
        return y_pred


if __name__ == '__main__':

    if torch.cuda.is_available():
        dtype = torch.cuda.FloatTensor
    else:
        dtype = torch.float

    # file = 'D:\\QU-Lab\\coding\\PyTorch\\LSTM\\featuresNorm_MFCC_Extended20.csv'
    # file = 'Z:\\opt\\Noise_Level\\mfcc_data\\featuresNorm_MFCC_Extended20.csv'
    file = '../mfcc_data/featuresNorm_MFCC_Extended20.csv'

    features = load_data(file)  # load_data is a CSV-loading helper defined elsewhere (not shown)
    
    X = np.hstack((features.iloc[:, 1:66].values, features['FILE'].values.reshape(len(features), 1)))
    y = features['LABEL_LEVEL'].values
	
    # Split the data into training and testing (split code omitted here)

    X_train = torch.from_numpy(X_train[:, 0:65].astype(np.float32)).type(dtype)
    X_train = X_train.unsqueeze(0)
    y_train = torch.from_numpy(y_train.astype(np.float32)).type(dtype)

    X_test = torch.from_numpy(X_test[:, 0:65].astype(np.float32)).type(dtype)
    X_test = X_test.unsqueeze(0)
    y_test = torch.from_numpy(y_test.astype(np.float32)).type(dtype)
	
	
    lstm_model = LSTM(num_features, HIDDEN_SIZE, batch_size=BATCH_SIZE, output_dim=OUTPUT_DIM, num_layers=NUM_LAYERS)
    # The same call with the values filled in:
    # lstm_model = LSTM(65, 260, 64, 1, 2)
    
    lstm_model.to(device)
    loss_function = torch.nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(lstm_model.parameters(), lr=LEARNING_RATE)

    hist = np.zeros(NUM_EPOCHS)
    for epoch in range(NUM_EPOCHS):
        lstm_model.zero_grad()

        y_pred = lstm_model(X_train)
        y_pred = y_pred.squeeze()
        loss = loss_function(y_pred, y_train)
		
        if epoch % 20 == 0:
            print("Epoch ", epoch, "MSE: ", loss.item())
            
        hist[epoch] = loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Thank you very much in advance for your help! 🙂

Could you post the shapes of a dummy input and target tensor?
Based on your model, I assume you are passing the input as [seq, batch_size, nb_features], while the sequence length doesn't seem to be defined.
I just tried an input tensor of [64, 64, 65] and it uses ~965MB including the CUDA context.
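
Something along these lines would reproduce that check (a rough sketch; it reuses the LSTM class from your post, and note that torch.cuda.memory_allocated() does not include the CUDA context itself):

model = LSTM(65, 260, batch_size=64, output_dim=1, num_layers=3).to('cuda')
x = torch.randn(64, 64, 65, device='cuda')  # [seq_len, batch_size, nb_features]
out = model(x)
print(torch.cuda.memory_allocated() / 1024**2, 'MB allocated by tensors')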

I can see the following issues with your code:

  • The constructor of LSTM does not have a batch_size parameter. The LSTM does not do any batch handling; it simply processes whatever batch you give it. It's up to you to split your dataset into batches.
  • In your training loop, you just loop over epochs; I cannot see any loop over batches. You have to do this manually, e.g., in a nested loop with the outer loop for the epochs and the inner loop for the batches within each epoch (see the sketch after this list). In short, I think you feed your whole dataset to the LSTM at once, which will naturally bust your memory.
  • You also don't seem to re-initialize or detach() the hidden state after each batch. Check out the initHidden() method in this tutorial. In a nutshell, without that, your backprop graph for the hidden state would grow with every batch.
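
A minimal sketch of that nested loop (illustrative only; num_batches, X_train, y_train, lstm_model, etc. are placeholders based on the names in your code, and it assumes you add an init_hidden() method like in the tutorial):

for epoch in range(NUM_EPOCHS):
    hidden = lstm_model.init_hidden()  # re-initialize the hidden state each epoch
    for i in range(num_batches):
        x_batch = X_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE].unsqueeze(0)
        y_batch = y_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]

        optimizer.zero_grad()
        y_pred = lstm_model(x_batch).squeeze()
        loss = loss_function(y_pred, y_batch)
        loss.backward()
        optimizer.step()

        # if you carry the hidden state over between batches, detach it so the graph does not grow:
        # hidden = tuple(h.detach() for h in hidden)

Alternatively, wrapping the tensors in a TensorDataset and a DataLoader would create the batches for you.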

I hope that helps.

Hi, thank you very much for taking a look at this.
The shape of X_train at y_pred = lstm_model(X_train) is:

torch.Size([1, 1969875, 65])

Then y_pred is of shape:

torch.Size([1969875])

I only passed batch_size as a parameter when creating my model; I assumed the model would figure out the "batch sizes" of X_train by itself?
Thanks a lot for your help.

No, you have to create the batches yourself. LSTM is not doing it for you :).
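
For a rough sense of scale: nn.LSTM by default expects input as [seq_len, batch, features], so your [1, 1969875, 65] tensor is treated as a single batch of 1,969,875 length-1 sequences. The bidirectional output of one layer alone is [1, 1969875, 520] in float32, i.e. roughly 1969875 * 520 * 4 bytes ≈ 3.8 GiB, and a single buffer twice that size already comes to ≈ 7.6 GiB — suspiciously close to the 7.63 GiB allocation that failed in your error message. With multiple layers plus the activations stored for backprop, 16 GB runs out very quickly.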

Thank you very much @ptrblck and @vdw. I was missing splitting the data into batches before feeding it to the network, and calling init_hidden(). Now my model runs on the GPU without problems. Thanks again for your help.
I share my updated code here in case it is useful for someone else with the same problem:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd

num_features = 65
# The size of the hidden layer
HIDDEN_SIZE = num_features * 4
# The batch size
BATCH_SIZE = 75
OUTPUT_DIM = 1
NUM_LAYERS = 3
LEARNING_RATE = 0.0005
NUM_EPOCHS = 500

# Set seeds for python, numpy and torch
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)


class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, batch_size, output_dim, num_layers):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers

        # Define the LSTM layer
        self.lstm2 = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, bidirectional=True)

        # Define the output layer
        self.linear = nn.Linear(self.hidden_dim*2, output_dim)

    def init_hidden(self):
        # Shape is (num_layers * num_directions, batch_size, hidden_dim) for a bidirectional LSTM
        return (torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim, device=device),
                torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim, device=device))

    def forward(self, input):
        # Forward pass through the LSTM layer
        # (no initial hidden state is passed here, so it defaults to zeros for every batch)
        lstm_out2, self.hidden = self.lstm2(input)

        y_pred = self.linear(lstm_out2)
        return y_pred


if __name__ == '__main__':

    # dtype = torch.FloatTensor
    if torch.cuda.is_available():
        dtype = torch.cuda.FloatTensor  # use CUDA float tensors when running on the GPU
    else:
        dtype = torch.float

    file = 'data.csv'

    features = load_data(file)  # load_data is a CSV-loading helper defined elsewhere (not shown)
    
    # X = features.iloc[:, 1:66].values
    X = np.hstack((features.iloc[:, 1:66].values, features['FILE'].values.reshape(len(features), 1)))
    y = features['LABEL_LEVEL'].values
	
    # Split the data into training and testing (split code omitted here)

    X_train = torch.from_numpy(X_train[:, 0:65].astype(np.float32)).type(dtype)
    y_train = torch.from_numpy(y_train.astype(np.float32)).type(dtype)
	
    lstm_model = LSTM(num_features, HIDDEN_SIZE, batch_size=BATCH_SIZE, output_dim=OUTPUT_DIM, num_layers=NUM_LAYERS)
    lstm_model.to(device)
    loss_function = torch.nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(lstm_model.parameters(), lr=LEARNING_RATE)

    print(lstm_model)

    train_on_gpu = torch.cuda.is_available()
    if train_on_gpu:
        print("\nTraining on GPU")
    else:
        print("\nNo GPU, training on CPU")

    num_batches = int(X_train.shape[0] / BATCH_SIZE)
    hist = np.zeros(NUM_EPOCHS)

    for epoch in range(NUM_EPOCHS):
        # Re-initialize the hidden state at the start of each epoch (no statefulness between epochs)
        train_loss = 0.0
        lstm_model.hidden = lstm_model.init_hidden()
        for i in range(num_batches):
            lstm_model.zero_grad()

            X_train_batch = X_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
            y_train_batch = y_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
            # Add a seq_len dimension of 1 -> [1, BATCH_SIZE, num_features]
            X_train_batch = X_train_batch.unsqueeze(0)
			
            y_pred = lstm_model(X_train_batch)

            y_pred = y_pred.squeeze()
            loss = loss_function(y_pred, y_train_batch)
            train_loss += loss.item()

            # Zero out the gradients, else they will accumulate between batches
            optimizer.zero_grad()
            # Backward pass
            loss.backward()
            # Update parameters
            optimizer.step()
			
        if epoch % 20 == 0:
            print("Epoch ", epoch, "MSE: ", train_loss/num_batches)
            print("Epoch ", epoch, "train_loss: ", train_loss)
			
    # Saving the model
    torch.save(lstm_model.state_dict(), 'BiLSTM_model.pytorch')
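
In case it helps, the saved weights can be restored later by re-creating the model and loading the state dict (a quick sketch):

model = LSTM(num_features, HIDDEN_SIZE, batch_size=BATCH_SIZE, output_dim=OUTPUT_DIM, num_layers=NUM_LAYERS)
model.load_state_dict(torch.load('BiLSTM_model.pytorch', map_location=device))
model.to(device)
model.eval()  # switch to evaluation mode for inference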