CUDA out of memory error when training a simple BiLSTM

Hi all,

I'm new to PyTorch, and I'm trying to train (on a GPU) a simple BiLSTM for a regression task. I have 65 features and the shape of my training set is (1969875, 65). The specific architecture of my model is:

LSTM(
  (lstm2): LSTM(65, 260, num_layers=3, bidirectional=True)
  (linear): Linear(in_features=520, out_features=1, bias=True)
)

I'm using a batch size of 64.
The GPU is an NVIDIA Tesla P100 16GB.
The error I get is:

RuntimeError: CUDA out of memory. Tried to allocate 7.63 GiB (GPU 0; 15.90 GiB total capacity; 12.06 GiB already allocated; 3.16 GiB free; 12.08 GiB reserved in total by PyTorch)
srun: error: gpu018: task 0: Exited with exit code 1

I'm starting to think I'm doing something wrong in my code, since 16 GB should be more than enough for this amount of data and this model.
I'm sharing my code here; maybe someone can tell me what I'm doing wrong or missing:

import numpy as np
import torch
import torch.nn as nn

num_features = 65
HIDDEN_SIZE = num_features * 4
BATCH_SIZE = 64
OUTPUT_DIM = 1
NUM_LAYERS = 2
LEARNING_RATE = 0.0005
NUM_EPOCHS = 500
SEED = 42
# Set seeds for python, numpy and torch
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, batch_size, output_dim, num_layers):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers

        self.lstm2 = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, bidirectional=True)
        self.linear = nn.Linear(self.hidden_dim*2, output_dim)

    def forward(self, input):
        lstm_out2, self.hidden = self.lstm2(input)
        y_pred = self.linear(lstm_out2)
        return y_pred


if __name__ == '__main__':

    if torch.cuda.is_available():
        dtype = torch.cuda.FloatTensor
    else:
        dtype = torch.float

    # file = 'D:\\QU-Lab\\coding\\PyTorch\\LSTM\\featuresNorm_MFCC_Extended20.csv'
    # file = 'Z:\\opt\\Noise_Level\\mfcc_data\\featuresNorm_MFCC_Extended20.csv'
    file = '../mfcc_data/featuresNorm_MFCC_Extended20.csv'

    features = load_data(file)  # load_data is a CSV-loading helper defined elsewhere (not shown)
    
    X = np.hstack((features.iloc[:, 1:66].values, features['FILE'].values.reshape(len(features), 1)))
    y = features['LABEL_LEVEL'].values
	
    # Split the data into training and testing (split code omitted here)

    X_train = torch.from_numpy(X_train[:, 0:65].astype(np.float32)).type(dtype)
    X_train = X_train.unsqueeze(0)
    y_train = torch.from_numpy(y_train.astype(np.float32)).type(dtype)

    X_test = torch.from_numpy(X_test[:, 0:65].astype(np.float32)).type(dtype)
    X_test = X_test.unsqueeze(0)
    y_test = torch.from_numpy(y_test.astype(np.float32)).type(dtype)
	
	
    lstm_model = LSTM(num_features, HIDDEN_SIZE, batch_size=BATCH_SIZE, output_dim=OUTPUT_DIM, num_layers=NUM_LAYERS)
    # The same call with the values filled in:
    # lstm_model = LSTM(65, 260, 64, 1, 2)
    
    lstm_model.to(device)
    loss_function = torch.nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(lstm_model.parameters(), lr=LEARNING_RATE)

    hist = np.zeros(NUM_EPOCHS)
    for epoch in range(NUM_EPOCHS):
        lstm_model.zero_grad()

        y_pred = lstm_model(X_train)
        y_pred = y_pred.squeeze()
        loss = loss_function(y_pred, y_train)
		
        if epoch % 20 == 0:
            print("Epoch ", epoch, "MSE: ", loss.item())
            
        hist[epoch] = loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Thank you very much in advance for your help! 🙂

Could you post the shapes of a dummy input and target tensor?
Based on your model, I assume you are passing the input as [seq, batch_size, nb_features], while the sequence length doesn't seem to be defined.
I just tried an input tensor of [64, 64, 65] and it uses ~965MB including the CUDA context.
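
Something along these lines would reproduce that check (a rough sketch; it reuses the LSTM class from your post, and note that torch.cuda.memory_allocated() does not include the CUDA context itself):

model = LSTM(65, 260, batch_size=64, output_dim=1, num_layers=3).to('cuda')
x = torch.randn(64, 64, 65, device='cuda')  # [seq_len, batch_size, nb_features]
out = model(x)
print(torch.cuda.memory_allocated() / 1024**2, 'MB allocated by tensors')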

I can see the following issues with your code:

  • The constructor of LSTM does not have a batch_size parameter. The LSTM does not do any batch handling; it simply processes whatever batch you give it. It's up to you to split your dataset into batches.
  • In your training loop, you just loop over epochs; I cannot see any loop over batches. You have to do this manually, e.g., in a nested loop with the outer loop for the epochs and the inner loop for the batches within each epoch (see the sketch after this list). In short, I think you feed your whole dataset to the LSTM at once, which will naturally bust your memory.
  • You also don't seem to re-initialize or detach() the hidden state after each batch. Check out the initHidden() method in this tutorial. In a nutshell, without that, your backprop graph for the hidden state would grow with every batch.
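
A minimal sketch of that nested loop (illustrative only; num_batches, X_train, y_train, lstm_model, etc. are placeholders based on the names in your code, and it assumes you add an init_hidden() method like in the tutorial):

for epoch in range(NUM_EPOCHS):
    hidden = lstm_model.init_hidden()  # re-initialize the hidden state each epoch
    for i in range(num_batches):
        x_batch = X_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE].unsqueeze(0)
        y_batch = y_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]

        optimizer.zero_grad()
        y_pred = lstm_model(x_batch).squeeze()
        loss = loss_function(y_pred, y_batch)
        loss.backward()
        optimizer.step()

        # if you carry the hidden state over between batches, detach it so the graph does not grow:
        # hidden = tuple(h.detach() for h in hidden)

Alternatively, wrapping the tensors in a TensorDataset and a DataLoader would create the batches for you.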

I hope that helps.

Hi, thank you very much for taking a look at this.
The shape of X_train at y_pred = lstm_model(X_train) is:

torch.Size([1, 1969875, 65])

Then y_pred is of shape:

torch.Size([1969875])

I only passed batch_size as a parameter when creating my model; I assumed the model would figure out the "batch sizes" of X_train by itself?
Thanks a lot for your help.

No, you have to create the batches yourself. LSTM is not doing it for you :).
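
For a rough sense of scale: nn.LSTM by default expects input as [seq_len, batch, features], so your [1, 1969875, 65] tensor is treated as a single batch of 1,969,875 length-1 sequences. The bidirectional output of one layer alone is [1, 1969875, 520] in float32, i.e. roughly 1969875 * 520 * 4 bytes ≈ 3.8 GiB, and a single buffer twice that size already comes to ≈ 7.6 GiB — suspiciously close to the 7.63 GiB allocation that failed in your error message. With multiple layers plus the activations stored for backprop, 16 GB runs out very quickly.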

Thank you very much @ptrblck and @vdw. I was missing splitting the data into batches before feeding it to the network, and calling init_hidden(). Now my model runs on the GPU without problems. Thanks again for your help.
I share my updated code here in case it is useful for someone else with the same problem:

import torch
import torch.nn as nn
import numpy as np
import pandas as pd

num_features = 65
# The size of the hidden layer
HIDDEN_SIZE = num_features * 4
# The batch size
BATCH_SIZE = 75
OUTPUT_DIM = 1
NUM_LAYERS = 3
LEARNING_RATE = 0.0005
NUM_EPOCHS = 500

# Set seeds for python, numpy and torch
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)


class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, batch_size, output_dim, num_layers):
        super(LSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.num_layers = num_layers

        # Define the LSTM layer
        self.lstm2 = nn.LSTM(self.input_dim, self.hidden_dim, self.num_layers, bidirectional=True)

        # Define the output layer
        self.linear = nn.Linear(self.hidden_dim*2, output_dim)

    def init_hidden(self):
        # Shape is (num_layers * num_directions, batch_size, hidden_dim) for a bidirectional LSTM
        return (torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim, device=device),
                torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim, device=device))

    def forward(self, input):
        # Forward pass through the LSTM layer
        # (no initial hidden state is passed here, so it defaults to zeros for every batch)
        lstm_out2, self.hidden = self.lstm2(input)

        y_pred = self.linear(lstm_out2)
        return y_pred


if __name__ == '__main__':

    # dtype = torch.FloatTensor
    if torch.cuda.is_available():
        dtype = torch.cuda.FloatTensor  # use CUDA float tensors when running on the GPU
    else:
        dtype = torch.float

    file = 'data.csv'

    features = load_data(file)  # load_data is a CSV-loading helper defined elsewhere (not shown)
    
    # X = features.iloc[:, 1:66].values
    X = np.hstack((features.iloc[:, 1:66].values, features['FILE'].values.reshape(len(features), 1)))
    y = features['LABEL_LEVEL'].values
	
    # Split the data into training and testing (split code omitted here)

    X_train = torch.from_numpy(X_train[:, 0:65].astype(np.float32)).type(dtype)
    y_train = torch.from_numpy(y_train.astype(np.float32)).type(dtype)
	
    lstm_model = LSTM(num_features, HIDDEN_SIZE, batch_size=BATCH_SIZE, output_dim=OUTPUT_DIM, num_layers=NUM_LAYERS)
    lstm_model.to(device)
    loss_function = torch.nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(lstm_model.parameters(), lr=LEARNING_RATE)

    print(lstm_model)

    train_on_gpu = torch.cuda.is_available()
    if train_on_gpu:
        print("\nTraining on GPU")
    else:
        print("\nNo GPU, training on CPU")

    num_batches = int(X_train.shape[0] / BATCH_SIZE)
    hist = np.zeros(NUM_EPOCHS)

    for epoch in range(NUM_EPOCHS):
        # Re-initialize the hidden state at the start of each epoch (no statefulness between epochs)
        train_loss = 0.0
        lstm_model.hidden = lstm_model.init_hidden()
        for i in range(num_batches):
            lstm_model.zero_grad()

            X_train_batch = X_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
            y_train_batch = y_train[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]
            # Add a seq_len dimension of 1 -> [1, BATCH_SIZE, num_features]
            X_train_batch = X_train_batch.unsqueeze(0)
			
            y_pred = lstm_model(X_train_batch)

            y_pred = y_pred.squeeze()
            loss = loss_function(y_pred, y_train_batch)
            train_loss += loss.item()

            # Zero out the gradients, else they will accumulate between batches
            optimizer.zero_grad()
            # Backward pass
            loss.backward()
            # Update parameters
            optimizer.step()
			
        if epoch % 20 == 0:
            print("Epoch ", epoch, "MSE: ", train_loss/num_batches)
            print("Epoch ", epoch, "train_loss: ", train_loss)
			
    # Saving the model
    torch.save(lstm_model.state_dict(), 'BiLSTM_model.pytorch')
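
In case it helps, the saved weights can be restored later by re-creating the model and loading the state dict (a quick sketch):

model = LSTM(num_features, HIDDEN_SIZE, batch_size=BATCH_SIZE, output_dim=OUTPUT_DIM, num_layers=NUM_LAYERS)
model.load_state_dict(torch.load('BiLSTM_model.pytorch', map_location=device))
model.to(device)
model.eval()  # switch to evaluation mode for inference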