How to train LSTM with GPU

(Seungsu Kim) #1

I’m trying to train an LSTM connected to a couple of MLP layers. The model is coded as follows:

class RNNBlock(nn.Module):

    def __init__(self, in_dim, hidden_dim, num_layer=1, dropout=0):
        super(RNNBlock, self).__init__()

        self.hidden_dim = hidden_dim
        self.num_layer = num_layer
        # dropout is passed by keyword: the fourth positional argument of nn.LSTM is bias
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layer, dropout=dropout)

    def forward(self, onehot, length):
        batch_size = onehot.shape[0]
        h_in = torch.randn(self.num_layer, batch_size, self.hidden_dim).cuda()
        c_in = torch.randn(self.num_layer, batch_size, self.hidden_dim).cuda()
        packed = nn.utils.rnn.pack_padded_sequence(onehot, length, batch_first=True).cuda()
        output, (h_out, c_out) = self.lstm(packed, (h_in, c_in))
        unpacked, unpacked_length = nn.utils.rnn.pad_packed_sequence(output, batch_first=True)
        # Keep only the output at the last valid time step of each sequence
        vectors = list()
        for i, vector in enumerate(unpacked):
            vectors.append(unpacked[i, unpacked_length[i]-1, :].view(1, -1))
        out = torch.cat(vectors, 0)
        return out

class Predictor(nn.Module):

    def __init__(self, in_dim, out_dim, act=None):
        super(Predictor, self).__init__()

        self.linear = nn.Linear(in_dim, out_dim)
        self.activation = act

    def forward(self, x):
        out = self.linear(x)
        if self.activation is not None:
            out = self.activation(out)
        return out

class RNNNet(nn.Module):

    def __init__(self, args):
        super(RNNNet, self).__init__()

        self.rnnBlock = RNNBlock(args.in_dim, args.hidden_dim, args.num_layer, args.dropout)
        self.pred1 = Predictor(args.hidden_dim, args.pred_dim1, act=nn.ReLU())
        self.pred2 = Predictor(args.pred_dim1, args.pred_dim2, act=nn.ReLU())
        self.pred3 = Predictor(args.pred_dim2, args.out_dim)

    def forward(self, onehot, length):
        out = self.rnnBlock(onehot, length)
        out = self.pred1(out)
        out = self.pred2(out)
        out = self.pred3(out)
        return out

And this is my train function:

def train(model, device, optimizer, criterion, data_train, bar, args):
    epoch_train_loss = 0
    epoch_train_mae = 0

    for i, batch in enumerate(data_train):
        list_onehot = torch.tensor(batch[0]).cuda().float()
        list_length = torch.tensor(batch[1]).cuda()
        list_logP = torch.tensor(batch[2]).cuda().float()
        # Sort onehot tensor with respect to the sequence length.
        list_length, list_index = torch.sort(list_length, descending=True)
        list_onehot = torch.Tensor([list_onehot.tolist()[i] for i in list_index]).cuda().float()
        list_logP = list_logP[list_index]  # keep the labels aligned with the sorted inputs

        optimizer.zero_grad()
        list_pred_logP = model(list_onehot, list_length).squeeze().cuda()
        list_pred_logP.require_grad = False  # no effect: the attribute name is misspelled (requires_grad)

        train_loss = criterion(list_pred_logP, list_logP)
        train_mae = mean_absolute_error(list_pred_logP.tolist(), list_logP.tolist())
        epoch_train_loss += train_loss.item()
        epoch_train_mae += train_mae

        train_loss.backward()
        optimizer.step()

    epoch_train_loss /= len(data_train)
    epoch_train_mae /= len(data_train)

    return model, epoch_train_loss, epoch_train_mae

The list_onehot and list_length tensors are loaded from the DataLoader and moved to the GPU. Then, to use a packed sequence as input, I sort both list_onehot and list_length and move them to the GPU. The model was moved to the GPU, and the h_in and c_in tensors and the packed sequence object were moved to the GPU as well. However, when I run this code, it does not use the GPU, only the CPU. What should I do to train this model on the GPU?

(Ben Eyal) #2

Did you run model.cuda()?
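
For readers who want to check this themselves, here is a minimal sketch of moving a model to a device and confirming that no parameter was left behind. The LSTM here is a hypothetical stand-in, not the RNNNet from the first post, and the sketch falls back to the CPU when CUDA is unavailable.

```python
import torch
import torch.nn as nn

# Fall back to the CPU so the sketch also runs on machines without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in model; not the RNNNet from the post above.
model = nn.LSTM(input_size=8, hidden_size=16, num_layers=1)
model = model.to(device)  # same effect as model.cuda() when device is a CUDA device

# Collect the device type of every parameter; a single entry means
# every parameter lives on the same kind of device.
param_devices = {p.device.type for p in model.parameters()}
print(param_devices)
```

If the printed set contains both "cpu" and "cuda", some parameters were created after the move or outside of registered submodules.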

(Seungsu Kim) #3


Yes, I did. I tried a different call first and am now using model.cuda(), but neither works.

(Ben Eyal) #4

Hmm… Do you create the optimizer after calling model.cuda() or before?

(Seungsu Kim) #5

After calling model.cuda(). This is my experiment function

def experiment(dict_partition, device, bar, args):
    time_start = time.time()

    model = RNNNet(args)
    model.cuda()

    if args.optim == 'Adam':
        optimizer = optim.Adam(model.parameters(),
    elif args.optim == 'RMSprop':
        optimizer = optim.RMSprop(model.parameters(),
    elif args.optim == 'SGD':
        optimizer = optim.SGD(model.parameters(),
    else:
        assert False, 'Undefined Optimizer Type'

    criterion = nn.MSELoss()
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=args.step_size, gamma=args.gamma)

    list_train_loss = list()
    list_val_loss = list()
    list_train_mae = list()
    list_val_mae = list()

    data_train = DataLoader(dict_partition['train'], batch_size=args.batch_size, shuffle=args.shuffle)
    data_val = DataLoader(dict_partition['val'], batch_size=args.batch_size, shuffle=args.shuffle)

    for epoch in range(args.epoch):
        model, train_loss, train_mae = train(model, device, optimizer, criterion, data_train, bar, args)

        model, val_loss, val_mae = validate(model, device, criterion, data_val, bar, args)

    data_test = DataLoader(dict_partition['test'], batch_size=args.batch_size, shuffle=args.shuffle)

    mae, std, logP_total, pred_logP_total = test(model, device, data_test, args)

    time_end = time.time()
    time_required = time_end - time_start

    args.list_train_loss = list_train_loss
    args.list_val_loss = list_val_loss
    args.list_train_mae = list_train_mae
    args.list_val_mae = list_val_mae
    args.logP_total = logP_total
    args.pred_logP_total = pred_logP_total
    args.mae = mae
    args.std = std
    args.time_required = time_required

    return args

(Ben Eyal) #6

Weird, I can’t think of any reason why it won’t work… You’re not getting any errors during training?

(Seungsu Kim) #7

Yes, the model is trained on the CPU without any error.
If I watch nvidia-smi, I can see that 477MB of data is uploaded to GPU memory, but the GPU is not used for training.

(Seungsu Kim) #8

May I send my github link to show you the full code?

(Ben Eyal) #9

Sure. If I can, I’ll run it myself.
@ptrblck Any ideas?

(Seungsu Kim) #10


This is the link. You can see the Assignment6_logP_RNN.ipynb file.

Thanks a lot, I’ve been struggling with this problem for two days.


Not sure what’s going on, as the model seems to be on the GPU.
I would assume @Probe would get an error in forward, as it seems that

h_in = nn.Parameter(torch.randn(self.num_layer, batch_size, self.hidden_dim))
c_in = nn.Parameter(torch.randn(self.num_layer, batch_size, self.hidden_dim)) 

are still on the CPU, while self.lstm should be on the GPU.
Could you check that?
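
One way to sidestep the question of where the initial states live is to allocate them on the same device as the input. The sketch below assumes a simplified, hypothetical module (TinyRNN) rather than the RNNBlock from the first post, and runs on the CPU if CUDA is unavailable.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):  # hypothetical, simplified stand-in for RNNBlock
    def __init__(self, in_dim, hidden_dim, num_layer=1):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layer = num_layer
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layer, batch_first=True)

    def forward(self, x):
        batch_size = x.shape[0]
        # Allocate the initial states directly on the input's device and dtype,
        # so they follow the data wherever it is.
        h_in = torch.randn(self.num_layer, batch_size, self.hidden_dim,
                           device=x.device, dtype=x.dtype)
        c_in = torch.randn_like(h_in)
        output, (h_out, c_out) = self.lstm(x, (h_in, c_in))
        return output, h_in.device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyRNN(4, 6).to(device)
x = torch.randn(3, 5, 4, device=device)   # (batch, time, features)
output, state_device = model(x)
print(output.device, state_device)
```

Because the states are plain tensors created inside forward, they do not need to be wrapped in nn.Parameter unless they are meant to be learned.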

(Seungsu Kim) #12


No error occurs, but the model runs on the CPU. Which of these three options should I use to move h_in and c_in to the GPU?

h_in = nn.Parameter(torch.randn(self.num_layer, batch_size, self.hidden_dim)).cuda()
h_in = nn.Parameter(torch.randn(self.num_layer, batch_size, self.hidden_dim).cuda())
h_in = torch.randn(self.num_layer, batch_size, self.hidden_dim).cuda()


Use the second approach and try it again. I’m still not sure why the code doesn’t throw an error.
Let me know, if the model still runs on CPU and I’ll try to debug it a bit later.

(Seungsu Kim) #14


The second approach still causes no error, but the model runs on the CPU.

Also, I’ve read an article saying that pack_padded_sequence requires the list of lengths to be a CPU tensor. In addition, the forward function of the RNN block returns only the last output of each sequence in the batch, and I’ve used a function to do this. Might this be the reason?
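
As an illustration of the two points raised here, the following sketch keeps the lengths on the CPU and selects the last valid output of each sequence with a single indexing operation instead of a Python loop. All shapes and values are made up for the example, and everything runs on the CPU for simplicity.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=4, hidden_size=6, num_layers=1, batch_first=True)

padded = torch.randn(3, 5, 4)        # (batch, max_len, features), sorted longest first
lengths = torch.tensor([5, 3, 2])    # lengths stay on the CPU

packed = pack_padded_sequence(padded, lengths, batch_first=True)
output, (h_out, c_out) = lstm(packed)
unpacked, unpacked_length = pad_packed_sequence(output, batch_first=True)

# Pick the last valid time step of every sequence in one indexing operation.
rows = torch.arange(unpacked.size(0))
last_step = unpacked[rows, unpacked_length - 1]   # shape (batch, hidden)

# For a unidirectional LSTM fed a packed sequence, the final hidden state of
# the top layer holds the same vectors.
same = torch.allclose(last_step, h_out[-1], atol=1e-6)
print(last_step.shape, same)
```

For a unidirectional LSTM this means h_out[-1] can be used directly, which avoids unpacking the output at all.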

(Seungsu Kim) #15

@beneyal @ptrblck

I’m not sure why, but the problem has been solved.
The answer was to use a custom collate function in the DataLoader, so that the DataLoader yields a PackedSequence object and the labels, instead of building the PackedSequence object in the forward function of the custom LSTM module.

I will commit the working version soon.
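
Until the working version is available, here is a minimal sketch of what such a collate function could look like. The dataset, the function name pack_collate, and the shapes are all hypothetical; this is not the code from the notebook.

```python
import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, PackedSequence

# Hypothetical toy dataset: (sequence of shape (length, features), scalar label) pairs.
samples = [(torch.randn(n, 4), torch.tensor(float(n))) for n in (2, 5, 3, 4)]

def pack_collate(batch):
    # Sort by length, longest first, so labels stay aligned with their sequences.
    batch = sorted(batch, key=lambda item: item[0].size(0), reverse=True)
    seqs = [seq for seq, _ in batch]
    labels = torch.stack([label for _, label in batch])
    lengths = torch.tensor([seq.size(0) for seq in seqs])  # stays on the CPU
    padded = pad_sequence(seqs, batch_first=True)           # zero-pad to the longest
    packed = pack_padded_sequence(padded, lengths, batch_first=True)
    return packed, labels

loader = DataLoader(samples, batch_size=4, shuffle=False, collate_fn=pack_collate)
packed, labels = next(iter(loader))
print(type(packed).__name__, labels)
```

In the training loop the batch can then be moved with packed.to(device) and labels.to(device) before calling the model.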


Good to hear you’ve solved this issue!
I’m still a bit confused why your code didn’t throw an error, as it seems some parameters were on the GPU while others stayed on the CPU.

(Edoardo Daniele Cannas) #17

Hi everybody,

I am replying to this topic since I am facing a similar problem to the one of @Probe, but his solution of using a custom collate function in the DataLoader is not working for me.

I have a recurrent autoencoder whose encoding capability I have to gauge, so my net is composed of two layers (code below):

  1. an encoding layer composed by the LSTM;
  2. a decoding layer, which is nothing but a dense layer that tries to reconstruct the input from the LSTM output.

class RnnLSTMAutoEncoder(nn.Module):
    """ Rnn based on the LSTM model

        Args:
              input_length (int): input dimension
              code_length (int): LSTM output dimension
              num_layers (int): LSTM layers' number
    """

    ##  Constructor
    def __init__(self, input_length, code_length, num_layers=1):
        super(RnnLSTMAutoEncoder, self).__init__()

        #  Attributes
        self.input_length = input_length
        self.code_length = code_length
        self.num_layers = num_layers

        #  Nets
        self.encodeLayer = nn.LSTM(self.input_length, self.code_length, num_layers=self.num_layers, batch_first=True)
        self.decodeLayer = nn.Linear(self.code_length, self.input_length)

        # Decode layer parameters' initialization
        self.decodeLayer.bias = nn.Parameter(torch.zeros_like(self.decodeLayer.bias))

    ##  Encode function
    def encode(self, x):
        # CODING
        output, _ = self.encodeLayer(x)
        return output

    ##  Decode function
    def decode(self, x):
        # DECODING (linear dense layer followed by an activation function [identity in this case, so none])
        x = self.decodeLayer(x)
        return x

    ##  Forward function
    def forward(self, x):
        encoded = self.encode(x)
        if isinstance(encoded, torch.Tensor):
            decoded = self.decode(encoded)
        else:
            # PackedSequence output: unpack it and decode only the last valid step of each sequence
            unpacked, unpacked_length = nn.utils.rnn.pad_packed_sequence(encoded, batch_first=True)
            vectors = list()
            for i, vector in enumerate(unpacked):
                vectors.append(unpacked[i, unpacked_length[i] - 1, :].view(1, -1))
            decoded = self.decode(torch.cat(vectors, 0))
        return decoded

Following Probe’s suggestion, I wrote my custom collate function as follows:

def my_collate(batch):
    data = [item[0] for item in batch]
    x = torch.stack(data)

    # Lengths vector for the correct packing of the input
    lengths = torch.zeros(x.size()[0])
    for i in range(x.size()[0]):
        for j in range(seq_length):
            if sum(1 for k in x[i, j, :] if k != 0) == x.size()[2]:
                lengths[i] += 1

    # Both padded sequences and lengths should be ordered descendingly wrt to the sequence length
    lengths, indices = torch.sort(lengths, descending=True)
    lengths = lengths.type(torch.ByteTensor)
    x = x[indices, :, :]

    y = torch.zeros(train_batch_size, x.size()[2])
    for i in range(train_batch_size):
        seq_el_idx = lengths[i].item() - 1
        y[i, :] = x[i, seq_el_idx, :]

    # Packing the data
    x = torch.nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)

    return [x, y]

My dataset is made of vectors of features extracted from video frames, so what I give to the LSTM is a sequence of vectors that from step t goes back in time till step t-seq_length.
Obviously, for the first time steps (for example step 1, the first video frame), I have nothing that goes back in time. Thus, I wrote a custom Dataset class which in this case fills the sequence with zeros until it reaches seq_length, while my collate function converts it into a PackedSequence object (the x element returned in the batch).
For evaluating the net’s performance, instead, I just need to compute the loss between the last element of the sequence (rearranged in the y element returned in the batch) and the last element of the packed sequence I receive as output.

As Probe did in his code, with the custom collate function the DataLoader gives packedSequences as inputs to the autoencoder, while the padding of the output of the LSTM is handled in the forward function.
Everything works fine, but nonetheless my code is not running on the GPU.

I have debugged my code with PyCharm, and everything seems to be on the GPU: the input sequences, the LSTM output, the final autoencoder output, etc…, and in fact I can see the data uploaded to the GPU memory, but still, the whole training procedure takes place on the CPU.

I am currently managing the whole training procedure with Ignite, and my training code is the following:

##  Data loader helper

def get_data_loaders(train_batch_size, val_batch_size, num_workers, train_dir, val_dir, seq_length):
    #  Custom data transformation
    #  example: data_transform = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
    data_transform = transforms.Lambda(lambda x: normalize_feature_vector(x))

    #  Dataset instantiation

    co_t_set = CoOccurrencesDatasetRnnTime(train_dir, seq_length, data_transform)
    co_v_set = CoOccurrencesDatasetRnnTime(val_dir, seq_length, data_transform)

    #  Training set DataLoader

    train_loader = Data.DataLoader(co_t_set, train_batch_size, collate_fn=my_collate, shuffle=False,

    #  Validation set DataLoader

    val_loader = Data.DataLoader(co_v_set, val_batch_size, collate_fn=my_collate, shuffle=False,

    return train_loader, val_loader

##  Batch preparation

def autoencoder_batch(batch, device, non_blocking=False):
    # Simply sends the data to GPU
    x, y = batch

    if device == 'cuda':
        x = x.cuda().to(device)
        y = y.cuda().to(device)

    return x, y

##  Training routine

def autoencoder_training(trainer, batch):
    # Extract the input and "label"
    bx, by = autoencoder_batch(batch, device)

    # Send the model to GPU (if available)
    if device == 'cuda':
        model.cuda()

    # Forwarding
    decoded = model(bx)

    # Compute the loss
    loss = loss_func(decoded, by)

    # Optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

###     Model training    ###

##  Dataset loading parameters

train_path = 'training_set_path_on_my_machine'
val_path = 'validation_set_path_on_my_machine'
num_workers = 4

##  Training parameters

epochs = 30
train_batch_size = 5
val_batch_size = 5
LR = 0.005  # learning rate
input_length = 625
code_length = 100
seq_length = 25
es_patience = 10
exp_decay = 0.95
log_dir = 'logging_directory_on_my_machine'
log_interval = 10000  # number of batches for each log on the console

##  Logging configuration

logging.basicConfig(filemode='w', format='%(name)s - %(levelname)s - %(message)s', level=logging.INFO)

if __name__ == '__main__':

    #  Dataloaders instantiation
    print('Loading the datasets and extracting the features...')
    logging.info('Loading the datasets and extracting the features...')
    train_loader, val_loader = get_data_loaders(train_batch_size, val_batch_size,
                                                num_workers, train_path, val_path, seq_length)
    print('Features extracted!')
    logging.info('Features extracted!')

    #  Model instantiation
    model = RnnLSTMAutoEncoder(input_length, code_length)

    #  Writer instantiation for TensorboardX
    writer = create_summary_writer(model, train_loader, log_dir)  # creates a summary write with tensorboardX

    #  GPU loading (if available)
    device = 'cpu'
    if torch.cuda.is_available():
        device = 'cuda'

    #  Optimizer, trainer and evaluator instantiation
    optimizer = optim.Adam(model.parameters(), lr=LR)
    loss_func = nn.MSELoss()
    trainer = Engine(autoencoder_training)
    evaluator = create_supervised_evaluator(model,
                                            metrics={'MSE': Loss(nn.MSELoss())},

    ##          EVENTS HANDLER FOR IGNITE          ##

    def log_training_loss(engine):
        iter = (engine.state.iteration - 1) % len(train_loader) + 1
        if iter % log_interval == 0:
            print("Epoch[{}] Iteration[{}/{}] Loss: {:.5f}"
                  "".format(engine.state.epoch, iter, len(train_loader), engine.state.output))
            writer.add_scalar("training/loss", engine.state.output, engine.state.iteration)
            logging.info("Epoch[{}] Iteration[{}/{}] Loss: {:.5f}"
                         "".format(engine.state.epoch, iter, len(train_loader), engine.state.output))


    # Early stopping implementation
    def score_function(engine):
        val_loss = engine.state.metrics['MSE']
        return -val_loss

    handler = EarlyStopping(patience=es_patience, score_function=score_function, trainer=trainer)
    evaluator.add_event_handler(Events.EPOCH_COMPLETED, handler)

    # training results logging
    def log_training_results(engine):
        metrics = evaluator.state.metrics
        avg_MSE = metrics['MSE']
        print("Training Results - Epoch: {}, Avg loss: {:.5f}"
              .format(engine.state.epoch, avg_MSE))
        writer.add_scalar("training/avg_loss", avg_MSE, engine.state.epoch)
        logging.info('Training Results - Epoch: {}, Avg loss: {:.5f}'.format(engine.state.epoch, avg_MSE))

    # validation results logging
    def log_validation_results(engine):
        metrics = evaluator.state.metrics
        avg_MSE = metrics['MSE']
        print("Validation Results - Epoch: {}, Avg loss: {:.5f}"
              .format(engine.state.epoch, avg_MSE))
        writer.add_scalar("valdation/avg_loss", avg_MSE, engine.state.epoch)
        logging.info('Validation Results - Epoch: {}, Avg loss: {:.5f}'.format(engine.state.epoch, avg_MSE))

    ##     RUNNING

    print('Training...')
    trainer.run(train_loader, max_epochs=epochs)


Any suggestion on what it might be? Any help or hint is greatly appreciated.

Thanks for your time!

(Edoardo Daniele Cannas) #18

Hey everybody,

I reply here since I managed to solve my issue: the error was so stupid that I’m really embarrassed to write about it :sweat_smile::sweat_smile:
But still, I hope it may be useful for somebody, at least as a reminder that you need to take some breaks and rest from your code in order to clear your mind and see where the bugs are!

It turns out that the problem was how I wrote the my_collate function: it spent a lot of time computing the sequences’ lengths on the CPU, while the amount of computation executed on the GPU was so small that I could not see it in any performance profiler.
Therefore, I simply solved my issue by doing the most obvious thing: make my custom Dataset class give me the sequence’s length along with the sequence itself, and using my_collate function to just create the PackedSequence object for the input sequence. Here is the code for both of them in case it may be helpful for somebody.

My custom Dataset class:

import torch
import numpy as np

class CoOccurrencesDatasetRnnTime(torch.utils.data.Dataset):
    """
        Support class for the loading and batching of the co-occurrences of video frames extracted offline.
        The class returns directly the sequence along with its length

        Args:
            root_dir (string): file path of the .npy file containing the co-occurrences
            sequence_length (int): length of the analyzed sequence by the RNN
            transforms (object torchvision.transform): Pytorch's transforms used to process the co-occurrences
    """
    ##  Constructor
    def __init__(self, root_dir, sequence_length=1, transforms=None):
        self.root_dir = root_dir
        self.seq_length = sequence_length
        self.transforms = transforms
        self.co_occurrences = torch.from_numpy(np.load(root_dir)).type(torch.FloatTensor)
        self.co_occurrences = self.co_occurrences.view(int(self.co_occurrences.size()[0]/10875), -1,

    ##  Override total dataset's length getter
    def __len__(self):
        return int(self.co_occurrences.size()[0]*10875)
        #10875 is the number of features vector for each video frame

    ##  Override single items' getter
    def __getitem__(self, idx):
        f_idx = int(np.floor(idx / 10875)) #frame index
        p_idx = int(np.floor(idx % 10875)) #patch index inside the frame from which the features have been extracted
        if self.transforms is not None:
            if f_idx-self.seq_length < 0:
                seq = torch.zeros(self.seq_length, self.co_occurrences.size()[2])
                seq[0:f_idx+1, :] = self.transforms(self.co_occurrences[0:f_idx+1, p_idx, :])
                seq_len = f_idx + 1
                return [seq, seq_len], self.transforms(self.co_occurrences[f_idx, p_idx, :])
                #only need the last element of the sequence as target value for the loss
            else:
                return [self.transforms(self.co_occurrences[f_idx-self.seq_length:f_idx, p_idx, :]), self.seq_length], \
                       self.transforms(self.co_occurrences[f_idx, p_idx, :])
        else:
            if f_idx-self.seq_length < 0:
                seq = torch.zeros(self.seq_length, self.co_occurrences.size()[2])
                seq[0:f_idx+1, :] = self.co_occurrences[0:f_idx+1, p_idx, :]
                seq_len = f_idx + 1
                return [seq, seq_len], self.co_occurrences[f_idx, p_idx, :]
            else:
                return [self.co_occurrences[f_idx-self.seq_length:f_idx, p_idx, :], self.seq_length], \
                       self.co_occurrences[f_idx, p_idx, :]

and my custom collate function for the Dataloader:

def my_collate(batch):

    # Preparing input sequences
    data = [item[0][0] for item in batch]
    x = torch.stack(data)
    seqs_length = torch.ByteTensor([item[0][1] for item in batch])

    # Both padded sequences and lengths should be ordered descendingly wrt to the sequence length
    lengths, indices = torch.sort(seqs_length, descending=True)
    x = x[indices, :, :]

    # Packing the data
    x = torch.nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True)

    # Preparing target values (reordered with the same indices so they stay aligned with x)
    y = [item[1] for item in batch]
    y = torch.stack(y)[indices]

    return [x, y]

Now the whole procedure takes place on GPU!
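
For comparison, the per-batch length count from the earlier my_collate could also be vectorized instead of being moved into the Dataset. The sketch below is only an illustration under the assumption that a time step counts as valid when all of its features are non-zero, as in that earlier loop; the function names and the toy tensor are made up.

```python
import torch

def lengths_loop(x):
    # Same counting rule as the nested loops in the earlier collate function:
    # a time step is valid when every feature in it is non-zero.
    lengths = torch.zeros(x.size(0))
    for i in range(x.size(0)):
        for j in range(x.size(1)):
            if sum(1 for k in x[i, j, :] if k != 0) == x.size(2):
                lengths[i] += 1
    return lengths.long()

def lengths_vectorized(x):
    # One boolean reduction over features, then a count over time steps.
    return (x != 0).all(dim=2).sum(dim=1)

# Toy batch: (batch, seq_length, features) with strictly non-zero features,
# then zero-padded tails of different sizes.
x = torch.rand(5, 25, 8) + 0.1
x[0, 20:] = 0
x[1, 3:] = 0
x[3, 1:] = 0

print(lengths_vectorized(x).tolist())
print(torch.equal(lengths_loop(x), lengths_vectorized(x)))
```

The vectorized version does the same counting in a handful of tensor operations, so the cost per batch no longer grows with a Python-level loop over every element.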

(Nam Vo) #19

The real bug here is that you used a profiler to determine whether the model is running on the GPU or not. That makes no sense to me; the GPU is under-utilized all the time.

(Edoardo Daniele Cannas) #20

Hi @lugiavn,

I’m not sure what you are asking, but probably I was not really clear in explaining what my problem was in the first place, so I’ll try to explain it again.

I have debugged my code with PyCharm, and everything seems to be on the GPU: the input sequences, the LSTM output, the final autoencoder output, etc…, and in fact I can see the data uploaded to the GPU memory, but still, the whole training procedure takes place on the CPU.

Initially, as I wrote in my first post, I used PyCharm’s debugger to check whether my model was actually uploaded to the GPU, and in fact everything was.
Nevertheless, I noticed that the GPU was not working, meaning that when I looked at the percentage of its resources used in a performance profiler, it always indicated 0% or 1%, even though I could see some data uploaded to its memory.
I think the reason for this behaviour was that in my previous code, for each batch, I spent a lot of time computing the sequences’ lengths (a very stupid thing to do) on the CPU in the DataLoader's collate function, while the amount of computation executed on the GPU was so small that I could not see its resource usage (it continuously switched between 0% and 1%).
So maybe it’s better to say that I had a huge bottleneck in my code: when packing each batch, I did a lot of (useless) work on the CPU to compute the sequences’ lengths, while the work executed on the GPU itself finished so quickly (since my network is rather small) that it gave me the impression of not being executed by the GPU at all.
With the fix from my last post, this bottleneck is removed, the training procedure goes as I expect it to, and I can see the GPU’s resources being used (probably because there is no longer a continuous switch between the CPU and the GPU during training): a very stupid error, and I’m really sorry if anyone has wasted time on it!

Anyway, hope this makes more sense to you? Let me know! :slight_smile:
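
For anyone who wants to check for this kind of bottleneck without relying on a utilization graph, here is a rough sketch that times batch loading and model computation separately. The dataset, model, and sizes are hypothetical and chosen only so the sketch runs quickly; it falls back to the CPU when CUDA is unavailable.

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy data and model, sized only to make the sketch run quickly.
dataset = TensorDataset(torch.randn(64, 25, 16), torch.randn(64, 16))
loader = DataLoader(dataset, batch_size=8)
model = nn.LSTM(16, 16, batch_first=True).to(device)

def sync():
    # CUDA kernels run asynchronously; wait for them so the timings are meaningful.
    if device.type == "cuda":
        torch.cuda.synchronize()

load_time = 0.0
compute_time = 0.0
batches = iter(loader)
while True:
    t0 = time.perf_counter()
    try:
        x, y = next(batches)      # time spent in the Dataset and collate function
    except StopIteration:
        break
    t1 = time.perf_counter()
    x = x.to(device)
    with torch.no_grad():
        out, _ = model(x)         # transfer plus forward pass
    sync()
    t2 = time.perf_counter()
    load_time += t1 - t0
    compute_time += t2 - t1

print(f"loading: {load_time:.4f}s, compute: {compute_time:.4f}s")
```

If the loading time dominates, the GPU spends most of each iteration waiting for the next batch, which matches the symptom described in this thread.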