I’m training a simple LSTM model on two GPUs (2 × GeForce RTX 2080 Ti) via `nn.DataParallel`. The model:
```python
class MyLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(MyLSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.LSTM = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.LNN = nn.Linear(hidden_dim, input_dim)

    def forward(self, i):
        self.LSTM.flatten_parameters()
        o, _ = self.LSTM(i)
        return self.LNN(o)
```
```python
model = MyLSTM(input_dim, hidden_dim)
model = nn.DataParallel(model)
model.cuda()
```
Given the size of my input and the available memory, the maximum batch size I can use is 28. Each sequence is a 1440 × 3969 tensor, so each batch is a tensor of shape 28 × 1440 × 3969.
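For reference, here is a quick back-of-the-envelope calculation of the batch footprint (assuming float32; adjust `bytes_per_elem` if your dtype differs):

```python
# Approximate memory footprint of one batch of shape 28 x 1440 x 3969,
# assuming float32 (4 bytes per element).
batch, seq_len, features = 28, 1440, 3969
bytes_per_elem = 4
size_gib = batch * seq_len * features * bytes_per_elem / 1024**3
print(f"{size_gib:.2f} GiB per batch")
```

That works out to roughly 0.6 GiB per batch, before counting activations and gradients.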
What are the best practices for loading the input data onto the GPUs? From reading around a bit, it seems one should return CPU tensors from the `Dataset`, create the `DataLoader` with `pin_memory=True`, and then copy each batch to the GPUs during training with `.cuda(non_blocking=True)`.
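A minimal sketch of that pinned-memory pipeline (the dataset class and the small tensor shapes here are illustrative stand-ins, not my real data):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SeqDataset(Dataset):
    """Illustrative dataset that returns CPU tensors."""
    def __init__(self, data):
        self.data = data  # e.g. an (N, 1440, 3969) CPU tensor in the real case

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

data = torch.randn(8, 16, 32)  # small stand-in for the real sequences
# pin_memory=True makes the DataLoader collate batches into page-locked memory
loader = DataLoader(SeqDataset(data), batch_size=4, pin_memory=True)

for batch in loader:
    if torch.cuda.is_available():
        # async host-to-device copy, possible because the batch is pinned
        batch = batch.cuda(non_blocking=True)
    # ... forward / backward pass ...
```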
With this approach I get a processing rate of ~0.352 batches per second.
On the other hand, by returning CUDA tensors directly from the `Dataset` (thus avoiding pinned memory altogether), I get a higher processing rate of ~0.427 batches per second.
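For comparison, the CUDA-tensor variant looks roughly like this (again a sketch with illustrative names and shapes; note that returning CUDA tensors from a `Dataset` generally forces `num_workers=0`, since worker subprocesses cannot easily hand back CUDA tensors):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CudaSeqDataset(Dataset):
    """Illustrative dataset that moves each sample to the GPU in __getitem__."""
    def __init__(self, data, device):
        self.data = data
        self.device = device

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx].to(self.device)

device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(8, 16, 32)  # small stand-in for the real sequences
# pin_memory must stay False: the samples are already on the device,
# and num_workers must be 0 for the reason noted above.
loader = DataLoader(CudaSeqDataset(data, device), batch_size=4,
                    num_workers=0, pin_memory=False)

for batch in loader:
    pass  # batches arrive already on `device`; no explicit copy needed
```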
Why does this apparently “sub-optimal” approach yield a higher processing rate?
Am I doing something wrong, or is it related to the (relatively) small batch size?