I’m training a simple LSTM model on two GPUs (2 × GeForce RTX 2080 Ti) via `nn.DataParallel`. The model:
```python
class MyLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(MyLSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.LSTM = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.LNN = nn.Linear(hidden_dim, input_dim)

    def forward(self, i):
        self.LSTM.flatten_parameters()
        o, _ = self.LSTM(i)
        return self.LNN(o)
```
```python
model = MyLSTM(input_dim, hidden_dim)
model = nn.DataParallel(model)
model.cuda()
```
Given the size of my input and the available memory, the maximum batch size I can use is 28. Each sequence is a 1440 × 3969 tensor, so each batch is a tensor of shape 28 × 1440 × 3969.
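For reference, here is a quick back-of-the-envelope calculation of the batch footprint (assuming float32; adjust `bytes_per_elem` if your dtype differs):

```python
# Approximate memory footprint of one batch of shape 28 x 1440 x 3969,
# assuming float32 (4 bytes per element).
batch, seq_len, features = 28, 1440, 3969
bytes_per_elem = 4
size_gib = batch * seq_len * features * bytes_per_elem / 1024**3
print(f"{size_gib:.2f} GiB per batch")
```

That works out to roughly 0.6 GiB per batch, before counting activations and gradients.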
What are the best practices for loading the input data onto the GPUs? From reading around a bit, it seems one should return CPU tensors from the `Dataset`, create the `DataLoader` with `pin_memory=True`, and then copy each batch to the GPUs during training with `.cuda(non_blocking=True)`.
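A minimal sketch of that pinned-memory pipeline (the dataset class and the small tensor shapes here are illustrative stand-ins, not my real data):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SeqDataset(Dataset):
    """Illustrative dataset that returns CPU tensors."""
    def __init__(self, data):
        self.data = data  # e.g. an (N, 1440, 3969) CPU tensor in the real case

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

data = torch.randn(8, 16, 32)  # small stand-in for the real sequences
# pin_memory=True makes the DataLoader collate batches into page-locked memory
loader = DataLoader(SeqDataset(data), batch_size=4, pin_memory=True)

for batch in loader:
    if torch.cuda.is_available():
        # async host-to-device copy, possible because the batch is pinned
        batch = batch.cuda(non_blocking=True)
    # ... forward / backward pass ...
```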
With this approach I get a processing rate of ~0.352 batches per second.
On the other hand, by returning CUDA tensors directly from the `Dataset` (thus avoiding pinned memory altogether), I get a higher processing rate of ~0.427 batches per second.
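For comparison, the CUDA-tensor variant looks roughly like this (again a sketch with illustrative names and shapes; note that returning CUDA tensors from a `Dataset` generally forces `num_workers=0`, since worker subprocesses cannot easily hand back CUDA tensors):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CudaSeqDataset(Dataset):
    """Illustrative dataset that moves each sample to the GPU in __getitem__."""
    def __init__(self, data, device):
        self.data = data
        self.device = device

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx].to(self.device)

device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.randn(8, 16, 32)  # small stand-in for the real sequences
# pin_memory must stay False: the samples are already on the device,
# and num_workers must be 0 for the reason noted above.
loader = DataLoader(CudaSeqDataset(data, device), batch_size=4,
                    num_workers=0, pin_memory=False)

for batch in loader:
    pass  # batches arrive already on `device`; no explicit copy needed
```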
Why does this apparently “sub-optimal” approach yield a higher processing rate?
Am I doing something wrong, or is it related to the (relatively) small batch size?