I’m training a simple LSTM model on two GPUs (2 × GeForce RTX 2080 Ti) via `DataParallel`:

```
import torch.nn as nn

class MyLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(MyLSTM, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        # batch_first=True: input/output shaped (batch, seq_len, features)
        self.LSTM = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.LNN = nn.Linear(hidden_dim, input_dim)

    def forward(self, i):
        # compact the RNN weights into one contiguous chunk
        # (avoids the cuDNN warning under DataParallel)
        self.LSTM.flatten_parameters()
        o, _ = self.LSTM(i)
        return self.LNN(o)
```

```
# replicate the model across both GPUs and move the parameters onto them
model = nn.DataParallel(model)
model.cuda()
```

Given the size of my input and the amount of available memory, I found that I can use a maximum batch size of 28. Each sequence is a tensor of shape 1440 × 3969, so each batch is a tensor of shape 28 × 1440 × 3969.

What are the best practices for loading the input data onto the GPUs? From reading around a bit, it seems that the `Dataset` should return CPU tensors and the `DataLoader` should be created with `pin_memory=True`; then, during training, each batch is copied to the GPUs with `batch.to('cuda', non_blocking=True)`.
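For reference, this is roughly what that setup looks like (a simplified sketch; `SequenceDataset`, `sequences`, and the loop body are placeholders, not my actual loading code):

```
from torch.utils.data import Dataset, DataLoader

class SequenceDataset(Dataset):
    """Returns plain CPU tensors; the DataLoader takes care of pinning."""
    def __init__(self, sequences):
        # sequences: CPU float tensor of shape (num_sequences, 1440, 3969)
        self.sequences = sequences

    def __len__(self):
        return self.sequences.size(0)

    def __getitem__(self, idx):
        return self.sequences[idx]

loader = DataLoader(SequenceDataset(sequences), batch_size=28,
                    shuffle=True, pin_memory=True)

for batch in loader:
    # asynchronous host-to-device copy from pinned memory
    batch = batch.to('cuda', non_blocking=True)
    output = model(batch)
```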

With this approach I get a processing rate of ~0.352 batches per second.

On the other hand, by returning CUDA tensors directly from the `Dataset` (and hence avoiding pinned memory altogether), I get a higher processing rate of ~0.427 batches per second.
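The second variant looks roughly like this (again a simplified sketch with placeholder names, continuing from the snippet above):

```
from torch.utils.data import Dataset, DataLoader

class CudaSequenceDataset(Dataset):
    """Returns CUDA tensors, so no pinning or non_blocking copy is involved."""
    def __init__(self, sequences):
        self.sequences = sequences  # same CPU tensor as above

    def __len__(self):
        return self.sequences.size(0)

    def __getitem__(self, idx):
        # each sequence is copied to the default GPU inside the Dataset
        return self.sequences[idx].cuda()

# num_workers stays at 0 (the default) since the Dataset returns CUDA tensors
loader = DataLoader(CudaSequenceDataset(sequences), batch_size=28, shuffle=True)

for batch in loader:
    output = model(batch)  # the batch is already on the GPU
```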

Why does the apparently “sub-optimal” approach yield a higher processing rate?

Am I doing something wrong, or is it related to the (relatively) small batch size?