Dataset / Dataloader "best practices" with DataParallel

I’m training a simple LSTM model using two GPUs (2 x GeForce RTX 2080 Ti) via DataParallel:

import torch
import torch.nn as nn

class MyLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.LSTM = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.LNN = nn.Linear(hidden_dim, input_dim)

    def forward(self, i):
        o, _ = self.LSTM(i)
        return self.LNN(o)

model = MyLSTM(input_dim, hidden_dim)
model = nn.DataParallel(model)

Given the size of my input and the amount of available memory, I found that I can use a maximum batch size of 28. Each sequence is a tensor of 1440 * 3969, hence each batch should be a tensor of 28 * 1440 * 3969.
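As a sanity check on those numbers, a single float32 batch of that shape alone works out to roughly 0.6 GiB (a rough sketch; the real footprint also includes activations, gradients, and optimizer state, which is why the batch size tops out well before the raw input would fill the card):

```python
# Rough size of one float32 batch of shape 28 x 1440 x 3969
batch, seq_len, features = 28, 1440, 3969
bytes_per_elem = 4  # float32

batch_bytes = batch * seq_len * features * bytes_per_elem
print(f"{batch_bytes / 1024**3:.2f} GiB")  # ~0.60 GiB
```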

What are the best practices for loading the input data onto the GPUs? From reading around a bit, it seems that one should have the Dataset return CPU tensors, use a DataLoader with pin_memory=True, and then copy each batch to the GPUs during training with .to('cuda', non_blocking=True).
With this approach I get a processing rate of ~0.352 batches per second.
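For reference, a minimal sketch of that pinned-memory pattern (the Dataset class, toy shapes, and device handling below are illustrative assumptions, not my actual training code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SeqDataset(Dataset):
    """Returns plain CPU tensors; pinning is handled by the DataLoader."""
    def __init__(self, data):
        self.data = data  # CPU tensor of shape (N, seq_len, features)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

data = torch.randn(8, 16, 4)  # toy shapes instead of 1440 x 3969
loader = DataLoader(SeqDataset(data), batch_size=4, pin_memory=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch in loader:
    # non_blocking only overlaps the copy when the source tensor is pinned
    batch = batch.to(device, non_blocking=True)
```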

On the other hand, by returning CUDA tensors directly from the Dataset (avoiding pinned memory altogether), I get a higher processing rate of ~0.427 batches per second.
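A sketch of this second approach, with the whole (small) dataset moved to the device up front so __getitem__ just indexes into device memory (class name and shapes are illustrative; it falls back to CPU when no GPU is present):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class GPUDataset(Dataset):
    """Holds the full dataset on the device; __getitem__ returns device tensors."""
    def __init__(self, data, device):
        self.data = data.to(device)  # one-time host-to-device copy

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]  # already on the device, no per-batch copy

device = "cuda" if torch.cuda.is_available() else "cpu"
ds = GPUDataset(torch.randn(8, 16, 4), device)

# num_workers must stay 0: worker processes cannot hand back CUDA tensors this way
loader = DataLoader(ds, batch_size=4, num_workers=0)
for batch in loader:
    pass  # batch already lives on `device`
```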

How come an apparently “sub-optimal” approach yields a higher processing rate?
Am I doing something wrong, or is it related to the (relatively) small batch size?

Your approach might have a slower initialization time, but could be faster during training.
The main drawback is that you most likely load the complete dataset onto the device, which consumes GPU memory that is then unavailable to your model.
Also, preprocessing each sample might not be that easy, e.g. since a lot of image processing transformations from torchvision rely on PIL.Image, which operates on CPU data.

If that’s not the case and your dataset is “quite” small, your approach is valid.

PS: We also recommend using DistributedDataParallel (in the multi-process, single-GPU-per-process mode), which should be the fastest approach.
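A minimal, hedged sketch of that DDP setup (a single process with the gloo backend so it runs anywhere, even on CPU; real multi-GPU training launches one process per GPU, e.g. via torchrun, with backend="nccl" and a DistributedSampler in the DataLoader):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process rendezvous for illustration only
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.LSTM(8, 16, batch_first=True)
ddp_model = DDP(model)  # wraps the module and syncs gradients across ranks

out, _ = ddp_model(torch.randn(4, 10, 8))
print(out.shape)  # torch.Size([4, 10, 16])

dist.destroy_process_group()
```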