Is there any way to load data onto the GPU directly?

In every training loop, I use a DataLoader to load a batch of images into CPU memory and then move it to the GPU like this:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets

batchsize = 64
train_dataset = datasets.CIFAR10(blahblah…)
train_loader = DataLoader(train_dataset, batch_size=batchsize, shuffle=True, num_workers=2)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train(epoch):
    for batch_index, data in enumerate(train_loader, 0):
        inputs, labels = data
        inputs, labels = inputs.to(device), labels.to(device)

I know that I can wrap the DataLoader and call .to(device) in advance instead of calling it on every training batch. But .to(device) itself is time-consuming. I mean, transferring a tensor from the CPU to the GPU is much slower than creating a tensor directly on the GPU, isn’t it?

For example (randomTensorA is created on the CPU and then moved to the GPU with .to(); randomTensorB is created directly on the GPU):

import time
import torch

shape = [300, 300, 300]

# Case A: create the tensor on the CPU, then copy it to the GPU
a = time.time()
for _ in range(100):
    randomTensorA = torch.randn(shape).to(torch.device('cuda'))
b = time.time()
print('Elapsed Time: %f' % (b-a))

# Case B: create the tensor directly on the GPU
a = time.time()
for _ in range(100):
    randomTensorB = torch.randn(shape, device='cuda')
b = time.time()
print('Elapsed Time: %f' % (b-a))

Terminal output:

Elapsed Time: 24.316857
Elapsed Time: 1.658716

So, is there any way to have the DataLoader load the dataset directly onto the GPU? Please let me know, thanks.
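
A note on the benchmark in the question: CUDA operations are launched asynchronously, so timing them with time.time() alone can be misleading, and the first loop also measures random-number generation on the CPU, not just the copy. Below is a minimal sketch of a more careful measurement, assuming a small CIFAR10-sized batch shape chosen for illustration. The helper name `timed` is hypothetical, and the code falls back to the CPU when no GPU is present.

```python
import time
import torch

# Fall back to the CPU so the sketch still runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
shape = [64, 3, 32, 32]  # one CIFAR10-sized batch (assumed size, for illustration)


def timed(fn, repeats=20):
    """Return the average wall-clock seconds per call of fn() (hypothetical helper)."""
    fn()  # warm-up call so one-time CUDA initialization is not measured
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device.type == "cuda":
        # Wait for all queued GPU work to finish before stopping the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats


cpu_tensor = torch.randn(shape)

create_on_cpu = timed(lambda: torch.randn(shape))                     # generation only
copy_to_device = timed(lambda: cpu_tensor.to(device))                 # transfer only
create_on_device = timed(lambda: torch.randn(shape, device=device))   # generation on the device

print(f"create on CPU:    {create_on_cpu:.6f} s")
print(f"copy to device:   {copy_to_device:.6f} s")
print(f"create on device: {create_on_device:.6f} s")
```

Separating the three steps shows how much of the first number in the question comes from generating the data on the CPU and how much from the transfer itself.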

The CIFAR10 dataset is stored as binary data on your SSD, so you need to move it to the GPU at some point; you cannot “create” it on the device.
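
Since the copy cannot be avoided, one option for a small dataset is to pay for it once: move the whole dataset to the GPU before training and slice mini-batches from the tensors that are already there. The following is a rough sketch under stated assumptions, not a drop-in replacement for DataLoader. It uses random stand-in data instead of CIFAR10, and `gpu_batches` is a hypothetical helper name. For scale, 50,000 CIFAR10 training images of shape 3x32x32 stored as float32 take about 614 MB.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for a dataset already decoded into tensors (hypothetical data).
num_samples, batch_size = 1000, 64
all_inputs = torch.randn(num_samples, 3, 32, 32)
all_labels = torch.randint(0, 10, (num_samples,))

# One-time host-to-device copy of the whole dataset.
all_inputs = all_inputs.to(device)
all_labels = all_labels.to(device)


def gpu_batches(inputs, labels, batch_size, shuffle=True):
    """Yield mini-batches by indexing tensors that already live on the device."""
    n = inputs.shape[0]
    if shuffle:
        order = torch.randperm(n, device=inputs.device)
    else:
        order = torch.arange(n, device=inputs.device)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield inputs[idx], labels[idx]


seen = 0
for batch_inputs, batch_labels in gpu_batches(all_inputs, all_labels, batch_size):
    seen += batch_inputs.shape[0]
    # The forward and backward pass would go here; no .to(device) is needed.

print(seen)
```

The trade-off is that the dataset has to fit in GPU memory, and any per-sample augmentation normally applied by the dataset's transforms would have to be done on the device instead.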

My solution is to save the data as tensors with torch.save() in advance and then load them straight onto the GPU like this:

torch.load('./data.pt', map_location='cuda:0')
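
A minimal round-trip sketch of this approach, assuming the data is stored as a dictionary of tensors. The keys "inputs" and "labels", the random stand-in data, and the temporary file path are illustrative choices, and the device falls back to the CPU when no GPU is available.

```python
import os
import tempfile
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in tensors (hypothetical data in place of the real dataset).
inputs = torch.randn(256, 3, 32, 32)
labels = torch.randint(0, 10, (256,))

# Save once, ahead of training.
path = os.path.join(tempfile.mkdtemp(), "data.pt")
torch.save({"inputs": inputs, "labels": labels}, path)

# map_location places the loaded tensors on the requested device.
data = torch.load(path, map_location=device)

print(data["inputs"].device, data["inputs"].shape)
```

As far as I know, the file is still read from disk through host memory before the tensors are moved to the device, so this removes the per-batch .to(device) calls but not the transfer itself.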