I’m encountering an issue where even though I have pin_memory=True set in my DataLoader, the data remains on the CPU during training. My CUDA GPU is available (torch.cuda.is_available() returns True).
train = MyDataSet()  # instance of an IterableDataset
train_loader = DataLoader(dataset=train, batch_size=batch_size,
                          prefetch_factor=10, num_workers=4, pin_memory=True)

for batch_index, (x_batch, y_batch) in enumerate(train_loader):
    print(x_batch.device)             # prints "cpu"
    print(torch.cuda.is_available())  # prints "True"
    output = model(x_batch)
    # ... training code ...
This results in a RuntimeError stating that input and parameter tensors are on different devices (CPU vs. GPU).
I want to avoid explicitly moving the data to the GPU inside the training loop, because that copy is causing a bottleneck.
I’m thinking of creating a separate process responsible for moving the data onto the GPU and pushing references to those tensors into an mp.Queue. The training loop would then consume batches from the queue instead of iterating over the DataLoader, so data transfers could overlap with computation and the GPU would stay fully utilized.
I haven’t tried this yet. Would this be a good approach, or is there another recommended way to overlap host-to-device transfers? I’m also wondering whether tensors that have already been moved to the GPU can be put into a Queue.
Using num_workers>0 already launches multiple background processes to create batches of data. Note that pin_memory=True does not move data to the GPU; it only pins the host memory so that the copy to the device can happen asynchronously with respect to the host. You still need to move each batch to the device yourself, which is why the model raises the device-mismatch RuntimeError. However, unless you have other work to overlap with the transfer, the benefit of the asynchronous copy might be insignificant.
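A minimal sketch of the standard pattern: pinned host memory plus a `non_blocking=True` copy in the loop. A toy `TensorDataset` stands in for your `MyDataSet`, and the forward/backward pass is omitted; the device selection falls back to CPU so the snippet runs anywhere.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for MyDataSet: 64 samples of 8 features each.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))

# Pinning only makes sense (and only helps) when a CUDA device is present.
loader = DataLoader(dataset, batch_size=16,
                    pin_memory=torch.cuda.is_available())

for x_batch, y_batch in loader:
    # non_blocking=True lets the host-to-device copy overlap with CPU-side
    # work; it is only asynchronous when the source tensor is pinned.
    x_batch = x_batch.to(device, non_blocking=True)
    y_batch = y_batch.to(device, non_blocking=True)
    # ... forward/backward pass here ...
```

With `prefetch_factor` and `num_workers>0` the workers keep batches queued on the host, so this explicit `.to(device)` is usually cheap enough that a custom mp.Queue pipeline is unnecessary.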