Sharing GPU tensors with torch.utils.data.DataLoader

I want to use PyTorch’s DataLoader with multiprocessing to return tensors allocated on the GPU. The docs explicitly say this is not recommended and point to the CUDA in multiprocessing documentation.

From the documentation, I gather that this should work properly as long as I wrap my training code in if __name__ == '__main__', use “spawn” as the multiprocessing context, keep persistent_workers=True, and del the tensors after using them in the main training loop, so that the workers can release the memory of the copied tensors. Is this correct, or are there any other issues I should be aware of?
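
For concreteness, here is a minimal sketch of the setup I have in mind. GpuTensorDataset and run_epoch are placeholder names I made up for this example, not anything from PyTorch, and the sketch falls back to the CPU when no GPU is available so it can be run anywhere. I am not claiming this is the recommended pattern, only that it reflects the four conditions listed above.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class GpuTensorDataset(Dataset):
    """Placeholder dataset that creates each sample directly on `device`."""

    def __init__(self, length, device):
        self.length = length
        self.device = device

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # In the multi-worker case this runs inside a worker process,
        # so the tensor is allocated there.
        return torch.full((4,), float(index), device=self.device)


def run_epoch(num_workers=2):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    use_workers = num_workers > 0
    loader = DataLoader(
        GpuTensorDataset(8, device),
        batch_size=4,
        num_workers=num_workers,
        # Both options are only valid when num_workers > 0.
        multiprocessing_context="spawn" if use_workers else None,
        persistent_workers=use_workers,
    )
    total = 0.0
    for batch in loader:
        total += batch.sum().item()
        # Drop the main-process reference as soon as the batch is consumed.
        del batch
    return total


if __name__ == "__main__":
    # The guard keeps spawned workers from re-running the training code
    # when they re-import this module.
    print(run_epoch())
```

pin_memory is left at its default of False here, since the tensors are already on the device. This should be saved and run as a script file, because with “spawn” the workers need to import the dataset class from the main module.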