I have an image dataset that doesn’t fit in memory. I want to read minibatches off disk, copy them to the GPU, and train a model on them.
PyTorch’s DataLoader has been very helpful in hiding the cost of loading minibatches behind its worker processes, but the copy to the GPU still happens sequentially in the main process.
I’m trying to pipeline my training loop so that copying data to the GPU happens in parallel with the rest (forward pass, backward pass, etc.) (something like this).
I summarise what I’ve tried so far below, but I think I’m going far down the wrong path with it. What is the correct way to do this in PyTorch?
What I’ve tried:
So far my training loop looks like this:
```python
from torch.utils.data import DataLoader

trainset = MemmapDataset("dataset.npy")
trainloader = DataLoader(trainset, batch_size=param.batch_size,
                         shuffle=True, num_workers=4, pin_memory=True)

for i in range(epochs):
    for features, labels in trainloader:
        features = features.to(device)
        labels = labels.to(device)
        predictions = neural_net(features)
        # ... rest of training ...
```
I want to change that `to(device)` operation so that it runs concurrently with the rest of training: the copy should run asynchronously on one of the CUDA streams while the other training kernels execute, instead of what happens now, where all training stops and the GPU sits essentially idle while it copies data.
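To make "concurrently" concrete, the shape I have in mind is roughly the prefetcher pattern below. This is only a sketch of what I'm after, not working code from my project; `CudaPrefetcher` is a name I made up, not a PyTorch API, and the async copies only actually overlap compute if the DataLoader was built with `pin_memory=True`:

```python
import torch

class CudaPrefetcher:
    """Hypothetical: copy batch k+1 to the GPU on a side stream
    while the model trains on batch k on the default stream."""

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()  # side stream for host->device copies
        self._preload()

    def _preload(self):
        try:
            features, labels = next(self.loader)
        except StopIteration:
            self.batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True is only truly async from pinned host memory
            self.batch = (features.to(self.device, non_blocking=True),
                          labels.to(self.device, non_blocking=True))

    def next(self):
        if self.batch is None:
            return None
        # don't start compute until the copy stream has finished this batch
        torch.cuda.current_stream().wait_stream(self.stream)
        features, labels = self.batch
        # tell the allocator these tensors are now used on the default stream
        features.record_stream(torch.cuda.current_stream())
        labels.record_stream(torch.cuda.current_stream())
        self._preload()  # kick off the copy of the next batch
        return features, labels
```

so the inner loop would pull batches with `prefetcher.next()` until it returns `None`, and while the model chews on batch k, batch k+1 is in flight on the side stream.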
I tried making `to(device)` one of the transforms that runs in my dataset class's `__getitem__` method, like so:
```python
class ToDeviceTransform:
    def __init__(self, device):
        self.device = device

    def __call__(self, data: torch.Tensor):
        return data.contiguous().to(self.device)
```
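For context on why this runs in the workers: the transform is applied inside `__getitem__`, so with `num_workers > 0` it executes in a DataLoader worker process, not in the main process. A simplified sketch of my dataset class (the real one also returns labels and does more preprocessing):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    # simplified sketch; the real class also returns labels
    def __init__(self, path, transform=None):
        self.data = np.load(path, mmap_mode="r")  # memory-mapped, stays on disk
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = torch.from_numpy(np.array(self.data[idx]))  # copy out of the memmap
        if self.transform is not None:
            sample = self.transform(sample)  # ToDeviceTransform runs here, in the worker
        return sample
```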
and this works if `pin_memory=False` in the DataLoader, but the copy is still sequential, and it's now slower because I lose the parallel loading from disk.
Setting `num_workers > 0` throws the following runtime error from my `ToDeviceTransform.__call__()` hack:
```
------------- | Cut long traceback above | -------------------
  File "/media/ihexx/Shared_Partition/projects/ProjectBoom/CarlaEnv/vae/datasets.py", line 42, in __call__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:55
```
I then tried setting `mp.set_start_method('spawn')` in the main process before creating the DataLoader object, but that throws a MemoryError, which I assume is because the worker processes creating the CUDA tensors are exiting and releasing their memory? I haven't been able to get past this.
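For reference, this is roughly where I put the call (I used `torch.multiprocessing` here, though I believe the stock `multiprocessing` module behaves the same for this; `train()` stands in for my entry point):

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # must run once, before any DataLoader or CUDA work
    mp.set_start_method('spawn')
    train()  # creates the DataLoader and runs the loop above
```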
Here’s the traceback:
```
  File "/media/ihexx/Shared_Partition/projects/ProjectBoom/CarlaEnv/vae/train_offline.py", line 53, in train
    dataiter = iter(trainloader)
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
```