I have an image dataset that doesn’t fit in memory. I want to read minibatches off disk, copy them to the GPU, and train a model on them.
PyTorch’s DataLoader has been very helpful in hiding the cost of loading minibatches behind its worker processes, but the copy to the GPU still happens sequentially in the main process.
I’m trying to pipeline my training loop so that copying data to the GPU happens in parallel with the rest (forward pass, backward pass, etc.) (something like this).
I summarise what I’ve tried so far below, but I think I’m going far down the wrong path with it. What is the correct way to do this in PyTorch?
What I’ve tried:
So far my training loop looks like this:
```python
from torch.utils.data import DataLoader

trainset = MemmapDataset("dataset.npy")
trainloader = DataLoader(trainset, batch_size=param.batch_size,
                         shuffle=True, num_workers=4, pin_memory=True)

for i in range(epochs):
    for features, labels in trainloader:
        features = features.to(device)
        labels = labels.to(device)
        predictions = neural_net(features)
        # ... rest of training ...
```
I want to change that `to(device)` operation so that it runs concurrently with the rest of training: the copy should run asynchronously on one of the CUDA streams while the other training kernels execute, instead of what happens now, where all training stops and the GPU sits essentially idle while it copies data.
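To make "concurrently" concrete, the shape I have in mind is roughly the prefetcher pattern below. This is only a sketch of what I'm after, not working code from my project; `CudaPrefetcher` is a name I made up, not a PyTorch API, and the async copies only actually overlap compute if the DataLoader was built with `pin_memory=True`:

```python
import torch

class CudaPrefetcher:
    """Hypothetical: copy batch k+1 to the GPU on a side stream
    while the model trains on batch k on the default stream."""

    def __init__(self, loader, device):
        self.loader = iter(loader)
        self.device = device
        self.stream = torch.cuda.Stream()  # side stream for host->device copies
        self._preload()

    def _preload(self):
        try:
            features, labels = next(self.loader)
        except StopIteration:
            self.batch = None
            return
        with torch.cuda.stream(self.stream):
            # non_blocking=True is only truly async from pinned host memory
            self.batch = (features.to(self.device, non_blocking=True),
                          labels.to(self.device, non_blocking=True))

    def next(self):
        if self.batch is None:
            return None
        # don't start compute until the copy stream has finished this batch
        torch.cuda.current_stream().wait_stream(self.stream)
        features, labels = self.batch
        # tell the allocator these tensors are now used on the default stream
        features.record_stream(torch.cuda.current_stream())
        labels.record_stream(torch.cuda.current_stream())
        self._preload()  # kick off the copy of the next batch
        return features, labels
```

so the inner loop would pull batches with `prefetcher.next()` until it returns `None`, and while the model chews on batch k, batch k+1 is in flight on the side stream.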
I tried making `to(device)` one of the transforms that runs in my dataset class's `__getitem__` method, like so:
```python
class ToDeviceTransform:
    def __init__(self, device):
        self.device = device

    def __call__(self, data: torch.Tensor):
        return data.contiguous().to(self.device)
```
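For context on why this runs in the workers: the transform is applied inside `__getitem__`, so with `num_workers > 0` it executes in a DataLoader worker process, not in the main process. A simplified sketch of my dataset class (the real one also returns labels and does more preprocessing):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    # simplified sketch; the real class also returns labels
    def __init__(self, path, transform=None):
        self.data = np.load(path, mmap_mode="r")  # memory-mapped, stays on disk
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = torch.from_numpy(np.array(self.data[idx]))  # copy out of the memmap
        if self.transform is not None:
            sample = self.transform(sample)  # ToDeviceTransform runs here, in the worker
        return sample
```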
and this works if `pin_memory=False` in the DataLoader, but the copy is still sequential, and it's now slower because I lose the parallel loading from disk.
Setting `num_workers > 0` throws the following runtime error from my `ToDeviceTransform.__call__()` hack:
```
------------- | Cut long traceback above | -------------------
  File "/media/ihexx/Shared_Partition/projects/ProjectBoom/CarlaEnv/vae/datasets.py", line 42, in __call__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCGeneral.cpp:55
```
I then tried setting `mp.set_start_method('spawn')` in the main process before creating the DataLoader object, but that throws a MemoryError, which I assume is because the worker processes creating the CUDA tensors are exiting and releasing their memory? I haven't been able to get past this.
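For reference, this is roughly where I put the call (I used `torch.multiprocessing` here, though I believe the stock `multiprocessing` module behaves the same for this; `train()` stands in for my entry point):

```python
import torch.multiprocessing as mp

if __name__ == "__main__":
    # must run once, before any DataLoader or CUDA work
    mp.set_start_method('spawn')
    train()  # creates the DataLoader and runs the loop above
```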
Here’s the traceback:
```
  File "/media/ihexx/Shared_Partition/projects/ProjectBoom/CarlaEnv/vae/train_offline.py", line 53, in train
    dataiter = iter(trainloader)
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 193, in __iter__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 469, in __init__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home/ihexx/anaconda3/envs/boom/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
```