Hi! I want to use torch.multiprocessing to speed up my training process. In short, the original training structure is as follows:
import torch
from torch import optim
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=train_batch, shuffle=True)
model = Model(...)
optimizer = optim.SGD(model.parameters(), ...)
for i in range(epochs):
    # Unpack each (pos, neg) batch directly; enumerate() would yield (index, batch).
    for pos, neg in train_loader:
        pos = pos.to(device).to(torch.long)
        neg = neg.to(device).to(torch.long)
        optimizer.zero_grad()
        loss = model(pos, neg)
        loss.backward()
        optimizer.step()
I want to use multiprocessing in each epoch, so I changed my code structure into this:
import torch.multiprocessing as mp

def train(model, train_loader, device):
    optimizer = optim.SGD(model.parameters(), ...)
    # Unpack each (pos, neg) batch directly; enumerate() would yield (index, batch).
    for pos, neg in train_loader:
        pos = pos.to(device).to(torch.long)
        neg = neg.to(device).to(torch.long)
        optimizer.zero_grad()
        loss = model(pos, neg)
        loss.backward()
        optimizer.step()

train_loader = DataLoader(train_dataset, batch_size=train_batch, shuffle=True)
model = Model(...)
for i in range(epochs):
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model, train_loader, device))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
But when I ran it, I got errors I don’t understand (my system is Windows 10):
THCudaCheck FAIL file=C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp line=245 error=71 : operation not supported
Traceback (most recent call last):
  File "D:/PyCharm/Projects/TransE/train.py", line 113, in <module>
    p.start()
  File "D:\Python3.7.3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "D:\Python3.7.3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "D:\Python3.7.3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\Python3.7.3\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Python3.7.3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "D:\Python3.7.3\lib\site-packages\torch\multiprocessing\reductions.py", line 232, in reduce_tensor
    event_sync_required) = storage._share_cuda_()
RuntimeError: cuda runtime error (71) : operation not supported at C:\w\1\s\windows\pytorch\torch/csrc/generic/StorageSharing.cpp:245
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\Python3.7.3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "D:\Python3.7.3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
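Reading the first traceback from the bottom up: `p.start()` goes through `popen_spawn_win32.py`, which pickles the `Process` object (including its `args`) with `ForkingPickler` to send it to the child. Pickling the model reaches `reduce_tensor`, which calls `storage._share_cuda_()`, and that call fails with error 71. So the failure seems to happen while the arguments are being sent to the child process, before `train` ever runs. Below is a minimal standard-library sketch of that mechanism. It is not PyTorch code: `FakeCudaTensor` and `reduce_fake_tensor` are hypothetical stand-ins for a CUDA tensor and its reducer, and the registration call mirrors how I understand torch.multiprocessing hooks its reducers into `ForkingPickler`.

```python
import io
import pickle
from multiprocessing.reduction import ForkingPickler


class FakeCudaTensor:
    """Hypothetical stand-in for an object that cannot be shared across processes."""


def reduce_fake_tensor(obj):
    # Plays the role of a tensor reducer: it is called during pickling and fails.
    raise RuntimeError("operation not supported")


# Register a custom reducer for the stand-in type on ForkingPickler.
ForkingPickler.register(FakeCudaTensor, reduce_fake_tensor)


def try_send_to_child(args):
    """Pickle `args` the way reduction.dump() does; return the error text, or None."""
    buffer = io.BytesIO()
    try:
        ForkingPickler(buffer, pickle.HIGHEST_PROTOCOL).dump(args)
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Plain data pickles without error.
    print(try_send_to_child((1, "plain data")))
    # The stand-in object fails inside its reducer, as in the traceback.
    print(try_send_to_child((FakeCudaTensor(),)))
```

The second traceback, from the child process, is probably just a consequence of the first: the parent failed while writing the pickled data, so the child had nothing complete to load.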
Is my structure wrong, and how can I fix it? Thanks!
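One possible direction, offered as a sketch under assumptions rather than a verified fix: the PyTorch Windows FAQ states that CUDA IPC operations are not supported on Windows and suggests either not using multiprocessing or sharing CPU tensors instead. The example below follows the second option. It keeps the model on the CPU, calls `share_memory()` so all workers update the same parameters, and starts the processes under `if __name__ == "__main__":`, which the spawn start method on Windows requires. `TinyModel`, `make_dataset`, and the hyperparameters are hypothetical placeholders for `Model(...)` and `train_dataset`, which the post does not show.

```python
import torch
import torch.multiprocessing as mp
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(nn.Module):
    """Hypothetical stand-in for Model(...): returns a scalar loss for (pos, neg)."""

    def __init__(self, num_entities=100, dim=8):
        super().__init__()
        self.emb = nn.Embedding(num_entities, dim)

    def forward(self, pos, neg):
        # Toy margin loss on embedding norms, only to have something to optimize.
        pos_score = self.emb(pos).norm(dim=-1)
        neg_score = self.emb(neg).norm(dim=-1)
        return torch.relu(1.0 + pos_score - neg_score).mean()


def make_dataset(n=256, num_entities=100):
    # Random (pos, neg) index pairs standing in for the real training data.
    pos = torch.randint(0, num_entities, (n,))
    neg = torch.randint(0, num_entities, (n,))
    return TensorDataset(pos, neg)


def train(model, dataset, epochs, batch_size):
    # Each worker builds its own loader and optimizer; only the parameters are shared.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    for _ in range(epochs):
        for pos, neg in loader:
            optimizer.zero_grad()
            loss = model(pos, neg)
            loss.backward()
            optimizer.step()


if __name__ == "__main__":
    model = TinyModel()    # stays on the CPU
    model.share_memory()   # move parameters into shared memory
    dataset = make_dataset()
    processes = []
    for rank in range(2):
        p = mp.Process(target=train, args=(model, dataset, 2, 32))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Here the workers are started once and loop over the epochs themselves, instead of being recreated every epoch. Each worker still iterates over the full dataset, as in the structure above, so whether this actually speeds up training would need to be measured.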
PS: There don't seem to be many tutorials on how to use torch.multiprocessing, and I am not familiar with multiprocessing …