Hi guys, I've read a lot in the PyTorch documentation about how to distribute training over multiple GPUs, and I'm confused. I first had to read about multiprocessing in general (because I didn't have any prior knowledge of it). I don't know whether my code is right or not, but every time I run it I get this error:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525812548180/work/aten/src/THC/THCTensorRandom.cu line=25 error=46 : all CUDA-capable devices are busy or unavailable
and:
RuntimeError: cuda runtime error (46) : all CUDA-capable devices are busy or unavailable at /opt/conda/conda-bld/pytorch_1525812548180/work/aten/src/THC/THCTensorRandom.cu:25
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
import torch.optim as optim
from torch.multiprocessing import Process
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

def run(rank, size):
    device = torch.device("cuda")
    dataset = datasets.MNIST('./data', train=True,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))]))
    # each process gets a disjoint shard of the dataset
    train_sampler = DistributedSampler(dataset)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                               num_workers=4,
                                               pin_memory=True,
                                               sampler=train_sampler)
    torch.manual_seed(1234)
    model = Net().cuda()  # Net is my model class (definition omitted here)
    model = torch.nn.parallel.DistributedDataParallel(model)
    optimizer = optim.SGD(model.parameters(),
                          lr=0.01, momentum=0.5)
    num_batches = len(train_loader)
    for epoch in range(10):
        epoch_loss = 0.0
        # reshuffle the shards differently every epoch
        train_sampler.set_epoch(epoch)
        for data, target in train_loader:
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            epoch_loss += loss.item()
            loss.backward()
            optimizer.step()
        print('Rank ', dist.get_rank(), ', epoch ',
              epoch, ': ', epoch_loss / num_batches)

def init_processes(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
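Since every process calls .cuda() without selecting a device, I suspect all four of them end up on GPU 0, which I think would explain the "busy or unavailable" error if the cards run in exclusive compute mode. Here is a variant of run() I'm considering (just a sketch, assuming one process per GPU and at least 4 visible GPUs):

def run(rank, size):
    # pin this process to GPU `rank` so the ranks don't all pile onto GPU 0
    torch.cuda.set_device(rank)
    device = torch.device("cuda", rank)
    model = Net().to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    # ... rest of the training loop unchanged, moving each batch to `device`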
I also read the distributed ImageNet example and tried to do the same thing.
Is this the correct way of using distributed training?
PS: btw, the first error appears 3 times and the second appears 4 times (I only quoted the last line of the second error), and both appear at the same time.
I also tried to google the error, but nothing I found fixed it.
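In case it helps, here is a small sanity check I could run in a single process to see whether all four GPUs are usable at all (again, just a sketch):

import torch

# try to allocate a tensor on every visible GPU from one process
for i in range(torch.cuda.device_count()):
    t = torch.zeros(1, device=torch.device("cuda", i))
    print(i, torch.cuda.get_device_name(i), t.device)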