Hi! I’ve been trying to use nn.DataParallel for my model, but I’m running into this error:
Traceback (most recent call last):
  File "train.py", line 382, in <module>
    main(args, hparams)
  File "train.py", line 373, in main
    train(model, train_dataset, val_dataset, retrieved_tokens, args.output_dir, hparams, device)
  File "train.py", line 314, in train
    train_loss, train_accuracy, train_per_token_accuracy, train_perplexity = train_epoch(model, train_loader, retrieved_tokens, optimizer, hparams, device)
  File "train.py", line 236, in train_epoch
    result = model(x_tokens, retrieved, mask_tokens)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1148, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 158, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 175, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 44, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter
    res = scatter_map(inputs)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 23, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 96, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/yeh803/.conda/envs/plm/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 189, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: peer mapping resources exhausted
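For context, my wrapping code follows the standard nn.DataParallel pattern; here is a stripped-down sketch (ToyModel, the layer, and the tensor shapes are placeholders, not my real model — only the forward signature matches the call in train_epoch):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Placeholder standing in for the real model; the forward signature
    # mirrors the failing call: model(x_tokens, retrieved, mask_tokens).
    def __init__(self, dim=16):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_tokens, retrieved, mask_tokens):
        # Trivial combination of the inputs, just to exercise the scatter path.
        return self.proj(x_tokens + retrieved) * mask_tokens

model = ToyModel()
if torch.cuda.is_available():
    model = model.cuda()
model = nn.DataParallel(model)  # splits the batch across all visible GPUs

x = torch.randn(8, 16)
retrieved = torch.randn(8, 16)
mask = torch.ones(8, 16)
if torch.cuda.is_available():
    x, retrieved, mask = x.cuda(), retrieved.cuda(), mask.cuda()

# With multiple GPUs, the inputs are scattered here; this is the line
# where the "peer mapping resources exhausted" error is raised.
out = model(x, retrieved, mask)
print(out.shape)
```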
I tried running with 8 GPUs and with 4 GPUs, and the problem persists in both cases. I also tried export NCCL_P2P_DISABLE=1 before launching, but that didn’t help either. I have 80 GB of free RAM, so CPU memory shouldn’t be the problem. Any help would be much appreciated!
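In case it helps with diagnosing, I can also dump which GPU pairs report direct peer-to-peer access. This is just a diagnostic sketch based on my understanding that "peer mapping resources exhausted" means too many P2P mappings were enabled at once (the hardware caps the number of simultaneous peers per device) — please correct me if that's off:

```python
import torch

# List which GPU pairs report direct peer-to-peer (P2P) access.
# On a machine with no visible CUDA devices this simply prints nothing.
peer_ok = {}
if torch.cuda.is_available():
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j:
                peer_ok[(i, j)] = torch.cuda.can_device_access_peer(i, j)

for (i, j), ok in sorted(peer_ok.items()):
    print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```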