RuntimeError: Tensors must be CUDA and dense

I hit this error when using torch.nn.parallel.DistributedDataParallel (PyTorch 1.4.0), together with code like the following:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tensor = torch.zeros(*shape, device=device).scatter_add(1, segment_ids, data)

File "/home/gezi/mine/pikachu/utils/melt/eager/train.py", line 1398, in train
loss.backward()
File "/home/gezi/env/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 195, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/gezi/env/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Tensors must be CUDA and dense

How can I solve this? I have tried several variants, such as
tensor = torch.zeros(*shape).cuda().scatter_add(1, segment_ids, data)
but that only works with DataParallel, not with DistributedDataParallel.
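
To make that concrete, here is a sketch of the kind of device-consistent variant I mean (shape, segment_ids and data as above; the idea is just to force every scatter_add operand onto the same CUDA device):

device = data.device  # reuse whatever device the batch tensors already live on
tensor = torch.zeros(*shape, device=device).scatter_add(
    1, segment_ids.to(device), data)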

Another problem with DistributedDataParallel is that each process uses all of the GPUs. Is this by design?

Are you using sparse tensors?

Another problem with DistributedDataParallel is that each process uses all of the GPUs. Is this by design?

How did you construct DDP? You need to either set the device_ids arg properly or use the CUDA_VISIBLE_DEVICES env var to configure it, and make sure that no two DDP processes share the same GPU. Otherwise, each process will try to use all visible devices, and when two DDP processes share a GPU, training can hang.
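
For reference, a minimal sketch of a one-process-per-GPU DDP setup (not your exact code; it assumes rank/world_size come from your launcher and that MASTER_ADDR/MASTER_PORT are set in the environment):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model, rank, world_size):
    # one process per GPU; the default env:// init reads MASTER_ADDR/MASTER_PORT
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)        # pin this process to GPU `rank`
    model = model.to(rank)
    # device_ids/output_device restrict DDP in this process to that single GPU
    return DDP(model, device_ids=[rank], output_device=rank)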

  1. Yes, I'm using Embedding with sparse=True. But DDP only seems to break when I use scatter_add later.
  2. I run DDP with 2 processes and set CUDA_VISIBLE_DEVICES=0,1 for each process.
    The code looks like this:
    rank = dist.get_rank()
    device = torch.device('cuda', rank)
    model = model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank], output_device=rank)
    I also tried launching the two processes with CUDA_VISIBLE_DEVICES=0 and CUDA_VISIBLE_DEVICES=1 respectively, but that did not seem to work.

@mrshenli The second problem turned out to be on my side: I read data with a tf.data dataset in eager mode before converting it to torch tensors, and TensorFlow was allocating on all of the GPUs. That problem is solved now.
As for the first problem, I found it is indeed caused by the sparse embedding, not by scatter_add. So it is the same issue as DistributedDataParallel Sparse Embeddings.
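
For anyone hitting the same thing: as far as I understand, DDP over the NCCL backend cannot allreduce sparse gradients (hence the "Tensors must be CUDA and dense" error), so one simple workaround is to drop sparse=True. A sketch with placeholder sizes:

import torch.nn as nn

vocab_size, emb_dim = 100_000, 128  # placeholder sizes for illustration
# sparse=True makes the embedding produce sparse gradients, which the NCCL
# allreduce used by DDP rejects; a dense embedding sidesteps that.
embedding = nn.Embedding(vocab_size, emb_dim, sparse=False)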