Error when using DistributedSampler

What is the meaning of RuntimeError: Expected a 'N2at13CUDAGeneratorE' but found 'PN2at9GeneratorE'?

Traceback (most recent call last):
  File "train_cosine_iterations_distributed.py", line 335, in <module>
    batch_iterator = iter(train_loader)
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 584, in __init__
    self._put_indices()
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 646, in _put_indices
    indices = next(self.sample_iter, None)
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 160, in __iter__
    for idx in self.sampler:
  File "/home/ycg/anaconda3/lib/python3.6/site-packages/torch/utils/data/distributed.py", line 45, in __iter__
    indices = torch.randperm(len(self.dataset), generator=g).tolist()
RuntimeError: Expected a 'N2at13CUDAGeneratorE' but found 'PN2at9GeneratorE'

Is there something wrong with torch.Generator()?

def __iter__(self):
    # deterministically shuffle based on epoch
    g = torch.Generator()
    g.manual_seed(self.epoch)
    indices = torch.randperm(len(self.dataset), generator=g).tolist()

    # add extra samples to make it evenly divisible
    indices += indices[:(self.total_size - len(indices))]
    assert len(indices) == self.total_size

    # subsample
    indices = indices[self.rank:self.total_size:self.num_replicas]
    assert len(indices) == self.num_samples

    return iter(indices)

I tried to reproduce your issue with:

import torch

g = torch.Generator()
g.manual_seed(1)
indices = torch.randperm(100, generator=g).tolist()

It ran without error (I am on PyTorch 1.0.0).
Can you try to give me a reproduction so that I can help figure out what’s going on?

len(self.dataset) == 118,287; I am running the code to train on the MS COCO dataset.

I’m running into this error too, but only when using CUDA. Here’s a minimal script to reproduce:

import torch
device = torch.device('cuda')
generator = torch.manual_seed(123)          # torch.manual_seed returns the default CPU generator
x = torch.zeros(10, device=device)          # tensor lives on the GPU
x.uniform_(-1.0, 1.0, generator=generator)  # CPU generator passed to a CUDA kernel -> RuntimeError

Output:

$ python bug.py 
Traceback (most recent call last):
  File "bug.py", line 5, in <module>
    x.uniform_(-1.0, 1.0, generator=generator)
RuntimeError: Expected a 'N2at13CUDAGeneratorE' but found 'PN2at9GeneratorE'

The same thing happens if I use torch.Generator explicitly. By the way, is torch.Generator documented anywhere?
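For reference, the mangled names in the message are C++ type names: the CUDA kernel expects a CUDA generator (at::CUDAGenerator) but receives the default CPU generator (at::Generator). A minimal sketch of keeping the generator on the same device as the tensor, assuming a recent PyTorch release where torch.Generator accepts a device argument (this is not available on the 0.4/1.0 versions in this thread):

import torch

device = torch.device('cuda')

# Assumption: on newer PyTorch releases a generator can be created on a device,
# so it matches the CUDA tensor that uniform_ fills.
gen = torch.Generator(device=device)
gen.manual_seed(123)

x = torch.zeros(10, device=device)
x.uniform_(-1.0, 1.0, generator=gen)  # CUDA tensor + CUDA generator: no mismatch
print(x)

On the older versions shown above, CUDA operations only accept their internal generator, so the practical rule is simply not to pass a CPU generator into a CUDA op.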

Have you solved this problem? I hit the same error when I use torch.nn.parallel.DistributedDataParallel.
The problem comes from sampler = DistributedSampler(dataset) and
data.DataLoader(dataset, …, sampler=sampler).

Here is the log.

Traceback (most recent call last):
  File "train.py", line 152, in <module>
    sampler=sampler))
  File "/home/xdjf/.conda/envs/py36torch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 501, in __iter__
    return _DataLoaderIter(self)
  File "/home/xdjf/.conda/envs/py36torch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 297, in __init__
    self._put_indices()
  File "/home/xdjf/.conda/envs/py36torch041/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in _put_indices
    indices = next(self.sample_iter, None)
  File "/home/xdjf/.conda/envs/py36torch041/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 138, in __iter__
    for idx in self.sampler:
  File "/home/xdjf/.conda/envs/py36torch041/lib/python3.6/site-packages/torch/utils/data/distributed.py", line 42, in __iter__
    indices = list(torch.randperm(len(self.dataset), generator=g))
RuntimeError: Expected a 'N2at13CUDAGeneratorE' but found 'PN2at9GeneratorE'
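One possible explanation, offered as an assumption rather than something confirmed in this thread: DistributedSampler builds its shuffle with a CPU torch.Generator, so torch.randperm(..., generator=g) must dispatch to the CPU kernel. If the training script sets the default tensor type to a CUDA type before the DataLoader is iterated, randperm produces a CUDA tensor and ends up on the CUDA kernel while the sampler still hands it a CPU generator, giving exactly this mismatch. A short sketch of that interaction (it needs a CUDA machine, and the exact error text differs between PyTorch versions):

import torch

# Assumption: making CUDA the default tensor type early in a training script.
torch.set_default_tensor_type('torch.cuda.FloatTensor')

g = torch.Generator()  # the CPU generator DistributedSampler creates
g.manual_seed(0)

# randperm now returns a CUDA tensor, so it rejects the CPU generator
# and raises the generator-mismatch RuntimeError on affected versions.
indices = torch.randperm(100, generator=g).tolist()

If that is the cause, keeping the default tensor type on the CPU and moving tensors to the GPU explicitly avoids the mismatch.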