AssertionError with DistributedDataParallel

I’m trying to use DistributedDataParallel with PyTorch and I’m running into this error, which I haven’t found a solution for elsewhere:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/afs/", line 20, in _wrap
    fn(i, *args)
  File "../", line 5, in train_model
    model.train(args)
  File "../", line 227, in train
    for i, data_dict in enumerate(train_dataloader, 1):
  File "/afs/", line 363, in __next__
    data = self._next_data()
  File "/afs/", line 402, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/afs/", line 357, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/afs/", line 208, in __iter__
    for idx in self.sampler:
  File "/afs/", line 84, in __iter__
    assert len(indices) == self.num_samples
AssertionError

How would I approach fixing this problem?

I can comment out the assertion, but I’m hoping there is a cleaner way to do this.
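For context, the assertion that fires lives in `DistributedSampler.__iter__`: the sampler pads the index list so that it divides evenly across processes, then asserts that each rank's shard has exactly `num_samples` entries. The sketch below is not the actual PyTorch source, just an illustration (with a hypothetical helper name) of why the dataset length reported by `__len__` must be consistent for the assertion to hold:

```python
import math

def make_padded_indices(dataset_len, num_replicas, rank):
    """Illustrative sketch of how a distributed sampler pads and shards
    indices so every replica draws the same number of samples."""
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    indices = list(range(dataset_len))
    # Pad by repeating indices from the front so total_size divides evenly.
    indices += indices[: total_size - len(indices)]

    # Each rank takes a strided slice of the padded list.
    shard = indices[rank:total_size:num_replicas]
    # This mirrors the `assert len(indices) == self.num_samples`
    # from the traceback: it fails if the dataset length the sampler
    # saw at construction no longer matches what __len__ reports.
    assert len(shard) == num_samples
    return shard
```

So a cleaner fix than deleting the assertion is usually to make sure the dataset's `__len__` is stable and identical across all processes.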

This error occurs in the data loader. Could you please share a repro of your data loader setup?

cc @VitalyFedyunin for data loader questions

This also looks similar to (and the proposal in ). Do you use DistributedSampler?
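For reference, this is the typical way to wire a `DistributedSampler` into a `DataLoader` (a minimal sketch; `build_loader`, `world_size`, and `rank` are placeholder names, not from the original post):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, world_size, rank, batch_size=8):
    # One sampler per process: it shards (and pads) the indices so
    # every rank draws the same number of samples per epoch.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler

dataset = TensorDataset(torch.arange(10).float())
loader, sampler = build_loader(dataset, world_size=2, rank=0)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle differently each epoch
    for (batch,) in loader:
        pass
```

Note that when the sampler is created from an initialized process group, `num_replicas` and `rank` can be omitted and are picked up automatically.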

Yeah! It ended up being a problem with my data loader.

I did something and now it works, although I’m not sure what I did.

Could a possible problem be that the data loader was returning None for the elements occasionally?
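If the dataset can occasionally return `None`, one common pattern is to filter those items out in a custom `collate_fn` passed to the `DataLoader`, rather than letting them reach batching. A minimal sketch (the function name is illustrative; in real PyTorch code you would typically forward the filtered list to `default_collate`):

```python
def collate_skip_none(batch):
    """Drop items the dataset returned as None, so a corrupt or
    missing sample shrinks the batch instead of crashing iteration."""
    filtered = [item for item in batch if item is not None]
    if not filtered:
        raise RuntimeError("Entire batch was None; check the dataset.")
    return filtered
```

Usage would look like `DataLoader(dataset, collate_fn=collate_skip_none, ...)`. Note, though, that `None` elements by themselves would not normally trip the sampler-length assertion above; that one points at an inconsistent `__len__`.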

On another note, I’m now getting an Exception: process 0 terminated with signal SIGSEGV when I run my model.