Hello, I’m trying to load my data with the DistributedSampler class in order to train a model on multiple GPUs. The model is wrapped with DistributedDataParallel. The data is loaded successfully on both of my 2 GPUs. Here are my code snippets:
# distributed learning
if torch.cuda.device_count() > 1:
    model = torch.nn.parallel.DistributedDataParallel(self.net, device_ids=[range(self.num_gpus)])
else:
    model = self.net

iteration = infos["iteration"]
epoch_start = infos["epoch"]

model.train()
for epoch in range(epoch_start, cfg.TRAIN.MAX_EPOCH):
    self.setup_dataloader(epoch=epoch)
    for _, blobs in enumerate(self.loader):
        print("blobs.size", len(blobs))
        print(blobs)
        loss_dict = model.forward(blobs)
blobs is a list of dicts that contain tensors, the objects in the images, and other additional information (it’s an object detection task based on Faster R-CNN).
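For context, a single element of blobs looks roughly like this (the keys here are illustrative, not my exact field names):

import torch

# Rough shape of one element of blobs (keys illustrative, not exact):
blob = {
    "data": torch.randn(3, 600, 800),          # image tensor (C, H, W)
    "gt_boxes": torch.randn(5, 4),             # ground-truth boxes
    "gt_classes": torch.randint(0, 80, (5,)),  # class labels
    "im_info": (600, 800, 1.5),                # plain-Python metadata (height, width, scale)
}
blobs = [blob, blob]  # one dict per image in the mini-batch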
After calling model.forward(blobs), the following error is reported:
TypeError: list indices must be integers or slices, not range
The corresponding traceback:
Traceback (most recent call last):
  File "tools/train.py", line 456, in <module>
    trainer.train(args)
  File "tools/train.py", line 372, in train
    loss_dict = model.forward(blobs)
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 471, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 17, in scatter_map
    return list(map(list, zip(*map(scatter_map, obj))))
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 88, in forward
    streams = [_get_stream(device) for device in target_gpus]
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 88, in <listcomp>
    streams = [_get_stream(device) for device in target_gpus]
  File "/vol/.conda/envs/.env36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 115, in _get_stream
    if _streams[device] is None:
TypeError: list indices must be integers or slices, not range
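Reading the last frame, _get_stream indexes a per-device list with whatever comes out of target_gpus, and in my case each entry appears to be a range object rather than an int (presumably coming from the device_ids=[range(self.num_gpus)] in my snippet above). A minimal sketch of the failing indexing as I understand it:

# Minimal sketch of what I think fails inside _get_stream
# (_streams is a per-GPU list in torch/nn/parallel/_functions.py):
_streams = [None, None]   # one slot per GPU
device = range(2)         # what each entry of target_gpus appears to be here
_streams[device]          # TypeError: list indices must be integers or slices, not range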
As far as I know, if the input to the model is tensor data, there is no problem training the model on multiple GPUs in distributed mode. Could it be that the problem comes from passing the data to model.forward() as a list?
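In case it is relevant: while writing this up I noticed that I build device_ids as [range(self.num_gpus)], i.e. a list containing a range object. Should it be a list of plain ints instead? A variant I have not verified yet:

# Untested variant: pass plain integer GPU ids instead of [range(...)]
model = torch.nn.parallel.DistributedDataParallel(
    self.net,
    device_ids=list(range(self.num_gpus)),  # e.g. [0, 1] rather than [range(0, 2)]
)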
Everything works if I launch the model on a single GPU only.
Thanks in advance.