IndexError: Caught IndexError in replica 0 on device 0

h_xi · April 8, 2022, 1:17pm

This error occurs when I train my model on 8 GPUs, and I don’t know how to fix it.

Traceback (most recent call last):
…
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/hxi/Gait Recognition/SSGait0.4/models/vgg_c3d.py", line 187, in forward
    feature = feature.matmul(self.fc_bin[0])
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/modules/container.py", line 426, in __getitem__
    idx = self._get_abs_string_index(idx)
  File "/root/anaconda3/envs/SSGait/lib/python3.7/site-packages/torch/nn/modules/container.py", line 409, in _get_abs_string_index
    raise IndexError('index {} is out of range'.format(idx))
IndexError: index 0 is out of range

I guess this error might be related to this statement in my model, but I don’t know how to fix it:
def __init__(self):
    self.fc_bin = nn.ParameterList([
        nn.Parameter(
            nn.init.xavier_uniform_(
                torch.zeros(sum(self.bin_numgl), _set_channels[2], _set_channels[2])))
    ])

def forward(self, x):
    feature = feature.matmul(self.fc_bin[0])

When I comment out the line feature = feature.matmul(self.fc_bin[0]), the model runs successfully.

Versions

cuda==11.0, cudatoolkit==11.0, pytorch==1.7.1

ptrblck · April 8, 2022, 4:33pm

It seems the line of code is failing with:

IndexError: index 0 is out of range

so make sure self.fc_bin has the correct shape to be indexed by 0.
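One way to check this is to look at the length of the container and the shapes of its entries right before the failing line. The snippet below is only a sketch: describe_parameter_list is a hypothetical helper name, and the tensor sizes are placeholders rather than values taken from the actual model.

```python
import torch
import torch.nn as nn


def describe_parameter_list(plist):
    """Return (number of entries, list of entry shapes) for an nn.ParameterList."""
    return len(plist), [tuple(p.shape) for p in plist]


# Placeholder sizes, chosen only for illustration.
fc_bin = nn.ParameterList([nn.Parameter(torch.zeros(64, 128, 128))])
print(describe_parameter_list(fc_bin))

# An empty list is the case in which indexing with 0 raises IndexError.
print(describe_parameter_list(nn.ParameterList()))
```

Calling the helper on self.fc_bin at the top of forward, in both the single-GPU and the multi-GPU run, should show whether index 0 is valid in each case.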

h_xi · April 9, 2022, 2:21am

self.fc_bin[0].shape is [64, 128, 128]
feature.shape is [64, 16, 128]
so I think self.fc_bin has the correct shape for feature = feature.matmul(self.fc_bin[0]).
Note that training works normally on a single GPU; the error only occurs when using 8 GPUs.

ptrblck · April 9, 2022, 5:21am

I guess the shape might be different (and invalid) in the 8 GPU case, which raises the error.
If you get stuck, please post a minimal, executable code snippet to reproduce the error.
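A minimal, executable snippet of the kind requested could look like the sketch below. It rests on an assumption that is not confirmed anywhere in this thread: that the failure comes from nn.DataParallel replicas seeing an nn.ParameterList as empty, which is a known limitation in some PyTorch releases. ListParamModel, PlainParamModel, run, and all sizes are hypothetical stand-ins, not the original model. The second class registers the weight as a plain nn.Parameter, which may be worth trying as a workaround if the assumption holds.

```python
import torch
import torch.nn as nn


class ListParamModel(nn.Module):
    """Hypothetical stand-in that keeps its weight in an nn.ParameterList."""

    def __init__(self, bins=4, channels=8):
        super().__init__()
        weight = nn.init.xavier_uniform_(torch.zeros(bins, channels, channels))
        self.fc_bin = nn.ParameterList([nn.Parameter(weight)])

    def forward(self, x):
        # Report what this replica sees; an empty list here would explain
        # "index 0 is out of range".
        print(f"{x.device}: len(fc_bin) = {len(self.fc_bin)}")
        feature = x.permute(1, 0, 2)              # (bins, batch, channels)
        feature = feature.matmul(self.fc_bin[0])  # batched matmul, one matrix per bin
        return feature.permute(1, 0, 2)           # back to (batch, bins, channels)


class PlainParamModel(nn.Module):
    """Same computation, with the weight registered directly as an nn.Parameter."""

    def __init__(self, bins=4, channels=8):
        super().__init__()
        weight = nn.init.xavier_uniform_(torch.zeros(bins, channels, channels))
        self.fc_bin = nn.Parameter(weight)

    def forward(self, x):
        feature = x.permute(1, 0, 2).matmul(self.fc_bin)
        return feature.permute(1, 0, 2)


def run(model, batch=16, bins=4, channels=8):
    """Run one forward pass, wrapping in nn.DataParallel when several GPUs exist."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    x = torch.randn(batch, bins, channels, device=device)
    return model(x)


if __name__ == "__main__":
    print(run(PlainParamModel()).shape)
    # On an affected multi-GPU setup this call is the one expected to fail.
    print(run(ListParamModel()).shape)
```

On a single device both models take the same code path, so the snippet only says something about the multi-GPU case when it is run on a machine with more than one visible GPU.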