Transfer learning tutorial - RuntimeError: CUDA error: device-side assert triggered

Hello, I am currently working through the transfer learning tutorial on the PyTorch website, and I ran into this error while trying to train the model.
https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html


RuntimeError Traceback (most recent call last)
in
5 model_ft.fc = nn.Linear(num_ftrs, 2)
6
----> 7 model_ft = model_ft.to(device)
8
9 criterion = nn.CrossEntropyLoss()

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
605 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
606
--> 607 return self._apply(convert)
608
609 def register_backward_hook(

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/modules/module.py in _apply(self, fn)
352 def _apply(self, fn):
353 for module in self.children():
--> 354 module._apply(fn)
355
356 def compute_should_use_set_data(tensor, tensor_applied):

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/modules/module.py in _apply(self, fn)
374 # with torch.no_grad():
375 with torch.no_grad():
--> 376 param_applied = fn(param)
377 should_use_set_data = compute_should_use_set_data(param, param_applied)
378 if should_use_set_data:

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/modules/module.py in convert(t)
603 if convert_to_format is not None and t.dim() == 4:
604 return t.to(device, dtype if t.is_floating_point() else None, non_blocking, memory_format=convert_to_format)
--> 605 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
606
607 return self._apply(convert)

RuntimeError: CUDA error: device-side assert triggered

It is clearly an indexing error. I tried to run it only on my CPU, and after running it once, this error message came up:

IndexError Traceback (most recent call last)
in
----> 1 model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
2 num_epochs=25)

in train_model(model, criterion, optimizer, scheduler, num_epochs)
32 outputs = model(inputs)
33 _, preds = torch.max(outputs, 1)
--> 34 loss = criterion(outputs, labels)
35
36 # backward + optimize only if in training phase

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
720 result = self._slow_forward(*input, **kwargs)
721 else:
--> 722 result = self.forward(*input, **kwargs)
723 for hook in itertools.chain(
724 _global_forward_hooks.values(),

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/modules/loss.py in forward(self, input, target)
945
946 def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 947 return F.cross_entropy(input, target, weight=self.weight,
948 ignore_index=self.ignore_index, reduction=self.reduction)
949

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction)
2420 if size_average is not None or reduce is not None:
2421 reduction = _Reduction.legacy_get_string(size_average, reduce)
--> 2422 return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
2423
2424

~/venvs/p_donov/lib/python3.8/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
2216 .format(input.size(0), target.size(0)))
2217 if dim == 2:
--> 2218 ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
2219 elif dim == 4:
2220 ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

IndexError: Target 2 is out of bounds.

The raised error points towards an invalid index in the target:

IndexError: Target 2 is out of bounds.

In a multi-class classification with nn.CrossEntropyLoss the model output is expected to have the shape [batch_size, nb_classes], while the target should have the shape [batch_size] and contain class indices in the range [0, nb_classes-1].
Since your target contains a class index of 2, you are working with at least 3 different classes (class indices [0, 1, 2]), so your model output should have the shape [batch_size, >=3], which doesn’t seem to be the case.
As a side note, this also explains why the CUDA run failed at an unrelated line: CUDA operations are executed asynchronously, so a device-side assert can surface at a later call such as model_ft.to(device). Running on the CPU, as you did (or rerunning with CUDA_LAUNCH_BLOCKING=1), gives the real error and stack trace. A minimal sketch of the shape contract follows, and after it a snippet showing how to size the final layer from the dataset.
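As an illustration, here is a minimal, self-contained sketch of that contract (the tensors and values are made up for the example, not taken from the tutorial):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

output = torch.randn(4, 2)                # [batch_size=4, nb_classes=2]
good_target = torch.tensor([0, 1, 1, 0])  # valid indices for 2 classes: 0 and 1
loss = criterion(output, good_target)     # works

bad_target = torch.tensor([0, 1, 2, 1])   # index 2 is invalid for a 2-class output
# criterion(output, bad_target)           # raises IndexError: Target 2 is out of bounds.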

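The usual fix is to size the final layer from the dataset instead of hard-coding 2 (the tutorial hard-codes 2 because its ants/bees dataset has exactly two classes). A sketch, assuming the tutorial's image_datasets dict of ImageFolder datasets is already built:

import torch.nn as nn
from torchvision import models

model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features

# image_datasets['train'] is the tutorial's ImageFolder dataset; .classes
# lists the class folders, so its length is the number of classes
num_classes = len(image_datasets['train'].classes)
model_ft.fc = nn.Linear(num_ftrs, num_classes)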

I have another question regarding this example. Is the network used here a prototypical network or a matching network? Papers usually distinguish between the two, and (Snell et al., NIPS 2017), Prototypical Networks for Few-shot Learning, mentions that prototypical networks are better for training. Could somebody explain how I can tell which kind of network I am dealing with?
Thanks

I don’t know what the difference is and would thus recommend creating a new topic for this question.