I am getting CUDA runtime errors that depend on the number of GPUs used. The problem seems to occur when I train a customised model, which uses a pretrained ResNet34 as one of its parts, on multiple GPUs. I couldn't figure out what the problem is.
I have 3 models:
Model A: customised model
Model B: pretrained ResNet34
Model C: Model A + some_linear_layers + Model B
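Roughly, Model C is put together like the sketch below (the placeholder ModelA, the fusion layers and the feature sizes are illustrative assumptions, not the actual architecture; the real code lives in model/model.py):

import torch
import torch.nn as nn
import torchvision

class ModelA(nn.Module):
    """Placeholder for the customised model (Model A); the real one is project-specific."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 512),
        )

    def forward(self, x):
        return self.net(x)

class ModelC(nn.Module):
    """Rough sketch of Model C = Model A + some_linear_layers + pretrained ResNet34 (Model B).
    Feature sizes, fusion layers and outputs are illustrative only."""
    def __init__(self):
        super().__init__()
        self.street_processor = ModelA()                                    # Model A branch
        self.sat_processor = torchvision.models.resnet34(pretrained=True)   # Model B branch
        self.fusion = nn.Sequential(nn.Linear(512 + 1000, 512), nn.ReLU())  # "some_linear_layers"

    def forward(self, sat, street, device):
        street_feature = self.street_processor(street)
        sat_feature = self.sat_processor(sat)   # the multi-GPU error is raised inside this branch
        return self.fusion(torch.cat([street_feature, sat_feature], dim=1))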
I am on:
NVIDIA-SMI 450.51.06
Driver Version: 450.51.06
CUDA Version: 11.2
GPUs: 4 × Tesla P100 (16 GB each)
Torch: 1.8.1+cu111
Torchvision: 0.9.1+cu111
Python: 3.8.10
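The versions above come from the usual introspection calls (expected outputs shown as comments):

import torch, torchvision

print(torch.__version__, torchvision.__version__)   # 1.8.1+cu111  0.9.1+cu111
print(torch.version.cuda)                           # 11.1 (toolkit the wheel was built with)
print(torch.cuda.device_count())                    # 4
print(torch.cuda.get_device_name(0))                # e.g. Tesla P100-PCIE-16GB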
I am able to train Model A with batch size 16 on 4 GPUs, with the usage below:
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 34C P0 74W / 250W | 10133MiB / 16280MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000002:00:00.0 Off | 0 |
| N/A 33C P0 114W / 250W | 10093MiB / 16280MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000003:00:00.0 Off | 0 |
| N/A 32C P0 110W / 250W | 10093MiB / 16280MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000004:00:00.0 Off | 0 |
| N/A 35C P0 104W / 250W | 10093MiB / 16280MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
I am also able to fine-tune Model B with batch size 16 on 4 GPUs, with the usage below:
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 34C P0 133W / 250W | 3535MiB / 16280MiB | 83% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000002:00:00.0 Off | 0 |
| N/A 31C P0 131W / 250W | 3357MiB / 16280MiB | 50% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000003:00:00.0 Off | 0 |
| N/A 30C P0 79W / 250W | 1559MiB / 16280MiB | 50% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000004:00:00.0 Off | 0 |
| N/A 34C P0 67W / 250W | 1589MiB / 16280MiB | 57% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
However, when I train Model C (Model A + some_linear_layers + Model B) with batch size 16 on 4 GPUs, I get a RuntimeError:
Traceback (most recent call last):
File "train_selfvit.py", line 139, in <module>
main(config)
File "train_selfvit.py", line 78, in main
trainer.train()
File "/attentionKPTs/base/base_trainer.py", line 71, in train
result = self._train_epoch(epoch)
File "/attentionKPTs/trainer/trainer.py", line 70, in _train_epoch
street_cls, sat_cls, shift_out, angle_out = self.model(sat, street, self.device)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/attentionKPTs/model/model.py", line 327, in forward
sat_feature = self.sat_processor(sat_im_feature).unsqueeze(1)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 249, in forward
return self._forward_impl(x)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 240, in _forward_impl
x = self.layer4(x)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torchvision/models/resnet.py", line 74, in forward
out = self.conv2(out)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 399, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: misaligned address
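The traceback goes through nn.DataParallel, so the multi-GPU runs use the usual wrapper; a minimal sketch of that path, with placeholder inputs (the actual wrapping happens inside base_trainer.py / trainer.py):

import torch
import torch.nn as nn

# nn.DataParallel replicates the module and scatters the batch on every forward call.
device = torch.device("cuda:0")
model = nn.DataParallel(ModelC().to(device))        # ModelC as sketched above

sat = torch.randn(16, 3, 256, 256, device=device)   # placeholder inputs and shapes
street = torch.randn(16, 3, 256, 256, device=device)

# The error is raised inside a replica's forward and re-raised on device 0,
# matching "Caught RuntimeError in replica 0 on device 0" above.
out = model(sat, street, device)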
But I don’t think it is an OOM problem, because if I run Model C with batch size 4 or 6 on 1 GPU there is sufficient memory:
For bs=4
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 36C P0 100W / 250W | 10555MiB / 16280MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000002:00:00.0 Off | 0 |
| N/A 28C P0 25W / 250W | 2MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
For bs=6
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 35C P0 138W / 250W | 15143MiB / 16280MiB | 93% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000002:00:00.0 Off | 0 |
| N/A 28C P0 25W / 250W | 2MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
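For reference, the single-GPU runs above use the model without any DataParallel replication; a sketch of that case (how the GPU count is actually chosen is handled by the project's config, so treat the details as an assumption):

import torch

device = torch.device("cuda:0")
model = ModelC().to(device)                          # plain module, no nn.DataParallel wrapper

sat = torch.randn(4, 3, 256, 256, device=device)     # batch size 4 fits in the 16 GB card
street = torch.randn(4, 3, 256, 256, device=device)
out = model(sat, street, device)                      # runs fine, no misaligned-address error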
When I try to train with batch size 4 on 2 GPUs, I again get RuntimeError: CUDA error: misaligned address, this time when calculating the loss.
However, when I try batch size 4 on 4 GPUs, it runs, with low GPU usage:
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000001:00:00.0 Off | 0 |
| N/A 31C P0 47W / 250W | 3861MiB / 16280MiB | 67% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 00000002:00:00.0 Off | 0 |
| N/A 29C P0 65W / 250W | 3593MiB / 16280MiB | 65% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 00000003:00:00.0 Off | 0 |
| N/A 29C P0 77W / 250W | 3621MiB / 16280MiB | 65% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 00000004:00:00.0 Off | 0 |
| N/A 32C P0 69W / 250W | 3593MiB / 16280MiB | 43% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Any thoughts?