System hangs when using multiple GPUs for inference

I trained my model on a single-GPU machine. During training I wrapped the model with the torch.nn.DataParallel class and launched it with the command CUDA_VISIBLE_DEVICES=0 python train.py.

Training went fine, but when I try to run inference on this model with CUDA_VISIBLE_DEVICES=0,1 python test.py, the system hangs.

Inference works fine on a single GPU with CUDA_VISIBLE_DEVICES=0 python test.py.
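
For context, the inference-side wrapping looks roughly like the sketch below; the DummyOCR module and the random tensors are placeholders standing in for my actual model and inputs, not the real code:

import torch
import torch.nn as nn

class DummyOCR(nn.Module):
    # Placeholder for the real OCR network; only the call signature matters here.
    def forward(self, input, text, is_train=False):
        return input.mean(dim=(1, 2, 3), keepdim=True) + text.float().mean()

model = nn.DataParallel(DummyOCR()).cuda()   # same wrapping as during training
model.eval()

with torch.no_grad():
    image = torch.randn(4, 3, 32, 100, device="cuda")                 # placeholder batch
    text_for_pred = torch.zeros(4, 25, dtype=torch.long, device="cuda")
    preds = model(image, text_for_pred, is_train=False)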

Was your setup working before using multiple GPUs?
If so, do you know what changed, e.g. did you update the driver or anything else?
If not, are you able to communicate between the devices, e.g. by sending a tensor from one GPU to the other?

import torch

# Create a tensor on GPU 0 and copy it to GPU 1; if this hangs,
# device-to-device communication is the problem.
x = torch.randn(1, device="cuda:0")
y = x.to("cuda:1")
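
You could also ask the driver directly whether peer-to-peer access is reported between the two devices (a quick check via torch.cuda.can_device_access_peer):

import torch

# True means device 0 can directly access device 1's memory (and vice versa).
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))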

If the devices cannot communicate, could you check whether IOMMU is enabled and disable it if needed, as described here?

By the way, we recommend using DistributedDataParallel for better performance, as DataParallel is in maintenance mode.
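
A minimal single-node DDP inference sketch could look like the following; build_model() is a hypothetical stand-in for however you construct the network in test.py, and the script would be launched with torchrun --nproc_per_node=2 test_ddp.py:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)        # hypothetical model factory
    model = DDP(model, device_ids=[local_rank])
    model.eval()

    with torch.no_grad():
        # Each process works on its own shard of the data (e.g. via a
        # DistributedSampler); a dummy batch is used here for illustration.
        batch = torch.randn(8, 3, 224, 224, device=local_rank)
        output = model(batch)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()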

The setup was working well before using multiple GPUs, and the GPUs are able to communicate with each other.

If I do not set the environment variable CUDA_VISIBLE_DEVICES, it throws the error below:

ERROR:root:Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'input' and 'text'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/LPR/__init__.py", line 138, in feedforward
    batch_plates = self.license_plate.fetch_plate_number(batch, self.long_size)
  File "/opt/conda/lib/python3.8/site-packages/LPR/__init__.py", line 102, in fetch_plate_number
    detected_text_dict = self.ocr.fetch_text(img_association[item])
  File "/opt/conda/lib/python3.8/site-packages/LPR/OCR/__init__.py", line 172, in fetch_text
    preds = self.model(image, text_for_pred, is_train=False) #torch.unsqueeze(image,0)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
TypeError: Caught TypeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'input' and 'text'

DEBUG:root:Entity ended here with no valid Licence plate detection and id is: my_id

That’s an interesting error. Does a single-GPU run work? If so, could you try using DDP, as it’s the supported distributed backend?

It runs fine on a single-GPU machine, regardless of whether I set the environment variable CUDA_VISIBLE_DEVICES.