PyTorch DataParallel not using second GPU during Inference


I’m trying to run inference on a MMSR model. The system has two 2080Ti GPUs and I’m running PyTorch 1.1.0 on Ubuntu 18.04 with CUDA 10.

I have wrapped the model in nn.DataParallel and have ensured that both GPUs are visible:
os.environ["CUDA_VISIBLE_DEVICES"] = "0, 1"

torch.cuda.device_count() returned 2.
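For reference, this is roughly the visibility check I'm doing (a minimal sketch; note the env var must be set before torch initializes CUDA, and some setups are picky about whitespace in the list, so I'm showing it without the space):

```python
import os

# Restrict which GPUs CUDA sees. This must happen before the first CUDA
# call; "0,1" (no space) is the safest form of the comma-separated list.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

print(torch.cuda.device_count())  # should print 2 when both GPUs are visible
```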

Can anyone help me out? The inference script just goes CUDA OOM and doesn’t use the second GPU.

Code Snippet

model.load_state_dict(torch.load(model_path), strict=True)
model = nn.DataParallel(model, device_ids=[0, 1])  # note: device_ids takes ints, not strings
model = model.cuda()

I’ve looked at all the tutorials on the website and have used DataParallel before, but I haven’t come across this behaviour. Not sure what I’m doing wrong.

For some more context, I’m trying to load one batch at a time of 5 pretty large images (no way around this). I have a vague feeling I can’t use this approach, so any suggestions would be appreciated.

Thank you so much.

Since your device runs out of memory, you would need to reduce the batch size.
Also, for inference you could wrap the code in a with torch.no_grad() block to save some memory.
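Something like this (a minimal sketch with a placeholder nn.Linear standing in for the actual model):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4).eval()  # placeholder for the real model
x = torch.randn(1, 16)

# Inside no_grad, autograd records no graph and frees intermediate
# activations immediately, which reduces peak memory during inference.
with torch.no_grad():
    out = model(x)

print(out.requires_grad)  # False: no computation graph was built
```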

Thank you for the reply.
I am utilizing torch.no_grad() and I have to load 5 images per run of the algorithm.

The question was more around why PyTorch is not utilizing DataParallel as intended.

This is for one (1) batch.

This is more a case of the model itself being large; would it require an explicit model parallel sort of optimization?

Does this mean you are using a batch size of 1?
If so, nn.DataParallel cannot be used, as the batch will be chunked in the batch dimension (dim0) and each chunk will be scattered to the corresponding device.
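You can see the problem by chunking such a batch along dim0 yourself, which is essentially what the scatter step does (a small sketch with made-up spatial sizes):

```python
import torch

# Batch size 1 holding 5 sub-images, like [1, 5, 3, H, W].
batch = torch.randn(1, 5, 3, 8, 8)

# nn.DataParallel splits the input along dim 0, one chunk per device.
# With a batch size of 1 there is only one chunk to hand out.
chunks = torch.chunk(batch, 2, dim=0)
print(len(chunks))  # 1 -> the second GPU would receive no work
```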

If the model is too large for a single GPU, you could use model parallel, yes.

However, I’m currently unsure which use cases work.
What’s the largest batch size for a single GPU and what batch sizes are you using for the data parallel approach?

I’m terribly sorry for not noticing this message earlier.

The largest batch size is 1.
But one batch contains 5 images at a certain resolution, say [1920, 1080],
so the shape of one batch is [1, 5, 3, 1920, 1080].

Downsampling the images is not possible here due to the objective, and all 5 images are required per forward pass. The only way to address this problem seems to be to increase the GPU VRAM, but I have already tried using a Tesla V100.

I was wondering, if considering there are two GPUs with say 16 GB + 16 GB, is there a way to distribute inference across both of them (via ModelParallel) or otherwise, to potentially process them this way?

And is there any other way to deal with this? I was interested if this was possible.

Yes, this would be possible. Here is a simple example of model sharding.
Basically, you can push submodules to specific devices and would have to make sure to push the activations to the right device in the forward method.
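In sketch form it looks like this (the two conv layers are hypothetical stand-ins, not the MMSR architecture, and the code falls back to CPU when two GPUs aren't present):

```python
import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    """Toy two-stage model split across two devices."""

    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.stage1 = nn.Conv2d(3, 8, 3, padding=1).to(dev0)
        self.stage2 = nn.Conv2d(8, 3, 3, padding=1).to(dev1)

    def forward(self, x):
        x = self.stage1(x.to(self.dev0))
        # Move the intermediate activation to the second device by hand.
        return self.stage2(x.to(self.dev1))

# Use both GPUs when available; fall back to CPU so the sketch still runs.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

net = ShardedNet(dev0, dev1).eval()
with torch.no_grad():
    out = net(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 3, 32, 32])
```

With 16 GB + 16 GB this lets each half of the model keep its parameters and activations on its own card, at the cost of the device-to-device copy between stages.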

Let me know if you get stuck or need more information.

The example looks great, I’ll try it out and get back to you!

Thank you :smiley: