How do you use DataParallel on a nn.functional?

My system has 4 GPUs, and I can successfully train a model (an nn.Module) using DataParallel. I have verified with nvidia-smi that all four GPUs are roughly equally utilized (within a few percentage points), so I know DataParallel and the GPUs work on my system.

However, I cannot figure out how to efficiently use more than one GPU for a function in nn.functional. After training, I need to run nn.functional.cosine_similarity. I can run this function on a single GPU (verified with nvidia-smi), but all of my attempts at using more than one GPU have failed.
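
For reference, the single-GPU version is essentially just a direct call, something like this (the shapes here are illustrative, not my actual data):

import torch
import torch.nn.functional as F

device = torch.device('cuda')

# Illustrative shapes only; the real tensors are much larger.
lookup_tensor = torch.randn(64, 32, 32, device=device)
test_tensor = torch.randn(64, 32, 32, device=device)

# Cosine similarity along dim=2; this runs fine, but only on one GPU.
similarities = F.cosine_similarity(lookup_tensor, test_tensor, dim=2)
print(similarities.shape)  # torch.Size([64, 32])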

Here is one such multi-GPU attempt. It ‘works’ in that it produces the correct result, but it takes about 2.5× longer than a single GPU, and the 4 GPUs are not used evenly: one GPU carries all the load while the rest sit at 1%. I used the same batch size here as I use in training, and I also experimented with larger batch sizes with no change in the result.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import DataParallel

class CosineSimilarityModel(nn.Module):
    def forward(self, inputs):
        # Unpack the single scattered argument into the two tensors and the dim.
        tensor1, tensor2, dim = inputs
        return F.cosine_similarity(tensor1, tensor2, dim=dim)

class TestClass:
    def __init__(self, device):
        csm = CosineSimilarityModel()
        csm.to(device)
        self.cosine_similarity = DataParallel(csm)

    def get_similarities(self, lookup_tensor, test_tensor):
        similarities = self.cosine_similarity((lookup_tensor, test_tensor, 2))
        return similarities

Does anyone have a working example of what to wrap the F.cosine_similarity call in (passing two tensors and the dimension) so that the GPUs are utilized evenly and I get the expected near-4× speedup? Are there any mistakes in my attempt above?

I see that DataParallel works as expected here:

import torch
from torch import nn
import torch.nn.functional as F
from torch.nn import DataParallel

class CosineSimilarityModel(nn.Module):
    def forward(self, inputs):
        tensor1, tensor2, dim = inputs
        print(tensor1.shape, tensor2.shape, dim)
        print(tensor1.device, tensor2.device)
        return F.cosine_similarity(tensor1, tensor2, dim=dim)

class TestClass:
    def __init__(self, device):
        csm = CosineSimilarityModel()
        csm.to(device)
        self.cosine_similarity = DataParallel(csm)

    def get_similarities(self, lookup_tensor, test_tensor):
        similarities = self.cosine_similarity((lookup_tensor, test_tensor, 2))
        return similarities

t = TestClass('cuda')
t1 = torch.randn(64, 32, 32, device='cuda')
t2 = torch.randn(64, 32, 32, device='cuda')
t.get_similarities(t1, t2)

produces

torch.Size([16, 32, 32]) torch.Size([16, 32, 32]) 2
cuda:0 cuda:0
torch.Size([16, 32, 32]) torch.Size([16, 32, 32]) 2
cuda:1 cuda:1
torch.Size([16, 32, 32]) torch.Size([16, 32, 32]) 2
cuda:2 cuda:2
torch.Size([16, 32, 32]) torch.Size([16, 32, 32]) 2
cuda:3 cuda:3

Note that you might get somewhat better performance with distributed data parallel instead: DistributedDataParallel — PyTorch 2.0 documentation
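
For an inference-only workload like this, the relevant idea is one process per GPU, each handling its own slice of the batch. Here is a bare-bones, untested sketch of that pattern; since the module has no parameters there is nothing for DDP itself to synchronize, and the rendezvous port and tensor shapes are placeholders you would adjust:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.functional as F

def worker(rank, world_size, t1, t2):
    # Placeholder rendezvous settings; adjust for your environment.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Each process takes its own slice of the batch and computes locally.
    chunk1 = t1.chunk(world_size, dim=0)[rank].cuda(rank)
    chunk2 = t2.chunk(world_size, dim=0)[rank].cuda(rank)
    sims = F.cosine_similarity(chunk1, chunk2, dim=2)
    print(rank, sims.shape)

    # Results would normally be collected across ranks (e.g. dist.all_gather);
    # omitted here to keep the sketch short.
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    t1 = torch.randn(64, 32, 32)  # CPU tensors; each worker moves its slice
    t2 = torch.randn(64, 32, 32)
    mp.spawn(worker, args=(world_size, t1, t2), nprocs=world_size)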

However, I'm not sure you would expect to see a significant speedup, as cosine similarity is likely a bandwidth-bound operation (the amount of computation relative to the data moved is very small). It could be that, for such a small operation, the data transfer time dominates the end-to-end runtime.
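
A quick way to check where the time goes is to time the bare single-GPU call against the DataParallel wrapper on identical tensors. A rough sketch (the shapes are illustrative; remember to synchronize before reading the clock):

import time
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import DataParallel

class CosineSimilarityModel(nn.Module):
    def forward(self, inputs):
        tensor1, tensor2, dim = inputs
        return F.cosine_similarity(tensor1, tensor2, dim=dim)

def timed(fn, *args):
    # CUDA kernels launch asynchronously, so synchronize around the timer.
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

t1 = torch.randn(4096, 32, 32, device='cuda')  # illustrative size
t2 = torch.randn(4096, 32, 32, device='cuda')

dp = DataParallel(CosineSimilarityModel().to('cuda'))
dp((t1, t2, 2))  # warm-up so one-time setup is not counted

_, t_single = timed(F.cosine_similarity, t1, t2, 2)
_, t_dp = timed(dp, (t1, t2, 2))
print(f"single GPU: {t_single:.4f}s   DataParallel: {t_dp:.4f}s")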

Thank you for verifying that my code works. In my experience it produces the same results as calling F.cosine_similarity directly; it just takes much longer. I have timed and profiled my code: with one GPU, calculating the cosine similarities takes a little over 1200 seconds, but with the code example I posted and 4 GPUs it takes more than 3000 seconds. As an aside, training my module for one epoch on the exact same number of records takes only 260 seconds!

As I mentioned earlier, with my code example only one GPU gets any real load; the other 3 GPUs sit at 1% utilization. In the profile I see a lot of time spent in gather, scatter, scatter_gather, and similar overhead operations. This is puzzling, because when I use DataParallel to train my module, the profiled overhead is two orders of magnitude lower.
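
For anyone who wants to reproduce that kind of comparison, a profile along these lines shows the scatter/gather entries; this is a simplified sketch rather than my actual pipeline, and the tensor sizes are placeholders:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import DataParallel
from torch.profiler import profile, ProfilerActivity

class CosineSimilarityModel(nn.Module):
    def forward(self, inputs):
        tensor1, tensor2, dim = inputs
        return F.cosine_similarity(tensor1, tensor2, dim=dim)

dp = DataParallel(CosineSimilarityModel().to('cuda'))
t1 = torch.randn(4096, 32, 32, device='cuda')  # placeholder size
t2 = torch.randn(4096, 32, 32, device='cuda')

dp((t1, t2, 2))  # warm-up so one-time CUDA initialization is not profiled

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    dp((t1, t2, 2))

# The scatter/gather/broadcast rows are the DataParallel overhead.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))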

Do you have any insight into why only one GPU is used when 4 are available? Or into why the overhead operations cost so much more for the cosine similarity call than for training the model, when the exact same records are provided?

I would prefer to stick with DataParallel rather than DistributedDataParallel. I reviewed that documentation earlier and it is daunting! It is like a 3-credit graduate class - I just want a couple of lines of code that make PyTorch use all the GPUs.

So the puzzles to crack are: why is the overhead so much higher than when training with the same data, and why aren't the other GPUs actually being used?