My system has 4 GPUs and I can successfully train a model (nn.Module) using DataParallel. I have verified with nvidia-smi that all four GPUs are utilized equally (to within a few percentage points), so DataParallel and the GPUs work on my system.

However, I cannot figure out how to efficiently use more than one GPU for a function in nn.functional. After training, I need to run nn.functional.cosine_similarity. I can run this function on a single GPU (verified with nvidia-smi), but all attempts at using more than one GPU have failed.

Here is one such attempt. It "works" in that it produces the correct result, but it takes 2.5x longer than on a single GPU and the 4 GPUs are not used evenly: one GPU carries all the load and the rest sit at 1%. I used the same batch size for this function as I use in training, and I also experimented with larger batch sizes, with no change in the result.

```
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import DataParallel


class CosineSimilarityModel(nn.Module):
    def forward(self, inputs):
        tensor1, tensor2, dim = inputs
        return F.cosine_similarity(tensor1, tensor2, dim=dim)


class TestClass:
    def __init__(self, device):
        csm = CosineSimilarityModel()
        csm.to(device)
        # DataParallel splits tensor inputs along dim 0 across the visible GPUs
        self.cosine_similarity = DataParallel(csm)

    def get_similarities(self, lookup_tensor, test_tensor):
        similarities = self.cosine_similarity((lookup_tensor, test_tensor, 2))
        return similarities
```

Does anyone have a working example of what to wrap the F.cosine_similarity call in (passing two tensors and the dimension) to get even GPU utilization and the expected near-4x speedup? Are there any mistakes in my attempt above?
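For concreteness, the kind of wrapper being asked about might look like the sketch below. It is not a verified solution and has not been benchmarked. The helper name `sharded_cosine_similarity` and its `devices` parameter are hypothetical. It assumes `tensor1` can be split along dimension 0, that `tensor2` broadcasts against every chunk (for example shapes `(M, 1, D)` and `(1, N, D)` with `dim=2`), and that `dim` is not 0. It falls back to the CPU when no GPU is visible.

```python
import torch
import torch.nn.functional as F


def sharded_cosine_similarity(tensor1, tensor2, dim, devices=None):
    """Hypothetical helper: split tensor1 along dim 0, one chunk per device.

    Assumes tensor2 broadcasts against each chunk of tensor1 and dim != 0.
    """
    if devices is None:
        n = torch.cuda.device_count()
        # Use every visible GPU, or the CPU if there are none.
        devices = [torch.device(f"cuda:{i}") for i in range(n)] or [torch.device("cpu")]

    # torch.chunk may return fewer chunks than devices for small inputs;
    # zip() below then simply leaves the extra devices unused.
    chunks = torch.chunk(tensor1, len(devices), dim=0)

    outputs = []
    for chunk, device in zip(chunks, devices):
        # CUDA kernels are launched asynchronously with respect to the host,
        # so this loop can queue work on each device without waiting for the
        # previous one to finish computing.
        a = chunk.to(device, non_blocking=True)
        b = tensor2.to(device, non_blocking=True)
        outputs.append(F.cosine_similarity(a, b, dim=dim))

    # Gather the partial results back on tensor1's original device.
    return torch.cat([out.to(tensor1.device) for out in outputs], dim=0)
```

In `get_similarities` above, a call such as `sharded_cosine_similarity(test_tensor, lookup_tensor, dim=2)` would take the place of the DataParallel wrapper, provided the shapes match the stated assumption. Copying `tensor2` to every device on each call has a cost of its own, so any speedup depends on how large the tensors are relative to that transfer.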