PyTorch is slower on V100 than on T4

Hi, I am running experiments on an AWS g4dn.metal instance (8 T4 GPUs) and a p3.16xlarge instance (8 V100 GPUs).

I use the same OS (Ubuntu 18) and the same conda environment (Python 3.7, PyTorch 1.10, cudatoolkit 11.0) on both. I ran into a weird problem:

When I run network A (conv and FC layers), the T4 is more than 2x faster than the V100. When I run network B (conv, FC, and an attention module), the V100 is about 2x faster than the T4. Network B has more parameters than network A (about 1.5x as many).

Why does this happen? I thought the V100 should always be faster than the T4. I tried the torch.backends.cudnn.benchmark and torch.backends.cudnn.enabled flags to make the V100 runs faster, but no luck. Can anyone help me? Thank you.
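For reference, this is roughly how I set those flags (a minimal sketch; the rest of my training script is omitted):

import torch

# cuDNN flags I tried in order to speed up the V100 runs
torch.backends.cudnn.enabled = True     # use cuDNN kernels (this is the default)
torch.backends.cudnn.benchmark = True   # let cuDNN auto-tune conv algorithms for fixed input shapes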

Could you share more details such as the benchmark code you are running?

However, I’m not sure there are any guarantees that V100 will outperform T4 on all benchmarks, considering that they are of similar compute capability (7.0 and 7.5 respectively) and T4 potentially has higher boost clocks than V100 on paper. The latter point could be relevant if e.g., your workload is not fully saturating the compute/memory bandwidth of the GPUs. The intuition is that V100 is a “wider” GPU than T4, but if your workload is small, there might not be enough parallelism to sufficiently leverage the hardware.

Thanks for the reply.

It is hard for me to share all the code, but I can give a brief description here. I have audio and video inputs.

For the audio input, I use a CNN subnetwork with 4 CNN blocks. Each block consists of a Conv2d layer, a BatchNorm layer, a ReLU activation, and a max-pool layer.

For the video input, I also use a CNN subnetwork with 4 CNN blocks. Each block consists of a Conv3d layer, a BatchNorm layer, a ReLU, and a max-pool layer.

The outputs of these two subnetworks are concatenated and fed to 2 FC layers. That's it.
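Roughly, the structure looks like this (a minimal sketch; the channel counts, kernel sizes, and the pooling before the FC layers are placeholders, not my exact values):

import torch
import torch.nn as nn

def conv2d_block(in_ch, out_ch):
    # One audio block: Conv2d -> BatchNorm -> ReLU -> MaxPool
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

def conv3d_block(in_ch, out_ch):
    # One video block: Conv3d -> BatchNorm -> ReLU -> MaxPool
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool3d(2),
    )

class AudioVideoNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 4 Conv2d blocks for the audio input, 4 Conv3d blocks for the video input
        self.audio = nn.Sequential(
            conv2d_block(1, 16), conv2d_block(16, 32),
            conv2d_block(32, 64), conv2d_block(64, 128),
        )
        self.video = nn.Sequential(
            conv3d_block(3, 16), conv3d_block(16, 32),
            conv3d_block(32, 64), conv3d_block(64, 128),
        )
        # Pool each branch to a fixed-size vector, concatenate, then 2 FC layers
        self.audio_pool = nn.AdaptiveAvgPool2d(1)
        self.video_pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(128 + 128, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, audio, video):
        # audio: (N, 1, H, W) spectrogram-like input; video: (N, 3, T, H, W)
        a = self.audio_pool(self.audio(audio)).flatten(1)
        v = self.video_pool(self.video(video)).flatten(1)
        return self.fc(torch.cat([a, v], dim=1))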

BTW, I am using data parallelism as follows:
model = nn.DataParallel(model).to(device)

I also checked nvidia-smi. It shows all 8 GPUs getting work periodically: GPU utilization sits around 10-20%, then drops to 0, then after a while goes back up to 10-20%, and so on in a cycle. The T4 and V100 machines show the same pattern, but the T4 run is much faster (about half the total running time).

If the GPU utilization is that low, have you looked into removing other potential bottlenecks during your benchmarking (e.g., data loading, by timing the model on random tensors instead; a minimal sketch is below)?

Otherwise it would suggest that your model is not large/wide enough to fully utilize the GPU.
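For example, something along these lines times only the model on random GPU tensors, which takes data loading out of the picture (the tiny stand-in model and the shapes here are placeholders; substitute your actual model and input sizes):

import time
import torch
import torch.nn as nn

device = torch.device("cuda")

# Stand-in model and input; replace with your real model and shapes
model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
).to(device)
x = torch.randn(64, 1, 128, 128, device=device)  # random input, no DataLoader

with torch.no_grad():
    # Warm-up (also lets cudnn.benchmark pick algorithms, if enabled)
    for _ in range(10):
        model(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()  # make sure all queued GPU work has finished

print(f"avg forward time: {(time.time() - start) / 100 * 1000:.2f} ms")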

Thanks… but why is the T4 so fast? It is about 1.5x to 2x faster than the V100.

It would be difficult to say for sure without knowing the exact workload including e.g., the shapes. But from a high level, both GPU architectures are from roughly the same period and manufactured on TSMC 12nm—so when the workload is not large enough to fully utilize the parallelism available, it may just come down to clock rate, and the T4 has a significantly higher clock rate than V100.

As a contrived example of something that doesn’t have enough parallelism, consider a trivial workload that just adds two scalar values; would it be expected that the V100 is faster than T4 in this case?
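As a contrived sketch (not a real benchmark), each iteration below launches one tiny elementwise kernel, so the V100's extra SMs provide no benefit and the per-iteration cost is dominated by per-kernel overhead and clock rate rather than by GPU width:

import time
import torch

device = torch.device("cuda")
a = torch.randn(1, device=device)  # a single scalar value
b = torch.randn(1, device=device)

# Warm-up
for _ in range(100):
    c = a + b
torch.cuda.synchronize()

start = time.time()
for _ in range(10000):
    c = a + b                      # one tiny elementwise kernel per iteration
torch.cuda.synchronize()
print(f"{(time.time() - start) / 10000 * 1e6:.2f} us per add")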