Hi, I am running experiments on AWS g4dn.metal (8 T4 GPUs) and p3.16xlarge (8 V100 GPUs) instances.
I use the same OS (Ubuntu 18), the same conda environment (Python 3.7, PyTorch 1.10, cudatoolkit 11.0), and I am getting a weird result:
When I run network A (conv and FC layers), the T4 is more than 2 times as fast as the V100. When I run network B (conv, FC, and an attention module), the V100 is about 2 times as fast as the T4. Network B has considerably more parameters than network A (about 1.5 times as many).
Why does this happen? I thought the V100 should always be faster than the T4. I tried the torch.backends.cudnn.benchmark and torch.backends.cudnn.enabled flags to speed things up on the V100, but no luck. Can anyone help me? Thank you.
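For reference, this is how I set those flags at the top of my script (nothing else fancy):

```python
import torch

# cuDNN is enabled by default; benchmark mode autotunes the conv algorithm
# per input shape, which only helps when shapes stay fixed across iterations.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```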
Could you share more details such as the benchmark code you are running?
However, I’m not sure there are any guarantees that V100 will outperform T4 on all benchmarks, considering that they are of similar compute capability (7.0 and 7.5 respectively) and T4 potentially has higher boost clocks than V100 on paper. The latter point could be relevant if e.g., your workload is not fully saturating the compute/memory bandwidth of the GPUs. The intuition is that V100 is a “wider” GPU than T4, but if your workload is small, there might not be enough parallelism to sufficiently leverage the hardware.
BTW, I am parallelizing the model as follows:
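A minimal sketch of the data-parallel setup (the real network is omitted here; the model and shapes below are just placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the real network A / B.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Replicate across all visible GPUs; the input batch is scattered along dim 0.
model = nn.DataParallel(model)
if torch.cuda.is_available():
    model = model.cuda()

x = torch.randn(32, 512)
if torch.cuda.is_available():
    x = x.cuda()
out = model(x)  # shape: (32, 10)
```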
I also checked nvidia-smi. It shows all 8 GPUs have workload periodically: GPU utilization sits around 10-20%, then drops to 0, and after a while goes back to 10-20%, in a repeating cycle. Both the T4 and V100 instances show the same pattern, but the T4 is much faster (about half the running time).
It would be difficult to say for sure without knowing the exact workload, including e.g., the shapes. But from a high level, both GPU architectures are from roughly the same period and manufactured on TSMC 12nm, so when the workload is not large enough to fully utilize the available parallelism, it may just come down to clock rate, and the T4 has a significantly higher clock rate than the V100.
As a contrived example of something that doesn’t have enough parallelism, consider a trivial workload that just adds two scalar values; would it be expected that the V100 is faster than T4 in this case?
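To make that concrete, here is a rough timing sketch (the sizes and iteration counts are arbitrary choices of mine, and it falls back to CPU when no GPU is present). The tiny add is dominated by launch overhead and clock rate, while the large matmul actually exercises the SMs, so only the latter should favor the "wider" GPU:

```python
import time
import torch

def time_op(fn, iters):
    """Average wall-clock seconds per call, syncing the GPU if one is used."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"

# Tiny op: adding two scalars -- almost pure launch overhead, no parallelism.
a = torch.randn(1, device=device)
b = torch.randn(1, device=device)
tiny = time_op(lambda: a + b, iters=100)

# Large op: a 2048x2048 matmul -- enough work to keep a wide GPU busy.
x = torch.randn(2048, 2048, device=device)
big = time_op(lambda: x @ x, iters=5)

print(f"tiny add: {tiny * 1e6:.1f} us/iter, large matmul: {big * 1e3:.2f} ms/iter")
```

If you run this on both instances, I would expect the tiny-add timings to track clock rate (possibly favoring the T4) and the matmul timings to favor the V100.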