I have a Flask server running inference for several different models on two otherwise identical computers that differ only in GPU: one has an RTX 3070, the other a 2070 SUPER. Both machines have an i9-10900X, NVMe SSDs, and the same RAM. Both are up to date on every library and run PyTorch 1.10.0 with CUDA 11.3 and cuDNN 8.2.0. In short, everything except the GPU is identical.
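For reference, I checked the software stack on both machines with a quick script like this (output matches on both except for the GPU name):

```python
import torch

# Report the versions that should match on both machines
# (PyTorch 1.10.0, CUDA 11.3, cuDNN 8.2.0 in my setup).
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("cudnn:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))
```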
However, the RTX 3070 performs significantly slower and uses much more VRAM (~7.6 GB) than the 2070 SUPER (~6 GB). The 2070 SUPER can handle 50% more load and still perform better.
Are there any known performance issues with the 3070?
Note: I use Transformer and attention networks, which I've heard should run faster on 30-series GPUs.
Note 2: I also use half precision (FP16) for the CNN models. Could the issue be related to FP16?
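The FP16 usage is nothing exotic; stripped down, it looks roughly like this (the small `nn.Sequential` is just a placeholder for the real CNNs, and the CPU fallback is only so the snippet runs anywhere):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model standing in for the real, larger CNNs.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()).to(device).eval()
if device.type == "cuda":
    model.half()  # FP16 weights, as in my setup

x = torch.randn(1, 3, 64, 64, device=device)
if device.type == "cuda":
    x = x.half()  # FP16 inputs to match the FP16 weights

with torch.no_grad():
    y = model(x)
print(y.dtype, y.shape)
```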
Note 3: I regularly clear the GPU cache to release memory; otherwise VRAM keeps growing under heavy load.
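The cache clearing is essentially just `torch.cuda.empty_cache()`, wrapped like this (the memory-stat prints are only for logging; note this returns cached blocks to the driver but does not free live tensors):

```python
import torch

def release_cached_memory():
    # empty_cache() returns unused cached blocks to the driver;
    # it does not free tensors that are still referenced.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        print("allocated:", torch.cuda.memory_allocated())
        print("reserved: ", torch.cuda.memory_reserved())

release_cached_memory()
```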
Note 4: Flask runs multithreaded, if that helps.
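A minimal sketch of the serving pattern, in case the threading matters (`run_inference` is a placeholder for the real model call; the lock is optional and only illustrates how GPU access could be serialized across request threads):

```python
import threading
from flask import Flask, jsonify, request

app = Flask(__name__)

# Optional: serialize model calls across Flask's request threads.
infer_lock = threading.Lock()

def run_inference(payload):
    # Placeholder for the real PyTorch model call.
    return {"ok": True}

@app.route("/predict", methods=["POST"])
def predict():
    with infer_lock:
        return jsonify(run_inference(request.get_json(silent=True)))

# To serve: app.run(threaded=True)  # threaded is the dev-server default since Flask 1.0
```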
Note 5: The issue is impossible to reproduce in isolation due to the complexity of the code.