Performance issue of RTX 3070 compared to 2070 SUPER


I have a Flask server running inference for several different models on two identical computers that differ only in GPU: one has an RTX 3070 and the other a 2070 SUPER. Both computers have an i9-10900X, NVMe SSDs, and the same RAM. Both are up to date with every library, running PyTorch 1.10.0 with CUDA 11.3 and cuDNN 8.2.0. In summary, everything else is exactly the same.

However, the RTX 3070 performs significantly slower and uses much more VRAM (~7.6 GB) than the 2070 SUPER (~6 GB). The 2070 SUPER can handle 50% more load and still perform better.

Are there any known performance issues with the RTX 3070?

Note: I use Transformer and attention networks, which I have heard should run faster on 30xx-series GPUs.
Note 2: I use half precision for the CNN models as well. Can the issue be related to fp16?
Note 3: I regularly clear the GPU cache to release some memory; otherwise VRAM keeps bloating under heavy load.
Note 4: Flask runs multithreaded, if that helps.
Note 5: This issue is impossible to reproduce due to the complexity of the code.
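For context on Notes 2 and 3, here is a minimal sketch of the inference pattern described above (the model and shapes are placeholders, not the actual server code): a small CNN run in half precision, with the cache released afterwards. It falls back to CPU and full precision when no GPU is present.

```python
import torch

# Pick the GPU when available; fp16 conv kernels are effectively a GPU feature,
# so stay in fp32 on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_fp16 = device.type == "cuda"

# Placeholder CNN standing in for the real models.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 4),
).to(device).eval()
if use_fp16:
    model.half()

x = torch.randn(1, 3, 64, 64, device=device)
if use_fp16:
    x = x.half()

with torch.no_grad():
    out = model(x)
print(tuple(out.shape))  # (1, 4)

# "Clearing the GPU cache" as in Note 3: this hands cached blocks back to the
# driver, so subsequent allocations must go through cudaMalloc again.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```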


I’m not sure what “Note 3” means exactly or how you are releasing the memory, but if you are clearing the cache, that would hurt performance, since subsequent allocations have to go through expensive cudaMalloc calls again.
Note 5 makes it quite hard to give you a valid answer. Would it be possible to get a model, with the input shapes, that shows the performance difference?
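To share something comparable across the two machines, a self-contained timing script along these lines could work (the Transformer model and input shapes here are hypothetical placeholders, not the poster's actual workload):

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder attention model; substitute the real architecture and shapes.
model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
).to(device).eval()
x = torch.randn(8, 32, 64, device=device)  # (batch, seq_len, d_model)

with torch.no_grad():
    model(x)  # warm-up pass so one-time initialization is excluded
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        out = model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()  # CUDA kernels run async; wait before timing
    elapsed = time.perf_counter() - start

print(f"{elapsed / 10 * 1000:.2f} ms per forward pass, output {tuple(out.shape)}")
```

Running the same script on both boxes would isolate the GPUs from the Flask/threading layer.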

If clearing the cache were hurting performance, wouldn’t it decrease performance on both computers? They run the same code at the same time, and there is still a huge difference: the 3070 allocates more VRAM, which in turn reduces inference speed because less GPU memory remains available.

I realized that when the GPU VRAM is full while using the Flask server, any incoming request that needs more VRAM gets queued, which reduces inference speed.

So the issue is that the RTX 3070 allocates more VRAM than the 2070 SUPER.
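One way to quantify that claim on both machines is to compare the allocator statistics PyTorch exposes (this is a diagnostic sketch I'm suggesting, not part of the original server): "allocated" is memory held by live tensors, while "reserved" includes blocks the caching allocator keeps around for reuse.

```python
import torch

if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()    # bytes reserved by the caching allocator
    print(f"allocated: {allocated / 2**20:.1f} MiB, "
          f"reserved: {reserved / 2**20:.1f} MiB")
    print(torch.cuda.memory_summary())         # detailed per-pool breakdown
else:
    print("No CUDA device; nothing to report.")
```

If "reserved" differs much more than "allocated" between the two cards, the gap is in allocator caching rather than the models themselves.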