Very poor batch sizes on RTX 3090, good batch sizes on RTX Titan (both 24 GB cards)

I have a local machine with an RTX 3090 and also have access to a GPU server with an RTX Titan. Both cards have 24 GB of VRAM, and both machines have identical amounts of RAM and the same CPUs.

I am running exactly the same code with the same versions of PyTorch (nightly) and all my packages installed, and yet the Titan can handle a batch size of 32, whilst the 3090 cannot even reach 8 before I get a CUDA out of memory error.

Any suggestions greatly appreciated (first-time poster, so apologies if I am doing something wrong).
Many thanks

That shouldn’t be an issue. Have you checked how much memory the Python process consumes during the batch size 8 run (via nvidia-smi, for example)?
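To complement nvidia-smi, the same check can be done from inside the training script. Below is a minimal sketch, assuming a PyTorch build recent enough to provide `torch.cuda.mem_get_info`; the helper names `to_mib` and `report_cuda_memory` are made up for illustration. It reports what the driver says is free on the card and the peak memory PyTorch itself has used so far.

```python
try:
    import torch
except ImportError:  # lets the sketch load on a machine without PyTorch
    torch = None


def to_mib(num_bytes):
    """Convert a byte count to MiB (the unit nvidia-smi displays)."""
    return num_bytes / (1024 ** 2)


def report_cuda_memory(device=0):
    """Return free/total device memory and PyTorch's peak usage, in MiB.

    Returns None when PyTorch or a CUDA device is not available.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    # Free and total memory as seen by the driver; other processes
    # (for example a desktop session) reduce the free figure.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        "free_mib": to_mib(free_bytes),
        "total_mib": to_mib(total_bytes),
        # Peak memory occupied by tensors in this process.
        "peak_allocated_mib": to_mib(torch.cuda.max_memory_allocated(device)),
        # Peak memory held by PyTorch's caching allocator in this process.
        "peak_reserved_mib": to_mib(torch.cuda.max_memory_reserved(device)),
    }


if __name__ == "__main__":
    # Call this after at least one training step so the peaks are meaningful.
    print(report_cuda_memory())
```

Running this on both machines after one step at the same batch size should show whether the two cards start with a different amount of free memory or whether the process itself peaks at different values.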

Hi @my3bikaht,

Many thanks for replying. I tried installing the latest NVIDIA drivers (495.44) from the latest feature branch, which seems to have increased my maximum batch size to 16. That is at least an improvement and somewhat closer to the cluster.

A batch size of 16 uses 16.8 GB. A batch size of 32 gets an out of memory error locally, but uses 23888 MB on the cluster. I now suspect that, as the cluster does not have any display outputs or other processes, it has more available VRAM than my local machine, hence the reduced batch size locally.

It might be useful to show the output of nvidia-smi, which would indicate memory usage by the desktop environment and any other processes.
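If it helps to compare the two machines side by side, here is a rough sketch that captures the per-GPU used and total memory from nvidia-smi before training starts. It assumes `nvidia-smi` is on the PATH; the function names `parse_gpu_memory` and `query_gpu_memory` are just illustrative. The per-process breakdown (Xorg, browser, etc.) is still easiest to read from the plain `nvidia-smi` table.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' rows (in MiB) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"used_mib": used, "total_mib": total, "free_mib": total - used})
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory of every GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    try:
        # Run this while no training job is active to see the idle baseline.
        for index, gpu in enumerate(query_gpu_memory()):
            print(f"GPU {index}: {gpu}")
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        print(f"Could not query nvidia-smi: {error}")
```

Whatever shows up as used at idle on the local machine is memory the training run cannot get, which a headless server would not lose.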