As long as you can fit a batch size of 1 on the GPU, you can use gradient accumulation to reach a larger effective batch size.
See here.
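In case it helps, here is a minimal sketch of gradient accumulation. The tiny model, the random data, and `accum_steps` are placeholders I made up for illustration, so swap in your own model and DataLoader.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions for illustration): a tiny conv net and random data.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

accum_steps = 4  # effective batch size = accum_steps * micro-batch size (1 here)
micro_batches = [
    (torch.randn(1, 3, 32, 32), torch.randint(0, 10, (1,))) for _ in range(8)
]

optimizer_steps = 0
optimizer.zero_grad()
for i, (images, labels) in enumerate(micro_batches, start=1):
    loss = criterion(model(images), labels)
    # Divide by accum_steps so the summed gradients match the mean loss
    # over the effective batch. backward() adds into .grad on every call.
    (loss / accum_steps).backward()
    if i % accum_steps == 0:
        optimizer.step()       # one weight update per accum_steps micro-batches
        optimizer.zero_grad()  # clear the accumulated gradients
        optimizer_steps += 1
```

One caveat: layers such as BatchNorm still compute their statistics per micro-batch, so this is not exactly equivalent to training with a genuinely larger batch.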
As for estimating how much GPU memory you need, it’s a bit complicated for models that use convolutions. The total depends on your image size, the number and size of layers, the dtype, the kernel size, the optimizer, whether you run in model.train() or model.eval(), and the batch size.
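To give a feel for how those factors interact, here is a back-of-the-envelope sketch for a plain stack of conv layers. The helper names and the example layer list are made up for illustration, and the result is only a rough lower bound: it ignores the CUDA context, cuDNN workspaces, and allocator fragmentation. The inference branch also assumes you run under `torch.no_grad()`, so intermediate outputs are not kept.

```python
def conv2d_params(in_ch, out_ch, kernel, bias=True):
    """Number of weights (plus biases) in a square-kernel 2D convolution."""
    return out_ch * in_ch * kernel * kernel + (out_ch if bias else 0)


def conv_out_size(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def estimate_bytes(layers, in_ch, image_size, batch_size,
                   bytes_per_elem=4, optimizer_states=2, training=True):
    """Rough memory estimate for a chain of conv layers.

    layers: list of (out_ch, kernel, stride, padding) tuples (hypothetical format).
    optimizer_states: extra tensors per parameter (2 for Adam,
    1 for SGD with momentum, 0 for plain SGD).
    """
    params = 0
    ch, size = in_ch, image_size
    act_sizes = [ch * size * size]  # per-sample element counts, input first
    for out_ch, kernel, stride, padding in layers:
        params += conv2d_params(ch, out_ch, kernel)
        size = conv_out_size(size, kernel, stride, padding)
        ch = out_ch
        act_sizes.append(ch * size * size)

    if training:
        # Every layer output is kept around for the backward pass.
        act_elems = sum(act_sizes)
        copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer state
    else:
        # Assumed no_grad inference: only one layer's input and output at a time.
        act_elems = max(a + b for a, b in zip(act_sizes, act_sizes[1:]))
        copies = 1

    return {
        "params": params,
        "weights_etc": params * copies * bytes_per_elem,
        "activations": act_elems * batch_size * bytes_per_elem,
    }


# Two float32 conv layers on a 3x224x224 image, trained with Adam.
layers = [(64, 3, 1, 1), (128, 3, 2, 1)]
for bs in (1, 8, 32):
    est = estimate_bytes(layers, in_ch=3, image_size=224, batch_size=bs)
    total_mib = (est["weights_etc"] + est["activations"]) / 2**20
    print(f"batch {bs}: ~{total_mib:.1f} MiB")
```

Note how the activation term grows with both batch size and image size while the weight term does not, which is why conv models are usually limited by activations rather than parameter count.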
The chart on this page compares the parameter sizes of various pretrained vision models for PyTorch. You can use it as a rough relative comparison, but you’ll still need some trial and error, depending on how you set the batch size, image size, etc.
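For the trial-and-error part, one option is to run a single forward/backward pass and ask PyTorch for the peak allocation. This is only a sketch: `peak_memory_mib` and the toy model are names I made up, the dummy loss stands in for your real one, and optimizer state is not included. The number also covers only tensors handed out by PyTorch's allocator, so `nvidia-smi` will show more.

```python
import torch
from torch import nn


def peak_memory_mib(model, batch_size, image_size, channels=3):
    """Peak GPU memory in MiB for one forward/backward pass, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    device = torch.device("cuda")
    model = model.to(device).train()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.randn(batch_size, channels, image_size, image_size, device=device)
    model(x).sum().backward()  # dummy loss, just to trigger the backward pass
    torch.cuda.synchronize(device)
    return torch.cuda.max_memory_allocated(device) / 2**20


# Toy model as a placeholder; replace with the model you actually want to train.
toy_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

for batch_size in (1, 2, 4):
    print(batch_size, peak_memory_mib(toy_model, batch_size, image_size=224))
```

Increasing the batch size or image size until you hit an out-of-memory error, then backing off, tells you what fits on your card.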